Spark: An efficient way to test whether an RDD is empty

Question

Since RDDs have no isEmpty method, what is the most efficient way to test whether an RDD is empty?

Tobber · Accepted Answer

RDD.isEmpty() will be part of Spark 1.3.0.

Based on suggestions in this Apache mail thread and on several comments on this answer, I ran a few small local experiments. The best method is to use take(1).length == 0.

    def isEmpty[T](rdd: RDD[T]) = {
      rdd.take(1).length == 0
    }

It should run in O(1), except when the RDD is empty, in which case it is linear in the number of partitions.
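To see where the O(1) and linear cases come from, here is a minimal plain-Scala sketch (no Spark involved) that models an RDD as a sequence of partitions and counts how many partitions a take(1)-style scan has to look at. The names TakeOneModel and isEmptyByTake are hypothetical, and the model inspects partitions one at a time, which simplifies how Spark actually schedules take.

```scala
// Plain-Scala model (no Spark): each inner Seq stands in for one partition.
object TakeOneModel {
  // Returns (isEmpty, number of partitions inspected before the answer is known).
  def isEmptyByTake[T](partitions: Seq[Seq[T]]): (Boolean, Int) = {
    val firstNonEmpty = partitions.indexWhere(_.nonEmpty)
    if (firstNonEmpty >= 0) (false, firstNonEmpty + 1) // stop at the first element found
    else (true, partitions.length)                     // had to look at every partition
  }

  def main(args: Array[String]): Unit = {
    val full  = Seq.fill(100)(Seq(1L, 2L))
    val empty = Seq.fill(100)(Seq.empty[Long])
    println(isEmptyByTake(full))  // (false,1)
    println(isEmptyByTake(empty)) // (true,100)
  }
}
```

In this model the cost depends on where the first element sits: a non-empty first partition ends the scan immediately, while a fully empty dataset forces a look at all 100 partitions.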

Thanks to Josh Rosen and Nick Chammas for pointing this out.

Note: this fails if the RDD is of type RDD[Nothing], e.g. isEmpty(sc.parallelize(Seq())), but that is unlikely to be a problem in practice. isEmpty(sc.parallelize(Seq[Any]())) works fine.
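One way to stay clear of RDD[Nothing] is to always give the empty sequence an explicit element type. The following is a small sketch, assuming a local Spark installation on the classpath; the object name TypedEmptyRdd, the app name, and the local[2] master are illustrative choices, not something from the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TypedEmptyRdd {
  // The helper from the answer above.
  def isEmpty[T](rdd: RDD[T]): Boolean = rdd.take(1).length == 0

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local context, used only for this demonstration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("typed-empty-rdd")
    val sc = new SparkContext(conf)
    try {
      // An explicit element type makes this an RDD[Int] rather than an RDD[Nothing].
      val typed: RDD[Int] = sc.parallelize(Seq.empty[Int])
      println(isEmpty(typed))

      // SparkContext.emptyRDD also takes the element type explicitly.
      val viaEmptyRdd: RDD[String] = sc.emptyRDD[String]
      println(isEmpty(viaEmptyRdd))
    } finally {
      sc.stop()
    }
  }
}
```

Both calls should report true, since neither RDD holds any elements.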

Edits:

Edit 1: Added the take(1).length == 0 method, thanks to the comments.

My original suggestion: use mapPartitions.

    def isEmpty[T](rdd: RDD[T]) = {
      rdd.mapPartitions(it => Iterator(!it.hasNext)).reduce(_ && _)
    }

It should scale with the number of partitions and is not as clean as take(1). It is, however, robust to RDDs of type RDD[Nothing].
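The mapPartitions approach can be pictured with a plain-Scala sketch (no Spark involved) in which each inner Seq stands in for one partition. The names MapPartitionsModel and isEmptyByMapPartitions are hypothetical. Unlike take(1), every partition is always inspected, which is why the cost scales with the partition count.

```scala
// Plain-Scala model (no Spark): each inner Seq stands in for one partition.
object MapPartitionsModel {
  // Mirrors rdd.mapPartitions(it => Iterator(!it.hasNext)).reduce(_ && _):
  // every partition reports whether it is empty, then the flags are AND-ed.
  def isEmptyByMapPartitions[T](partitions: Seq[Seq[T]]): Boolean =
    partitions.map(p => !p.iterator.hasNext).reduce(_ && _)

  def main(args: Array[String]): Unit = {
    println(isEmptyByMapPartitions(Seq(Seq.empty[Int], Seq.empty[Int]))) // true
    println(isEmptyByMapPartitions(Seq(Seq.empty[Int], Seq(42))))        // false
  }
}
```

One caveat visible in the model: reduce on a collection with no partitions at all throws UnsupportedOperationException. As far as I know, RDD.reduce behaves the same way on an RDD with zero partitions, so this variant may need a guard in that case.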

Experiments:

I used this code for timing.

    import org.apache.spark.rdd.RDD

    // Assumes `sc` is an existing SparkContext (for example, in spark-shell).
    // Builds an RDD of the numbers 1..n in 100 partitions and times one emptiness check.
    def time(n: Long, f: RDD[Long] => Boolean): Unit = {
      val start = System.currentTimeMillis()
      val rdd = sc.parallelize(1L to n, numSlices = 100)
      val result = f(rdd)
      println("Time: " + (System.currentTimeMillis() - start) + " Result: " + result)
    }

    // Large RDD: 1,000,000,000 elements
    time(1000000000L, rdd => rdd.take(1).length == 0)
    time(1000000000L, rdd => rdd.mapPartitions(it => Iterator(!it.hasNext)).reduce(_ && _))
    time(1000000000L, rdd => rdd.count() == 0L)
    time(1000000000L, rdd => rdd.takeSample(true, 1).isEmpty)
    time(1000000000L, rdd => rdd.fold(0)(_ + _) == 0L)

    // Single-element RDD
    time(1L, rdd => rdd.take(1).length == 0)
    time(1L, rdd => rdd.mapPartitions(it => Iterator(!it.hasNext)).reduce(_ && _))
    time(1L, rdd => rdd.count() == 0L)
    time(1L, rdd => rdd.takeSample(true, 1).isEmpty)
    time(1L, rdd => rdd.fold(0)(_ + _) == 0L)

    // Empty RDD
    time(0L, rdd => rdd.take(1).length == 0)
    time(0L, rdd => rdd.mapPartitions(it => Iterator(!it.hasNext)).reduce(_ && _))
    time(0L, rdd => rdd.count() == 0L)
    time(0L, rdd => rdd.takeSample(true, 1).isEmpty)
    time(0L, rdd => rdd.fold(0)(_ + _) == 0L)

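On my local machine with 3 worker cores I got these results (times in milliseconds, listed in the same order as the calls: five for 1,000,000,000 elements, five for one element, five for the empty RDD):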

    Time: 21 Result: false
    Time: 75 Result: false
    Time: 8664 Result: false
    Time: 18266 Result: false
    Time: 23836 Result: false

    Time: 113 Result: false
    Time: 101 Result: false
    Time: 68 Result: false
    Time: 221 Result: false
    Time: 46 Result: false

    Time: 79 Result: true
    Time: 93 Result: true
    Time: 79 Result: true
    Time: 100 Result: true
    Time: 64 Result: true

marios · Answer

As of Spark 1.3, isEmpty() is part of the RDD API. A bug that caused isEmpty to fail was later fixed in Spark 1.4.

For DataFrames:

    val df: DataFrame = ...
    df.rdd.isEmpty()

Here is the code pasted from the RDD implementation (as of 1.4.1).

    /**
     * @note due to complications in the internal implementation, this method will raise an
     * exception if called on an RDD of `Nothing` or `Null`. This may be come up in practice
     * because, for example, the type of `parallelize(Seq())` is `RDD[Nothing]`.
     * (`parallelize(Seq())` should be avoided anyway in favor of `parallelize(Seq[T]())`.)
     * @return true if and only if the RDD contains no elements at all. Note that an RDD
     *         may be empty even when it has at least 1 partition.
     */
    def isEmpty(): Boolean = withScope {
      partitions.length == 0 || take(1).length == 0
    }
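The two halves of that expression correspond to two different kinds of empty RDD: one with no partitions at all, and one whose partitions exist but hold no elements. Here is a brief sketch of both cases, assuming Spark 1.3 or later on the classpath; the object name EmptyWithPartitions, the helper describe, and the local[2] master are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object EmptyWithPartitions {
  // Returns (number of partitions, result of the built-in isEmpty()).
  def describe[T](rdd: RDD[T]): (Int, Boolean) =
    (rdd.partitions.length, rdd.isEmpty())

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local context, used only for this demonstration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("empty-with-partitions")
    val sc = new SparkContext(conf)
    try {
      // The filter removes every element but keeps the 4 partitions.
      val filtered = sc.parallelize(1 to 10, numSlices = 4).filter(_ > 100)
      println(describe(filtered))     // (4,true)

      // An RDD with no partitions is caught by the first half of the check.
      val noPartitions = sc.emptyRDD[Int]
      println(describe(noPartitions)) // (0,true)
    } finally {
      sc.stop()
    }
  }
}
```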