Spark: how do I define a custom partitioner for an RDD so that every partition is the same size and holds the same number of elements?

Question

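I am new to Spark. I have a large dataset of elements [RDD] and I want to divide it into two partitions of exactly equal size while maintaining the order of the elements. I tried using something like RangePartitioner: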

    var data = partitionedFile.partitionBy(new RangePartitioner(2, partitionedFile))

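This does not give a satisfactory result: it maintains the order of the elements, but it splits them only roughly, not into exactly equal sizes. For example, with 64 elements, RangePartitioner divides them into 31 elements and 33 elements.

I need a partitioner that puts exactly the first 32 elements in one half and the second set of 32 elements in the other half. Could you please suggest how to use a customized partitioner so that I get two equally sized halves while maintaining the order of the elements?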

Daniel Darabos · Answer

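Partitioners work by assigning a key to a partition. To write a partitioner like this you would need prior knowledge of the key distribution, or you would have to look at all of the keys. This is why Spark does not provide one.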
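To illustrate why the key distribution matters, here is a minimal sketch in plain Scala (no Spark required). The object `HashSkewSketch` and its helpers are hypothetical names made up for this example; the rule shown, a non-negative modulus of the key's `hashCode`, is the usual hash-partitioning scheme and is only meant to show that a partitioner decides from one key at a time.

```scala
object HashSkewSketch {
  // Hypothetical helper: non-negative modulus of the key's hashCode,
  // the usual rule for hash partitioning. It sees one key at a time.
  def hashPartition(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Count how many keys land in each partition id.
  def sizes(keys: Seq[Any], numPartitions: Int): Map[Int, Int] =
    keys.groupBy(hashPartition(_, numPartitions)).map { case (p, ks) => (p, ks.size) }

  def main(args: Array[String]): Unit = {
    // 32 even Int keys: every one of them hashes to partition 0 of 2.
    println(sizes(0 until 64 by 2, 2))
  }
}
```

With 32 even keys and two partitions, every key lands in partition 0 and partition 1 stays empty. Balancing the output requires knowing something about the keys up front.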

In general you do not need such a partitioner. In fact, I cannot come up with a use case that requires partitions of exactly equal size. What happens if the number of elements is odd?

Anyway, let us say you have an RDD keyed by sequential Ints and you know how many there are in total. Then you could write a custom Partitioner like this:

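    import org.apache.spark.Partitioner

    class ExactPartitioner[V](
        partitions: Int,
        elements: Int)
      extends Partitioner {

      // Partitioner declares numPartitions as abstract, so it must be defined.
      def numPartitions: Int = partitions

      def getPartition(key: Any): Int = {
        val k = key.asInstanceOf[Int]
        // `k` is assumed to go continuously from 0 to elements-1.
        // Multiply as Long so that k * partitions cannot overflow Int.
        (k.toLong * partitions / elements).toInt
      }
    }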
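As a quick check of the arithmetic in `getPartition`, the following sketch applies the same formula in plain Scala, without Spark. `ExactSplitSketch`, `partitionOf` and `sizes` are hypothetical names introduced only for this illustration, and the keys are assumed to be the consecutive Ints 0 to elements-1.

```scala
object ExactSplitSketch {
  // Same arithmetic as getPartition: index k in [0, elements) -> partition id.
  // The product is computed as Long so it cannot overflow Int.
  def partitionOf(k: Int, partitions: Int, elements: Int): Int =
    (k.toLong * partitions / elements).toInt

  // Number of indices assigned to each partition id, in partition order.
  def sizes(partitions: Int, elements: Int): Seq[Int] = {
    val assigned = (0 until elements).map(partitionOf(_, partitions, elements))
    (0 until partitions).map(p => assigned.count(_ == p))
  }

  def main(args: Array[String]): Unit = {
    println(sizes(2, 64)) // even element count
    println(sizes(2, 65)) // odd element count
  }
}
```

For 64 elements and 2 partitions the formula sends indices 0 to 31 to partition 0 and indices 32 to 63 to partition 1, so both halves hold 32 elements. For an odd count such as 65 the split cannot be exact: partition 0 receives 33 elements and partition 1 receives 32.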

samthebest · Answer

This answer takes some inspiration from Daniel's, but provides a full implementation (using the pimp my library pattern) with an example for people's copy-and-paste needs :)

    import scala.reflect.ClassTag
    import scala.util.Random

    import org.apache.spark.Partitioner
    import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark versions
    import org.apache.spark.rdd.RDD

    import RDDConversions._

    trait RDDWrapper[T] {
      def rdd: RDD[T]
    }

    // TODO View bounds are deprecated, should use context bounds
    // ClassTag is the Scala 2.10+ replacement for ClassManifest
    case class RichPairRDD[K <% Ordered[K] : ClassTag, V: ClassTag](
        rdd: RDD[(K, V)]) extends RDDWrapper[(K, V)] {
      // Here we use a single Long to try to ensure the sort is balanced,
      // but for really large dataset, we may want to consider
      // using a Tuple of many Longs or even a GUID
      def sortByKeyGrouped(numPartitions: Int): RDD[(K, V)] =
        rdd.map(kv => ((kv._1, Random.nextLong()), kv._2)).sortByKey()
          .grouped(numPartitions).map(t => (t._1._1, t._2))
    }

    case class RichRDD[T: ClassTag](rdd: RDD[T]) extends RDDWrapper[T] {
      // Note: `size` is the number of output partitions.
      def grouped(size: Int): RDD[T] = {
        // TODO Version where withIndex is cached
        val withIndex = rdd.mapPartitions(_.zipWithIndex)

        // Element count of every input partition, ordered by partition index.
        // Counting (rather than taking the last index) also copes with empty partitions.
        val counts = withIndex
          .mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size.toLong)))
          .collect().toList.sortBy(_._1).map(_._2)

        // startValues(i) is the global index of the first element of partition i;
        // the final entry is the total number of elements.
        val startValues = counts.scan(0L)(_ + _)
        val total = startValues.last

        withIndex.mapPartitionsWithIndex((i, iter) => iter.map {
          case (value, index) => (startValues(i) + index.toLong, value)
        })
          .partitionBy(new Partitioner {
            def numPartitions: Int = size
            def getPartition(key: Any): Int =
              (key.asInstanceOf[Long] * numPartitions.toLong / total).toInt
          })
          .map(_._2)
      }
    }
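The core idea of `grouped` is to give every element a global index and then cut the index range into equal slices. The sketch below reproduces only that bookkeeping in plain Scala, without Spark, under the assumption that the element count of each input partition is already known. `GlobalIndexSketch`, `startOffsets` and `target` are hypothetical names for this illustration, and the partition sizes 3, 5 and 2 are made up.

```scala
object GlobalIndexSketch {
  // counts(i) = number of elements in input partition i.
  // result(i) = global index of the first element of partition i;
  // the last entry of the result is the total element count.
  def startOffsets(counts: Seq[Long]): Seq[Long] = counts.scan(0L)(_ + _)

  // Target partition of a global index when `total` elements are split
  // into `numPartitions` contiguous ranges.
  def target(globalIndex: Long, numPartitions: Int, total: Long): Int =
    (globalIndex * numPartitions / total).toInt

  def main(args: Array[String]): Unit = {
    val counts = Seq(3L, 5L, 2L) // hypothetical, uneven input partitions
    val starts = startOffsets(counts)
    val total = starts.last
    // Global index = start offset of the input partition + local index.
    val targets = for {
      (n, p) <- counts.zipWithIndex
      local <- 0L until n
    } yield target(starts(p) + local, 2, total)
    println(starts)
    println(targets)
  }
}
```

With input partitions of 3, 5 and 2 elements the start offsets are 0, 3 and 8, and the total is 10. Splitting into two output partitions then sends global indices 0 to 4 to partition 0 and indices 5 to 9 to partition 1, regardless of how unevenly the input was partitioned.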

Then in another file we have

    import scala.language.implicitConversions
    import scala.reflect.ClassTag

    import org.apache.spark.rdd.RDD

    // TODO modify above to be implicit class, rather than have implicit conversions
    object RDDConversions {
      implicit def toRichRDD[T: ClassTag](rdd: RDD[T]): RichRDD[T] =
        new RichRDD[T](rdd)

      implicit def toRichPairRDD[K <% Ordered[K] : ClassTag, V: ClassTag](
          rdd: RDD[(K, V)]): RichPairRDD[K, V] = RichPairRDD(rdd)

      implicit def toRDD[T](rdd: RDDWrapper[T]): RDD[T] = rdd.rdd
    }

Then for your use case you just want (assuming it is already sorted)

    import RDDConversions._

    yourRdd.grouped(2)

Disclaimer: not tested, I kind of just wrote it straight into the SO answer.