Spark: Programmatically creating a DataFrame schema in Scala

Question

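I have a small dataset that is the result of a Spark job. For convenience, I would like to convert it to a DataFrame at the end of the job, but I am struggling to define the schema correctly. The problem is the last field below (topValues), which is an ArrayBuffer of tuples (key and count).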

    val innerSchema = StructType(
      Array(
        StructField("value", StringType),
        StructField("count", LongType)
      )
    )

    val outputSchema = StructType(
      Array(
        StructField("name", StringType, nullable=false),
        StructField("index", IntegerType, nullable=false),
        StructField("count", LongType, nullable=false),
        StructField("empties", LongType, nullable=false),
        StructField("nulls", LongType, nullable=false),
        StructField("uniqueValues", LongType, nullable=false),
        StructField("mean", DoubleType),
        StructField("min", DoubleType),
        StructField("max", DoubleType),
        StructField("topValues", innerSchema)
      )
    )

    val result = stats.columnStats.map { c =>
      Row(c._2.name, c._1, c._2.count, c._2.empties, c._2.nulls,
          c._2.uniqueValues, c._2.mean, c._2.min, c._2.max, c._2.topValues.topN)
    }

    val rdd = sc.parallelize(result.toSeq)
    val outputDf = sqlContext.createDataFrame(rdd, outputSchema)
    outputDf.show()

The error I get is a MatchError:

    scala.MatchError: ArrayBuffer((10,2), (20,3), (8,1)) (of class scala.collection.mutable.ArrayBuffer)

Inspecting the objects in the debugger shows the following:

    rdd: ParallelCollectionRDD[2]
    rdd.data: "ArrayBuffer" size = 2
    rdd.data(0): [age,2,6,0,0,3,14.666666666666666,8.0,20.0,ArrayBuffer((10,2), (20,3), (8,1))]
    rdd.data(1): [gender,3,6,0,0,2,0.0,0.0,0.0,ArrayBuffer((M,4), (F,2))]

It seems to me that innerSchema accurately describes the ArrayBuffer of tuples, but Spark disagrees.

Any idea how I should define the schema?

David Griffin · Accepted Answer

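    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._
    import scala.collection.mutable.ArrayBuffer

    val rdd = sc.parallelize(Array(Row(ArrayBuffer(1,2,3,4))))

    // A collection-valued column is declared with ArrayType, not a bare StructType
    val df = sqlContext.createDataFrame(
      rdd,
      StructType(Seq(StructField("arr", ArrayType(IntegerType, false), false)))
    )

    df.printSchema
    root
     |-- arr: array (nullable = false)
     |    |-- element: integer (containsNull = false)

    df.show
    +------------+
    |         arr|
    +------------+
    |[1, 2, 3, 4]|
    +------------+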

Stuart · Answer

As David pointed out, I needed to use an ArrayType. Spark is happy with this:

    val outputSchema = StructType(
      Array(
        StructField("name", StringType, nullable=false),
        StructField("index", IntegerType, nullable=false),
        StructField("count", LongType, nullable=false),
        StructField("empties", LongType, nullable=false),
        StructField("nulls", LongType, nullable=false),
        StructField("uniqueValues", LongType, nullable=false),
        StructField("mean", DoubleType),
        StructField("min", DoubleType),
        StructField("max", DoubleType),
        StructField("topValues", ArrayType(StructType(Array(
          StructField("value", StringType),
          StructField("count", LongType)
        ))))
      )
    )
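For readers who want to try the ArrayType(StructType(...)) shape in isolation, here is a minimal sketch. It assumes Spark 2.x or later (it uses SparkSession rather than the sqlContext from the question) and a local master. The object name TopValuesSchemaSketch, the helper toRows, and the sample data are hypothetical stand-ins for stats.columnStats, and the keys are assumed to be strings. Each tuple is converted to a Row explicitly, which is the conservative choice because whether bare tuples are accepted for struct-typed elements may depend on the Spark version.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object TopValuesSchemaSketch {
  // Element type of the array: one (value, count) pair
  val topValueType: StructType = StructType(Seq(
    StructField("value", StringType),
    StructField("count", LongType)))

  // Reduced version of outputSchema: only the name and the nested array column
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = false),
    StructField("topValues", ArrayType(topValueType))))

  // Hypothetical helper: turn (name, topN) pairs into Rows whose array
  // elements are themselves Rows matching topValueType
  def toRows(stats: Seq[(String, Seq[(String, Long)])]): Seq[Row] =
    stats.map { case (name, topN) =>
      Row(name, topN.map { case (value, count) => Row(value, count) })
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("top-values-schema-sketch")
      .getOrCreate()

    // Made-up sample data shaped like the debugger output in the question
    val rows = toRows(Seq(
      "age" -> Seq("10" -> 2L, "20" -> 3L, "8" -> 1L),
      "gender" -> Seq("M" -> 4L, "F" -> 2L)))

    val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
    df.printSchema()
    df.show(truncate = false)

    spark.stop()
  }
}
```

If the keys in topN are not strings (the debugger output in the question does not say), the "value" field type would need to change accordingly.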

Arun Goudar · Answer

    import spark.implicits._
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions._

    val searchPath = "/path/to/.csv"
    val columns = "col1,col2,col3,col4,col5,col6,col7"

    // Build one nullable string field per column name
    val fields = columns.split(",").map(fieldName => StructField(fieldName, StringType, nullable = true))
    val customSchema = StructType(fields)

    // Read the CSV with the explicit schema instead of inferring one
    var dfPivot = spark.read.format("com.databricks.spark.csv")
      .option("header", "false")
      .option("inferSchema", "false")
      .schema(customSchema)
      .load(searchPath)

Loading the data with a custom schema is much faster than letting Spark determine the schema itself, since schema inference requires an extra pass over the input.
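The same idea also works with typed columns. The following is a minimal sketch, assuming Spark 2.x or later and its built-in CSV reader (spark.read.csv) in place of the com.databricks.spark.csv package used above. The object name ExplicitCsvSchemaSketch, the two column names, and the temporary file contents are made up for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object ExplicitCsvSchemaSketch {
  // Explicit schema with a non-string column, so no inference is needed
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("count", LongType, nullable = true)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explicit-csv-schema-sketch")
      .getOrCreate()

    // Write a tiny headerless CSV to a temporary file (illustrative data only)
    val path = Files.createTempFile("explicit-schema-sketch", ".csv")
    Files.write(path, "age,6\ngender,6\n".getBytes("UTF-8"))

    // Supplying the schema up front means Spark does not scan the file to guess types
    val df = spark.read
      .option("header", "false")
      .schema(schema)
      .csv(path.toString)

    df.printSchema()
    df.show()

    spark.stop()
  }
}
```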