How do I convert a DataFrame to a Dataset in Apache Spark with Scala?

Question

I need to convert my DataFrame to a Dataset, and I used the following code:

    val final_df = Dataframe.withColumn(
      "features",
      toVec4(
        // casting into Timestamp to parse the string, and then into Int
        $"time_stamp_0".cast(TimestampType).cast(IntegerType),
        $"count",
        $"sender_ip_1",
        $"receiver_ip_2"
      )
    ).withColumn("label", (Dataframe("count"))).select("features", "label")

    final_df.show()

    val trainingTest = final_df.randomSplit(Array(0.3, 0.7))
    val TrainingDF = trainingTest(0)
    val TestingDF = trainingTest(1)
    TrainingDF.show()
    TestingDF.show()

    // let's create our linear regression
    val lir = new LinearRegression()
      .setRegParam(0.3)
      .setElasticNetParam(0.8)
      .setMaxIter(100)
      .setTol(1E-6)

    case class df_ds(features: Vector, label: Integer)
    org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)

    val Training_ds = TrainingDF.as[df_ds]

My problem is that I get the following error:

    Error:(96, 36) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
        val Training_ds = TrainingDF.as[df_ds]

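It seems that the number of values in my DataFrame differs from the number of values in my class. However, I am using case class df_ds(features: Vector, label: Integer) on my TrainingDF DataFrame, because it has a vector of features and an integer label. Here is the TrainingDF DataFrame: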

    +--------------------+-----+
    |            features|label|
    +--------------------+-----+
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,10...|   10|
    +--------------------+-----+

And here is my original final_df DataFrame:

    +------------+-----------+-------------+-----+
    |time_stamp_0|sender_ip_1|receiver_ip_2|count|
    +------------+-----------+-------------+-----+
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.3|     10.0.0.2|   10|
    +------------+-----------+-------------+-----+

But I get the error mentioned above! Can anyone help me? Thanks in advance.

stefanobaghino · Answer

The error message you are reading is a pretty good pointer.

When you convert a DataFrame to a Dataset, you need a proper Encoder for whatever is stored in the DataFrame rows.

Encoders for primitive-like types (Ints, Strings, and so on) and case classes are provided by simply importing the implicits of your SparkSession, as follows:

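    import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

    case class MyData(intField: Int, boolField: Boolean) // e.g.

    val spark: SparkSession = ???
    val df: DataFrame = ???

    // brings the implicit encoders for primitives and case classes into scope
    import spark.implicits._

    val ds: Dataset[MyData] = df.as[MyData]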

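If that does not work either, it is because the type you are trying to cast the DataFrame to is not supported. In that case you have to write your own Encoder: you can find more information about it here and see an example (the Encoder for java.time.LocalDateTime) here.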
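Applied to the shape of data in the question, the implicit-import approach might look like the sketch below. It assumes Spark 2.x with spark-mllib on the classpath, that `Vector` means `org.apache.spark.ml.linalg.Vector`, and that the case class is declared at top level rather than inside a method. The names `LabeledRow` and `DataFrameToDatasetSketch` and the sample values are hypothetical stand-ins, not the asker's actual code, and the sketch does not claim to identify the exact cause of the original error.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical stand-in for the question's df_ds, declared at top level.
// Assumption: Vector is the spark.ml vector type.
case class LabeledRow(features: Vector, label: Int)

object DataFrameToDatasetSketch {
  // Builds a small DataFrame with the same two columns as TrainingDF
  // and converts it to a typed Dataset.
  def convert(spark: SparkSession): Dataset[LabeledRow] = {
    // Brings the implicit encoders for case classes into scope.
    import spark.implicits._

    // Made-up sample rows; the real data comes from the asker's pipeline.
    val trainingDF: DataFrame = spark.createDataFrame(Seq(
      (Vectors.dense(1.497325796e9, 19.0), 19),
      (Vectors.dense(1.497325796e9, 10.0), 10)
    )).toDF("features", "label")

    trainingDF.as[LabeledRow]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("df-to-ds-sketch")
      .getOrCreate()
    convert(spark).show()
    spark.stop()
  }
}
```

The column names and types of the DataFrame have to line up with the fields of the case class, which is why the sketch names the columns `features` and `label`.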

Shang Gao · Answer

Spark 1.6.

    case class MyCase(id: Int, name: String)

    val encoder = org.apache.spark.sql.catalyst.encoders.ExpressionEncoder[MyCase]

    val dataframe = …

    val dataset = dataframe.as(encoder)

Spark 2.0 and above

    case class MyCase(id: Int, name: String)

    val encoder = org.apache.spark.sql.Encoders.product[MyCase]

    val dataframe = …

    val dataset = dataframe.as(encoder)
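For reference, a self-contained version of the Spark 2.0+ variant could be sketched as follows. The answer leaves the construction of `dataframe` elided, so the two-row DataFrame here is a made-up stand-in, and the object name `ExplicitEncoderSketch` is hypothetical. The encoder is passed explicitly, so `spark.implicits._` is not needed for the conversion itself.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, Encoders, SparkSession}

case class MyCase(id: Int, name: String)

object ExplicitEncoderSketch {
  def convert(spark: SparkSession): Dataset[MyCase] = {
    // Explicit encoder for the case class, as in the answer above.
    val encoder: Encoder[MyCase] = Encoders.product[MyCase]

    // Stand-in data: column names and types match the fields of MyCase.
    val dataframe: DataFrame =
      spark.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name")

    // Pass the encoder explicitly instead of relying on implicits.
    dataframe.as[MyCase](encoder)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explicit-encoder-sketch")
      .getOrCreate()
    convert(spark).collect().foreach(println)
    spark.stop()
  }
}
```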