How to handle exceptions in Spark and Scala

Question

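I am trying to handle common exceptions in Spark, such as a .map operation that does not work correctly for every element of the data, or a FileNotFound exception. I have read all the existing questions and the following two posts: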

https://rcardin.github.io/big-data/Apache-spark/scala/programming/2016/09/25/try-again-Apache-spark.html

https://www.nicolaferraro.me/2016/02/18/exception-handling-in-Apache-spark

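I tried a Try statement inside the line attributes => mHealthUser(attributes(0).toDouble, attributes(1).toDouble, attributes(2).toDouble
so that it reads attributes => Try(mHealthUser(attributes(0).toDouble, attributes(1).toDouble, attributes(2).toDouble)

But that does not compile: the compiler no longer recognizes the .toDF() call that follows. I also tried a Java-like try { } catch { } block, but I cannot get the scope right, and df is then not returned. Does anyone know how to do this properly? Do I even need to handle these exceptions, given that the Spark framework already seems to deal with a FileNotFound exception without my adding a handler? I would, however, like to raise an error when, for example, the input file has the wrong number of columns for the schema.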

Here is the code:

    object DataLoadTest extends SparkSessionWrapper {
      /** Helper function to create a DataFrame from a textfile, re-used in subsequent tests */
      def createDataFrame(fileName: String): DataFrame = {
        import spark.implicits._

        //try {
        val df = spark.sparkContext
          .textFile("/path/to/file" + fileName)
          .map(_.split("\t"))
          //mHealth user is the case class which defines the data schema
          .map(attributes => mHealthUser(
            attributes(0).toDouble, attributes(1).toDouble, attributes(2).toDouble, attributes(3).toDouble,
            attributes(4).toDouble, attributes(5).toDouble, attributes(6).toDouble, attributes(7).toDouble,
            attributes(8).toDouble, attributes(9).toDouble, attributes(10).toDouble, attributes(11).toDouble,
            attributes(12).toDouble, attributes(13).toDouble, attributes(14).toDouble, attributes(15).toDouble,
            attributes(16).toDouble, attributes(17).toDouble, attributes(18).toDouble, attributes(19).toDouble,
            attributes(20).toDouble, attributes(21).toDouble, attributes(22).toDouble, attributes(23).toInt))
          .toDF()
          .cache()
        df
        } catch {
          case ex: FileNotFoundException => println(s"File $fileName not found")
          case unknown: Exception => println(s"Unknown exception: $unknown")
        }
      }

Any suggestions are appreciated. Thanks!

Neeraj Malhotra · Accepted Answer

Another option is to use the Try type in Scala.

For example:

    import java.io.FileNotFoundException
    import scala.util.{Failure, Success, Try}

    def createDataFrame(fileName: String): Try[DataFrame] = {
      try {
        //create dataframe df
        Success(df)
      } catch {
        case ex: FileNotFoundException => {
          println(s"File $fileName not found")
          Failure(ex)
        }
        case unknown: Exception => {
          println(s"Unknown exception: $unknown")
          Failure(unknown)
        }
      }
    }

Now, on the caller side, handle it like this:

    createDataFrame("file1.csv") match {
      case Success(df) => {
        // proceed with your pipeline
      }
      case Failure(ex) => //handle exception
    }

This is slightly better than using Option, because the caller knows the reason for the failure and can handle it more appropriately.
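As a minimal sketch of the same idea that runs without Spark: the `Try { ... }` constructor from the standard library catches non-fatal exceptions by itself, so the explicit try/catch can be dropped and the caller can match on the exception type. The `loadLines` and `describe` helpers and the file path are hypothetical stand-ins for `createDataFrame` and its caller, not part of the original code.

```scala
import java.io.FileNotFoundException
import scala.io.Source
import scala.util.{Failure, Success, Try}

object TryLoadSketch {
  // Hypothetical stand-in for createDataFrame: reads a text file into lines.
  // Try { ... } turns any non-fatal exception thrown inside into a Failure.
  def loadLines(path: String): Try[Vector[String]] = Try {
    val source = Source.fromFile(path)
    try source.getLines().toVector
    finally source.close()
  }

  // The caller sees the cause of the failure and decides how to react.
  def describe(path: String): String = loadLines(path) match {
    case Success(lines)                    => s"loaded ${lines.size} lines"
    case Failure(_: FileNotFoundException) => s"File $path not found"
    case Failure(other)                    => s"Unknown exception: $other"
  }

  def main(args: Array[String]): Unit =
    // Hypothetical path that is assumed not to exist.
    println(describe("/hypothetical/missing-file.tsv"))
}
```

One caveat if this is carried over to Spark: transformations are lazy, so some failures may only surface when an action runs rather than inside the wrapped block.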

Raphael Roth · Answer

Either let the exception be thrown out of the createDataFrame method (and handle it outside), or change the signature to return Option[DataFrame]:

    def createDataFrame(fileName: String): Option[DataFrame] = {
      import spark.implicits._

      try {
        val df = spark.sparkContext
          .textFile("/path/to/file" + fileName)
          .map(_.split("\t"))
          //mHealth user is the case class which defines the data schema
          .map(attributes => mHealthUser(
            attributes(0).toDouble, attributes(1).toDouble, attributes(2).toDouble, attributes(3).toDouble,
            attributes(4).toDouble, attributes(5).toDouble, attributes(6).toDouble, attributes(7).toDouble,
            attributes(8).toDouble, attributes(9).toDouble, attributes(10).toDouble, attributes(11).toDouble,
            attributes(12).toDouble, attributes(13).toDouble, attributes(14).toDouble, attributes(15).toDouble,
            attributes(16).toDouble, attributes(17).toDouble, attributes(18).toDouble, attributes(19).toDouble,
            attributes(20).toDouble, attributes(21).toDouble, attributes(22).toDouble, attributes(23).toInt))
          .toDF()
          .cache()

        Some(df)
      } catch {
        case ex: FileNotFoundException => {
          println(s"File $fileName not found")
          None
        }
        case unknown: Exception => {
          println(s"Unknown exception: $unknown")
          None
        }
      }
    }

Edit: on the caller side of createDataFrame there are several patterns. If you are processing several filenames, you can for example do:

    val dfs: Seq[DataFrame] = Seq("file1", "file2", "file3").map(createDataFrame).flatten

If you are working with a single filename, you can do:

    createDataFrame("file1.csv") match {
      case Some(df) => {
        // proceed with your pipeline
        val df2 = df.filter($"activityLabel" > 0)
          .withColumn("binaryLabel", when($"activityLabel".between(1, 3), 0).otherwise(1))
      }
      case None => println("could not create dataframe")
    }
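The two caller patterns above can be tried without Spark. The following is only a sketch: the in-memory `files` map and the `load` function are hypothetical stand-ins for real input files and for `createDataFrame`.

```scala
object OptionLoadSketch {
  // Hypothetical in-memory "file system" standing in for real input files.
  private val files: Map[String, Vector[String]] = Map(
    "file1" -> Vector("1.0\t2.0"),
    "file3" -> Vector("3.0\t4.0", "5.0\t6.0")
  )

  // Stand-in for createDataFrame: None signals "could not load".
  def load(name: String): Option[Vector[String]] = files.get(name)

  // Several names: flatMap keeps only the loads that succeeded
  // (equivalent to map followed by flatten).
  def loadAll(names: Seq[String]): Seq[Vector[String]] =
    names.flatMap(name => load(name))

  def main(args: Array[String]): Unit = {
    val loaded = loadAll(Seq("file1", "file2", "file3"))
    println(s"loaded ${loaded.size} of 3 files")

    // Single name: match on the Option, as in the answer above.
    load("file2") match {
      case Some(rows) => println(s"file2 has ${rows.size} rows")
      case None       => println("could not load file2")
    }
  }
}
```

Note that the flattening pattern silently drops the names that failed, so it fits best when a missing input is acceptable.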

Rajiv Singh · Answer

Apply a try and catch block to the DataFrame column:

    (try { $"credit.amount" } catch { case e: Exception => lit(0) }).as("credit_amount")
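For the row-level problem raised in the question (a line with a bad number or the wrong number of columns), one possible approach is to wrap the parsing of each line in `Try` and then separate the successes from the failures. The sketch below uses plain Scala collections so that it runs on its own. The `Reading` case class is a hypothetical 3-field stand-in for the 24-field `mHealthUser`, and `parse` is an illustrative helper.

```scala
import scala.util.{Failure, Success, Try}

object RowParseSketch {
  // Hypothetical 3-field record standing in for the 24-field mHealthUser.
  final case class Reading(x: Double, y: Double, label: Int)

  // Parse one tab-separated line. A bad number or a wrong column count
  // becomes a Failure instead of an exception that aborts everything.
  def parse(line: String): Try[Reading] = Try {
    val attributes = line.split("\t")
    require(attributes.length == 3, s"expected 3 columns, got ${attributes.length}")
    Reading(attributes(0).toDouble, attributes(1).toDouble, attributes(2).toInt)
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("1.0\t2.0\t3", "oops\t2.0\t3", "1.0\t2.0")
    val parsed = lines.map(parse)

    // Keep the good rows and collect the reasons for the rejected ones.
    val good = parsed.collect { case Success(reading) => reading }
    val rejected = parsed.collect { case Failure(error) => error.getMessage }

    println(s"${good.size} good rows, ${rejected.size} rejected")
    rejected.foreach(println)
  }
}
```

On an RDD the same idea would presumably be `.flatMap(line => parse(line).toOption)` before `.toDF()`, so that the element type is the case class again rather than a `Try`. A missing encoder for `Try` is likely why the attempt in the question did not compile, but that part is an assumption and is not exercised here.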