
Structured Streaming will not write the DF to the file sink, citing that /_spark_metadata/9.compact does not exist

I am building a Kafka ingest module on EMR 5.11.1 with Spark 2.2.1. My intention is to use Structured Streaming to consume from a Kafka topic, do some processing, and store the result to EMRFS/S3 in parquet format.

The console sink works as expected; the file sink does not.

In spark-shell:

val event = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", <server list>)
.option("subscribe", <topic>)
.load()

val eventdf = event.select($"value" cast "string" as "json")
.select(from_json($"json", readSchema) as "data")
.select("data.*")

val outputdf = <some processing on eventdf>
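
For reference, readSchema above is the schema of the incoming JSON. A minimal sketch of how it might be defined (the field names here are hypothetical placeholders, not the actual schema of my topic):

import org.apache.spark.sql.types._

// Hypothetical schema for the Kafka JSON payload; field names are placeholders only.
val readSchema = StructType(Seq(
  StructField("event_id", StringType),
  StructField("event_time", TimestampType),
  StructField("payload", StringType)
))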

This works:

val console_query = outputdf.writeStream.format("console")
.trigger(Trigger.ProcessingTime(10.seconds))
.outputMode(OutputMode.Append)
.start 

This does not:

val filesink_query = outputdf.writeStream
.partitionBy(<some column>)
.format("parquet")
.option("path", <some path in EMRFS>)
.option("checkpointLocation", "/tmp/ingestcheckpoint")
.trigger(Trigger.ProcessingTime(10.seconds))
.outputMode(OutputMode.Append)
.start //fails

Things I have tried that did not work:

  1. sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
  2. Changing the format to CSV instead of parquet
  3. Changing the output mode to complete (only append is supported anyway)
  4. Different trigger intervals
  5. .option("failOnDataLoss", false) on the readStream

Digging through the source code brought me here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala where a default is supposed to kick in when no .compact file is present.
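
To make the compaction arithmetic concrete, here is a simplified sketch of the logic in CompactibleFileStreamLog (not the actual Spark code), assuming the default spark.sql.streaming.fileSink.log.compactInterval of 10:

// Every compactInterval-th batch writes an N.compact file that folds in all earlier entries.
val compactInterval = 10  // default for spark.sql.streaming.fileSink.log.compactInterval

def isCompactionBatch(batchId: Long): Boolean = (batchId + 1) % compactInterval == 0

// Metadata-log batches a compaction batch expects to read before writing its own compact file:
def batchesToRead(compactionBatchId: Long): Seq[Long] =
  (compactionBatchId - compactInterval) until compactionBatchId

// batchesToRead(19) covers batches 9 through 18, where batch 9 must be the 9.compact file;
// if 9.compact was cleaned up or never written, the job aborts with the error shown below.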

So I also tried spark.conf.set("spark.sql.streaming.fileSink.log.cleanupDelay", 60000) to make sure the metadata of old batches is not deleted before a new batch creates the combined metadata file.

What makes this error frustrating is that it is not always reproducible: without changing a single character of the code, the parquet write sometimes works and sometimes does not. I have tried cleaning the checkpoint location, the spark/hdfs logs, etc., in case some internal Spark "state" was causing the problem.

Here is the error stack trace:

query: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@56122c1

18/04/09 20:20:04 ERROR FileFormatWriter: Aborting job null.
java.lang.IllegalStateException: history/1523305060336/_spark_metadata/9.compact doesn't exist when compacting batch 19 (compactInterval: 10)
        at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$4$$anonfun$apply$1.apply(CompactibleFileStreamLog.scala:174)
        at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$4$$anonfun$apply$1.apply(CompactibleFileStreamLog.scala:174)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$4.apply(CompactibleFileStreamLog.scala:173)
        at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$4.apply(CompactibleFileStreamLog.scala:172)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.NumericRange.foreach(NumericRange.scala:73)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.compact(CompactibleFileStreamLog.scala:172)
        at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.add(CompactibleFileStreamLog.scala:156)
        at org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol.commitJob(ManifestFileCommitProtocol.scala:64)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:207)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)
        at org.apache.spark.sql.execution.streaming.FileStreamSink.addBatch(FileStreamSink.scala:123)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply$mcV$sp(StreamExecution.scala:666)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:666)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:666)
        at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
        at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
        at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:665)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:306)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
        at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
        at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:294)
        at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
        at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:290)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:206)
18/04/09 20:20:04 ERROR StreamExecution: Query [id = 5251fe93-2b6b-4dff-bec3-7801dc7e6417, runId = 083547c1-69b7-40e7-8bf9-3c3af11d4c31] terminated with error
org.apache.spark.SparkException: Job aborted.
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:213)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)
        at org.apache.spark.sql.execution.streaming.FileStreamSink.addBatch(FileStreamSink.scala:123)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply$mcV$sp(StreamExecution.scala:666)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:666)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:666)
        at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
        at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
        at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:665)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:306)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
        at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
        at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:294)
        at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
        at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:290)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:206)
Caused by: java.lang.IllegalStateException: history/1523305060336/_spark_metadata/9.compact doesn't exist when compacting batch 19 (compactInterval: 10)
        at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$4$$anonfun$apply$1.apply(CompactibleFileStreamLog.scala:174)
        at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$4$$anonfun$apply$1.apply(CompactibleFileStreamLog.scala:174)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$4.apply(CompactibleFileStreamLog.scala:173)
        at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$4.apply(CompactibleFileStreamLog.scala:172)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.NumericRange.foreach(NumericRange.scala:73)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.compact(CompactibleFileStreamLog.scala:172)
        at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.add(CompactibleFileStreamLog.scala:156)
        at org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol.commitJob(ManifestFileCommitProtocol.scala:64)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:207)
        ... 20 more
12
maverik

It turns out that S3 does not support the read-after-write semantics needed by Spark checkpointing.

This article recommends using AWS EFS for checkpointing.

S3 is still a good place to ingest data from, or to write final data to.
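
In other words, keep the checkpoint (and its metadata log) on a file system with consistent read-after-write semantics, such as HDFS or an EFS mount, while the data itself can still land in S3. A minimal sketch based on the question's snippet; the bucket and HDFS paths here are examples, not from the original post:

import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

val filesink_query = outputdf.writeStream
  .format("parquet")
  .option("path", "s3://my-bucket/history/")                     // data can still go to S3/EMRFS (example path)
  .option("checkpointLocation", "hdfs:///tmp/ingestcheckpoint")  // checkpoint on HDFS or an EFS mount, not S3 (example path)
  .trigger(Trigger.ProcessingTime(10.seconds))
  .outputMode(OutputMode.Append)
  .start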

2
Benedetto

I resolved this issue by clearing the checkpoint path:

  1. Delete the checkpoint path:

    sudo -u hdfs hdfs dfs -rmr ${your_checkpoint_path}

  2. Resubmit the spark job.
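
The same cleanup can also be done from spark-shell via the Hadoop FileSystem API (a sketch; use whatever path you passed as checkpointLocation). Note that deleting the checkpoint discards the query's stored offsets and progress, so the restarted query begins from its configured starting offsets.

import org.apache.hadoop.fs.{FileSystem, Path}

// Same path as the one passed to .option("checkpointLocation", ...)
val checkpointDir = new Path("/tmp/ingestcheckpoint")
val fs = checkpointDir.getFileSystem(sc.hadoopConfiguration)
if (fs.exists(checkpointDir)) fs.delete(checkpointDir, true)  // recursive delete, like hdfs dfs -rmr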

0
user2894829