Spark SQL SaveMode.Overwrite: getting java.io.FileNotFoundException and requiring 'REFRESH TABLE tableName'

Question

In Spark SQL, how can I fetch data from a folder in HDFS, make some modifications, and save the updated data back to the same HDFS folder using SaveMode.Overwrite, without getting a FileNotFoundException?

import org.apache.spark.sql.{SparkSession, SaveMode}
import org.apache.spark.SparkConf

val sparkConf: SparkConf = new SparkConf()
val sparkSession = SparkSession.builder.config(sparkConf).getOrCreate()

val df = sparkSession.read.parquet("hdfs://xxx.xxx.xxx.xxx:xx/test/d=2017-03-20")
val newDF = df.select("a", "b", "c")

// Writing back to the directory that was just read:
newDF.write.mode(SaveMode.Overwrite)
  .parquet("hdfs://xxx.xxx.xxx.xxx:xx/test/d=2017-03-20") // doesn't work

// Writing to a different directory:
newDF.write.mode(SaveMode.Overwrite)
  .parquet("hdfs://xxx.xxx.xxx.xxx:xx/test/d=2017-03-21") // works

The FileNotFoundException occurs when I read data from the HDFS directory "d=2017-03-20" and save the updated data (SaveMode.Overwrite) to the same HDFS directory "d=2017-03-20".

Caused by: org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://xxx.xxx.xxx.xxx:xx/test/d=2017-03-20/part-05020-35ea100f-829e-43d9-9003061-1788904de770.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:157)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
    ... 8 more

The following attempts all produce the same error. How can I solve this problem using Spark SQL? Thank you!

val hdfsDirPath = "hdfs://xxx.xxx.xxx.xxx:xx/test/d=2017-03-20"
val df = sparkSession.read.parquet(hdfsDirPath)
val newdf = df
newdf.write.mode(SaveMode.Overwrite).parquet(hdfsDirPath)

or

val df = sparkSession.read.parquet(hdfsDirPath)
df.createOrReplaceTempView("orgtable")
sparkSession.sql("SELECT * from orgtable").createOrReplaceTempView("tmptable")
sparkSession.sql("TRUNCATE TABLE orgtable")
sparkSession.sql("INSERT INTO orgtable SELECT * FROM tmptable")
val newdf = sparkSession.sql("SELECT * FROM orgtable")
newdf.write.mode(SaveMode.Overwrite).parquet(hdfsDirPath)

or

val df = sparkSession.read.parquet(hdfsDirPath)
df.createOrReplaceTempView("orgtable")
sparkSession.sql("SELECT * from orgtable").createOrReplaceTempView("tmptable")
sparkSession.sql("REFRESH TABLE orgtable")
sparkSession.sql("ALTER VIEW tmptable RENAME TO orgtable")
val newdf = sparkSession.sql("SELECT * FROM orgtable")
newdf.write.mode(SaveMode.Overwrite).parquet(hdfsDirPath)

廖梓帆 · Answer

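I solved this. First, I write the DataFrame to a temporary directory, then delete the source directory I was reading from, and finally rename the temporary directory to the source name. QAQ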
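The three steps above could be sketched roughly as follows. This is only a minimal sketch, not the answerer's actual code: the object name `OverwriteViaStaging`, the helper `overwriteInPlace`, and the staging path are assumptions made for illustration, and the delete and rename use the Hadoop `FileSystem` API.

```scala
import java.io.IOException

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object OverwriteViaStaging {

  /** Reads `src`, applies `transform`, and replaces `src` with the result.
    * `staging` is a hypothetical scratch directory on the same file system.
    */
  def overwriteInPlace(spark: SparkSession, src: String, staging: String)(
      transform: DataFrame => DataFrame): Unit = {
    val srcPath = new Path(src)
    val stagingPath = new Path(staging)

    // Step 1: write the transformed data somewhere other than the source,
    // so the lazy read of `src` can still find all of its files.
    transform(spark.read.parquet(src))
      .write.mode(SaveMode.Overwrite).parquet(staging)

    val fs = srcPath.getFileSystem(spark.sparkContext.hadoopConfiguration)

    // Step 2: delete the source directory (recursive = true).
    if (!fs.delete(srcPath, true))
      throw new IOException(s"Could not delete $src")

    // Step 3: move the staging directory to the source name.
    if (!fs.rename(stagingPath, srcPath))
      throw new IOException(s"Could not rename $staging to $src")
  }
}
```

With the paths from the question, the call might look like `OverwriteViaStaging.overwriteInPlace(sparkSession, hdfsDirPath, stagingDir)(_.select("a", "b", "c"))`, where `stagingDir` is any directory you are allowed to write to. Note that the swap is not atomic: between the delete and the rename, readers of the source directory see no data.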

uh_big_mike_boi · Answer

Why not just cache it after you read it? Saving to another directory and then moving the directory might require extra permissions. I have also been forcing an action, such as show().

val myDF = spark.read.format("csv")
  .option("header", "false")
  .option("delimiter", ",")
  .load("/directory/tofile/")

myDF.cache()
myDF.show(2) // force an action so the cache gets populated
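One caveat worth sketching: show(2) only computes as many partitions as it needs to display two rows, so the cache may be only partly filled when the overwrite starts. A variant that forces a full scan with count() might look like the following. The names `MaterializeCache` and `cacheFully` are made up for this illustration, and even a fully populated cache is best-effort: if cached blocks are evicted or an executor is lost, Spark recomputes them from the original files, which may no longer exist.

```scala
import org.apache.spark.sql.DataFrame

object MaterializeCache {

  /** Marks `df` as cached and runs an action that touches every partition,
    * so the whole DataFrame is materialized before the source is overwritten.
    */
  def cacheFully(df: DataFrame): DataFrame = {
    val cached = df.cache()
    cached.count() // full scan, unlike show(2), which reads only a few rows
    cached
  }
}
```

Usage would be along the lines of `val myDF = MaterializeCache.cacheFully(spark.read.format("csv").load("/directory/tofile/"))`, followed by the overwrite.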

Jake · Answer

val dfOut = df.filter(r => r.getAs[Long]("dsctimestamp") > (System.currentTimeMillis() - 1800000))

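In the line of code above, df had an underlying Hadoop partition. Once I had made this transformation (that is, to dfOut), I could not find a way to delete, rename, or overwrite the underlying partition until dfOut had been garbage collected.

My solution was to keep the old partition, create a new partition for dfOut, flag the new partition as current, and then delete the old partition some time later, after dfOut had been garbage collected.

This may not be an ideal solution. I would love to learn a less tortuous way of addressing this issue. But it works.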
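The answer does not say how the new partition is flagged as current, so the following is only one possible sketch under assumptions of my own: each write goes to a fresh, timestamped directory under a base path, and a small `_CURRENT` marker file records which directory is live. The names `VersionedPartition`, `publish`, `currentPath`, and the marker file are all hypothetical.

```scala
import java.nio.charset.StandardCharsets

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

import scala.io.Source

object VersionedPartition {
  // Hypothetical marker file that names the live version directory.
  private val MarkerName = "_CURRENT"

  /** Writes `df` to a new directory under `base` and flags it as current.
    * Older version directories are left untouched. Returns the new path.
    */
  def publish(spark: SparkSession, df: DataFrame, base: String): String = {
    val versionDir = s"v${System.currentTimeMillis()}"
    df.write.mode(SaveMode.ErrorIfExists).parquet(s"$base/$versionDir")

    val marker = new Path(base, MarkerName)
    val fs = marker.getFileSystem(spark.sparkContext.hadoopConfiguration)
    val out = fs.create(marker, true) // overwrite any previous marker
    try out.write(versionDir.getBytes(StandardCharsets.UTF_8))
    finally out.close()
    s"$base/$versionDir"
  }

  /** Returns the directory currently flagged as live under `base`. */
  def currentPath(spark: SparkSession, base: String): String = {
    val marker = new Path(base, MarkerName)
    val fs = marker.getFileSystem(spark.sparkContext.hadoopConfiguration)
    val in = fs.open(marker)
    try s"$base/${Source.fromInputStream(in, "UTF-8").mkString.trim}"
    finally in.close()
  }
}
```

Readers would load `spark.read.parquet(VersionedPartition.currentPath(spark, base))` rather than the base directory itself, and a separate cleanup job could remove stale version directories later with `fs.delete(path, true)`.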

Sarath Avanavu · Answer

I faced a similar issue. I was writing a DataFrame to a Hive table using the code below.

dataframe.write.mode("overwrite").saveAsTable("mydatabase.tablename")

When I tried to query this table, I got the same error. I then added the following line of code after creating the table to refresh it, which solved the issue.

spark.catalog.refreshTable("mydatabase.tablename")
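Put together, the write, refresh, and query sequence might look like this. It is a minimal sketch: the object and method names are illustrative, and it covers only the case in this answer, where the data is written to a catalog table and queried afterwards.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object OverwriteAndRefresh {

  /** Overwrites `table` with `df`, invalidates Spark's cached metadata and
    * file listing for that table, and returns a fresh DataFrame over it.
    */
  def overwriteTable(spark: SparkSession, df: DataFrame, table: String): DataFrame = {
    df.write.mode("overwrite").saveAsTable(table)
    spark.catalog.refreshTable(table) // drop stale cached file references
    spark.table(table)
  }
}
```

For example, `OverwriteAndRefresh.overwriteTable(spark, dataframe, "mydatabase.tablename")` returns a DataFrame that reads the newly written files.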