
Writing an RDD as a text file using Apache Spark

I am exploring Spark for batch processing. I am running Spark on my local machine in standalone mode.

I am trying to save a Spark RDD as a single file [the final output] using the saveAsTextFile() method, but it is not working.

For example, when the RDD has multiple partitions, how can I get a single file as the final output?

Update:

I tried the approaches below, but I get a null pointer exception.

person.coalesce(1).toJavaRDD().saveAsTextFile("C://Java_All//output");
person.repartition(1).toJavaRDD().saveAsTextFile("C://Java_All//output");

The exception is:

15/06/23 18:25:27 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/06/23 18:25:27 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/06/23 18:25:27 INFO deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
15/06/23 18:25:27 INFO deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
15/06/23 18:25:27 INFO deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
15/06/23 18:25:27 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NullPointerException
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
    at org.apache.hadoop.util.Shell.run(Shell.java:379)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1104)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
15/06/23 18:25:27 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
    at org.apache.hadoop.util.Shell.run(Shell.java:379)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1104)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

15/06/23 18:25:27 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
15/06/23 18:25:27 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/06/23 18:25:27 INFO TaskSchedulerImpl: Cancelling stage 1
15/06/23 18:25:27 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at TestSpark.java:40) failed in 0.249 s
15/06/23 18:25:28 INFO DAGScheduler: Job 0 failed: saveAsTextFile at TestSpark.java:40, took 0.952286 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
    at org.apache.hadoop.util.Shell.run(Shell.java:379)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1104)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/06/23 18:25:28 INFO SparkContext: Invoking stop() from shutdown hook
15/06/23 18:25:28 INFO SparkUI: Stopped Spark web UI at http://10.37.145.179:4040
15/06/23 18:25:28 INFO DAGScheduler: Stopping DAGScheduler
15/06/23 18:25:28 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/06/23 18:25:28 INFO Utils: path = C:\Users\crh537\AppData\Local\Temp\spark-a52371d8-ae6a-4567-b759-0a6c66c1908c\blockmgr-4d17a5b4-c8f8-4408-af07-0e88239794e8, already present as root for deletion.
15/06/23 18:25:28 INFO MemoryStore: MemoryStore cleared
15/06/23 18:25:28 INFO BlockManager: BlockManager stopped
15/06/23 18:25:28 INFO BlockManagerMaster: BlockManagerMaster stopped
15/06/23 18:25:28 INFO SparkContext: Successfully stopped SparkContext
15/06/23 18:25:28 INFO Utils: Shutdown hook called

Regards, Shankar

10 · Shankar

You can use the coalesce method to save into a single file. This way your code will look like this:

val myFile = sc.textFile("file.txt")
val finalRdd = doStuff(myFile)
finalRdd.coalesce(1).saveAsTextFile("newfile")

There is another method, repartition, that does the same thing; however, it always triggers an expensive full shuffle, whereas coalesce tries to avoid one by merging existing partitions.
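If pulling everything into one partition is too heavy for the job, another option is to let saveAsTextFile write its usual part-* files and concatenate them on the driver afterwards. The class and method names below are illustrative, not from the thread; this is a minimal stdlib-only sketch assuming a local output directory:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MergeParts {
    // Concatenates all part-* files produced by saveAsTextFile into a
    // single target file, in partition order (part-00000, part-00001, ...).
    public static void merge(Path outputDir, Path target) throws IOException {
        List<Path> parts;
        try (Stream<Path> listing = Files.list(outputDir)) {
            parts = listing
                .filter(p -> p.getFileName().toString().startsWith("part-"))
                .sorted()  // lexicographic order matches partition order
                .collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out);  // append each part file verbatim
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a saveAsTextFile output directory with two partitions.
        Path dir = Files.createTempDirectory("spark-out");
        Files.writeString(dir.resolve("part-00000"), "a\nb\n");
        Files.writeString(dir.resolve("part-00001"), "c\n");
        Path merged = dir.resolve("merged.txt");
        merge(dir, merged);
        // merged.txt now contains a, b, c on separate lines
        System.out.print(Files.readString(merged));
    }
}
```

This keeps the parallel write, at the cost of a second pass over the data on the driver's filesystem.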

10 · Maksud

Are you running this on Windows? If so, you need to add the following line:

System.setProperty("hadoop.home.dir", "C:\\winutil\\")

You can download winutils from the following link:

http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe
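The important detail is that the property must be set before the first Spark/Hadoop class is initialized, since Hadoop's Shell utility resolves the home directory once. A minimal sketch (class name is illustrative, and it assumes winutils.exe sits under C:\winutil\bin):

```java
public class WinutilsSetup {
    public static void main(String[] args) {
        // Must run before any Spark/Hadoop code, so that Hadoop's Shell
        // utility can locate C:\winutil\bin\winutils.exe.
        System.setProperty("hadoop.home.dir", "C:\\winutil\\");

        // ... then create the JavaSparkContext and call saveAsTextFile as usual.
        System.out.println(System.getProperty("hadoop.home.dir"));
    }
}
```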

12 · Harvinder Singh

Spark uses the Hadoop file system internally, so when it tries to read from or write to the local filesystem, it first looks for a HADOOP_HOME configuration directory containing bin\winutils.exe. You probably have not set this, which is why it throws the null pointer exception.
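A quick way to verify this diagnosis before running the job is to check both places Hadoop consults: the hadoop.home.dir system property first, then the HADOOP_HOME environment variable. The helper below is a hypothetical sketch (class and method names are mine, not from Hadoop):

```java
import java.nio.file.Files;
import java.nio.file.Paths;

public class HadoopHomeCheck {
    // Returns the configured Hadoop home, or null if neither the system
    // property nor the environment variable is set (the NPE scenario).
    static String hadoopHome() {
        String home = System.getProperty("hadoop.home.dir");
        if (home == null) {
            home = System.getenv("HADOOP_HOME");
        }
        return home;
    }

    public static void main(String[] args) {
        String home = hadoopHome();
        if (home == null) {
            System.out.println("HADOOP_HOME not configured - saveAsTextFile will NPE on Windows");
        } else {
            boolean hasWinutils = Files.exists(Paths.get(home, "bin", "winutils.exe"));
            System.out.println("winutils.exe present: " + hasWinutils);
        }
    }
}
```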

0 · Arjun gangineni