web-dev-qa-db-ja.com

Why does my Spark job fail with "exit code: 52"?

My Spark job failed with a trace like the following:

./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-Container id: container_1455622885057_0016_01_000008
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-Exit code: 52
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr:Stack trace: ExitCodeException exitCode=52: 
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at org.apache.hadoop.util.Shell.run(Shell.java:456)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at java.util.concurrent.FutureTask.run(FutureTask.java:262)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at java.lang.Thread.run(Thread.java:745)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-Container exited with a non-zero exit code 52

It took me a while to figure out what "exit code 52" means, so I'm putting it here for others who end up searching for it.

Answer by Virgil:

Exit code 52 comes from org.apache.spark.util.SparkExitCode, which defines val OOM = 52; in other words, it signals an OutOfMemoryError. That makes sense, because I also found this in the container logs:

16/02/16 17:09:59 ERROR executor.Executor: Managed memory leak detected; size = 4823704883 bytes, TID = 3226
16/02/16 17:09:59 ERROR executor.Executor: Exception in task 26.0 in stage 2.0 (TID 3226)
java.lang.OutOfMemoryError: Unable to acquire 1248 bytes of memory, got 0
        at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
        at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:354)
        at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:375)
        at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
        at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
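For reference, the constant in question lives in Spark's internal SparkExitCode object. A paraphrased sketch of the relevant part (not the full source, which defines several other exit codes as well):

```scala
// Paraphrased sketch of org.apache.spark.util.SparkExitCode (Spark 1.x);
// only the constant relevant to this question is shown.
private[spark] object SparkExitCode {
  /** The uncaught exception was an OutOfMemoryError, so the executor
   *  JVM exits with this code, which YARN then reports. */
  val OOM = 52
}
```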

(Whether the problem is in my own code or in a Tungsten memory leak, I can't tell yet, but that's a separate question.)
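A common first response to this failure mode is to give each executor more heap and more off-heap overhead before resubmitting the job. A minimal sketch, assuming a Spark 1.x application on YARN; the property values here are illustrative, not taken from the question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sizes; tune to your cluster and workload.
val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")
  .set("spark.executor.memory", "8g")                // JVM heap per executor
  .set("spark.yarn.executor.memoryOverhead", "2048") // off-heap overhead in MB (Spark 1.x property name)

val sc = new SparkContext(conf)
```

Increasing the number of shuffle partitions (for example via repartition, so each task holds less data in memory at once) is another common way to avoid executor OOMs without raising memory limits.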
