
Apache Spark Codegen Stage grows beyond 64 KB

When I run feature engineering on 30+ columns to create about 200+ more columns, I run into this error. The job does not fail, but the error keeps appearing. I would like to know how to avoid it.

Spark - 2.3.1, Python - 3.6

Cluster configuration - 1 master: 32 GB RAM, 16 cores; 4 slaves: 16 GB RAM, 8 cores each

Input data - a Parquet file with snappy compression, in 8 partitions.

My spark-submit ->

spark-submit --master spark://192.168.60.20:7077 --num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 5 --driver-memory 25G --conf spark.sql.shuffle.partitions=60 --conf spark.driver.maxResultSize=2G --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC" --conf spark.scheduler.listenerbus.eventqueue.capacity=20000 --conf spark.sql.codegen=true /appdata/bblite-codebase/pipeline_data_test_run.py > /appdata/bblite-data/logs/log_10_iter_pipeline_8_partitions_33_col.txt

Stack trace below -

ERROR CodeGenerator:91 - failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426" grows beyond 64 KB
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426" grows beyond 64 KB
    at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
    at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
    at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
    at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
    at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
    at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
    at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1417)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
    at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
    at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
    at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
    at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
    at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
    at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
    at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:579)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:578)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:92)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:128)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:119)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371)
    at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:121)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:150)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:70)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:150)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:70)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:107)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation.<init>(InMemoryRelation.scala:102)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$.apply(InMemoryRelation.scala:43)
    at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:97)
    at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:67)
    at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:91)
    at org.apache.spark.sql.Dataset.persist(Dataset.scala:2924)
    at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.codehaus.janino.InternalCompilerException: Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426" grows beyond 64 KB
7
Aakash Basu

The problem is that when the Java program Catalyst generates from code using the DataFrame and Dataset APIs is compiled into Java bytecode, the bytecode of any single method must not be 64 KB or larger. Your generated code conflicts with this limitation of the Java class file format, and that is the exception you are seeing.

To suppress the error:

spark.sql.codegen.wholeStage = "false"
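
This is a runtime SQL config, so it can also be set on an existing session (a minimal sketch, assuming a live SparkSession named spark), or passed to spark-submit as --conf spark.sql.codegen.wholeStage=false:

# Toggle whole-stage code generation in place on a running session.
spark.conf.set("spark.sql.codegen.wholeStage", "false")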

回避策:

To avoid the exception caused by the above limitation, the solution within Spark is, when Catalyst generates the Java program, to split any method whose compiled Java bytecode is likely to exceed 64 KB into multiple smaller methods.

Use persist() or other logical separation in the pipeline.
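
For illustration, a minimal sketch of such logical separation (the DataFrame df and the feature expressions are hypothetical stand-ins; it assumes a numeric input column named value):

from pyspark.sql import functions as F

# Hypothetical stand-ins for the real feature-engineering expressions.
feature_steps = [("feat_{}".format(i), F.col("value") * i) for i in range(200)]

for i, (name, expr) in enumerate(feature_steps):
    df = df.withColumn(name, expr)
    # Every 50 generated columns, persist and materialize the intermediate
    # result so the query plan (and the generated code) is cut here.
    if (i + 1) % 50 == 0:
        df = df.persist()
        df.count()  # persist is lazy; an action is needed to materialize it

The interval is a tuning knob; the point is simply to keep any single codegen stage from accumulating hundreds of expressions.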

8
vaquar khan

I solved this error by adding "checkpoints" to my code.

Checkpoint = you write the DataFrame (the data) back to disk (S3 in my case) and read it back into a fresh DataFrame. Doing so empties the JVM Spark containers so that execution continues with freshly generated code.
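
A minimal sketch of that write-and-read-back round-trip (the path is hypothetical; it assumes a live session spark and a writable HDFS/S3 location):

# Hypothetical checkpoint location; replace with a path you can write to.
checkpoint_path = "s3a://my-bucket/tmp/pipeline_checkpoint"

# Write the intermediate DataFrame to disk and read it back, so the new
# DataFrame starts from the files with a fresh lineage and fresh codegen.
df.write.mode("overwrite").parquet(checkpoint_path)
df = spark.read.parquet(checkpoint_path)

# Spark's built-in eager checkpoint achieves a similar lineage cut:
# spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
# df = df.checkpoint()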

More details on checkpointing:

https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/6-CacheAndCheckpoint.md

0
Mandeep Singh

As Vaquar wrote, introducing logical separation into the pipeline should help.

One way to cut the lineage and introduce a break into the plan seems to be a DF -> RDD -> DF round-trip conversion:

df = spark_session.createDataFrame(df.rdd, schema=df.schema)

The book High Performance Spark further suggests that it is better (faster) to do this via the underlying Java RDD, i.e. obtaining j_rdd = df._jdf.toJavaRDD() and its schema j_schema = df._jdf.schema(), building a new Java DataFrame from them, and finally converting that back into a PySpark DataFrame:

from pyspark.sql import DataFrame

j_rdd = df._jdf.toJavaRDD()
j_schema = df._jdf.schema()
sql_ctx = df.sql_ctx
java_sql_context = sql_ctx._jsqlContext
new_java_df = java_sql_context.createDataFrame(j_rdd, j_schema)
new_df = DataFrame(new_java_df, sql_ctx)
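
Note that _jdf and sql_ctx are private PySpark internals rather than public API, so this faster variant may break between Spark releases; the plain createDataFrame round-trip above is the safer option.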
0
Ferrard

If you are using PySpark 2.3+, you can disable whole-stage code generation when building the session by adding the .config(...) line below:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('tow-way') \
    .config('spark.sql.codegen.wholeStage', 'false') \
    .getOrCreate()
0
Dustin Sun