
Spark + S3 - error - java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

I have a Spark EC2 cluster to which I submit PySpark programs from a Zeppelin notebook. I downloaded hadoop-aws-2.7.3.jar and aws-java-sdk-1.11.179.jar and placed them in the /opt/spark/jars directory of the Spark instance. I get the error: java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException

Why doesn't Spark see the jars? Do I have to place the jars on every slave and specify them in spark-defaults.conf on the master and the slaves as well? Is there something that needs to be configured in Zeppelin so it recognizes the new jar files?

I placed the jar files in /opt/spark/jars on the Spark master. I created spark-defaults.conf and added these lines:

    spark.hadoop.fs.s3a.access.key     [ACCESS KEY]
    spark.hadoop.fs.s3a.secret.key     [SECRET KEY]
    spark.hadoop.fs.s3a.impl           org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.driver.extraClassPath        /opt/spark/jars/hadoop-aws-2.7.3.jar:/opt/spark/jars/aws-java-sdk-1.11.179.jar
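One detail worth checking here, since the lost task in the trace below fails on an executor: spark.driver.extraClassPath only affects the driver JVM. Spark has a separate spark.executor.extraClassPath property for the executors, and spark.jars can distribute local jars to them automatically. A minimal sketch of the extra conf lines, assuming the same /opt/spark/jars paths exist on every node:

    # executors have their own classpath setting; spark.jars takes a comma-separated list
    spark.executor.extraClassPath      /opt/spark/jars/hadoop-aws-2.7.3.jar:/opt/spark/jars/aws-java-sdk-1.11.179.jar
    spark.jars                         /opt/spark/jars/hadoop-aws-2.7.3.jar,/opt/spark/jars/aws-java-sdk-1.11.179.jar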

There is a Zeppelin interpreter that submits the Spark jobs to the Spark master.

I also placed the jars in /opt/spark/jars on the slaves, but did not create a spark-defaults.conf there.

    %spark.pyspark

    # import the necessary libraries
    from pyspark import SparkContext
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import *
    from pyspark.sql.types import StringType
    from pyspark.sql import SQLContext
    from itertools import islice
    from pyspark.sql.functions import col

    # add AWS credentials
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "[ACCESS KEY]")
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "[SECRET KEY]")
    sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

    # create the context
    sqlContext = SQLContext(sc)

    # read the first csv file and store it in an RDD
    rdd1 = sc.textFile("s3a://filepath/baby-names.csv").map(lambda line: line.split(","))

    # remove the first row as it contains the header
    rdd1 = rdd1.mapPartitionsWithIndex(
        lambda idx, it: islice(it, 1, None) if idx == 0 else it
    )

    # convert the RDD into a DataFrame
    df1 = rdd1.toDF(['year', 'name', 'percent', 'sex'])

    # print the DataFrame
    df1.show()
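One mismatch worth flagging in the snippet above: the file is read through the s3a:// scheme, but the credentials are set under the older s3n property names. The S3A connector looks up fs.s3a.access.key and fs.s3a.secret.key instead, so an s3a-style sketch of those two lines (same placeholder keys) would be:

    # S3A reads its own property names, not the fs.s3n.* ones
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "[ACCESS KEY]")
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "[SECRET KEY]")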

The error that gets thrown:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, 10.11.93.90, executor 1): java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
    at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonServiceException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 34 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:153)
    at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
    at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonServiceException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 34 more
peterlandis (10 votes)

The following worked for me.

My System Config:

Ubuntu 16.04.6 LTS, Python 3.7.7, OpenJDK version 1.8.0_252, spark-2.4.5-bin-hadoop2.7

  1. Set the PYSPARK_PYTHON path: add the following line to $SPARK_HOME/conf/spark-env.sh:

    PYSPARK_PYTHON=python_env_path/bin/python

  2. Start pyspark:

    pyspark --packages com.amazonaws:aws-java-sdk-pom:1.11.760,org.apache.hadoop:hadoop-aws:2.7.0 --conf spark.hadoop.fs.s3a.endpoint=s3.us-west-2.amazonaws.com

    com.amazonaws:aws-java-sdk-pom:1.11.760 depends on your JDK version, hadoop-aws:2.7.0 depends on your Hadoop version, and s3.us-west-2.amazonaws.com depends on your S3 location.

  3. Read the data from S3:

    df2 = spark.read.parquet("s3a://s3location_file_path")
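The same setup can also be driven from a script or notebook instead of the pyspark shell by passing the coordinates through the spark.jars.packages property; this only takes effect if it is set before the first SparkSession (and its JVM) starts. A minimal sketch under that assumption, reusing the coordinates and endpoint from this answer (the app name is a made-up placeholder):

    from pyspark.sql import SparkSession

    # spark.jars.packages is only honored when no JVM has been started yet
    spark = (
        SparkSession.builder
        .appName("s3a-read")  # hypothetical name
        .config("spark.jars.packages",
                "com.amazonaws:aws-java-sdk-pom:1.11.760,org.apache.hadoop:hadoop-aws:2.7.0")
        .config("spark.hadoop.fs.s3a.endpoint", "s3.us-west-2.amazonaws.com")
        .getOrCreate()
    )

    df2 = spark.read.parquet("s3a://s3location_file_path")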


Pranjal Gharat (0 votes)

If none of the above works for you, do a cat and grep for the missing class. Chances are high that the jar is corrupt. For example, if the class AmazonServiceException was not found, run a grep where the jar is already present, as shown below:

grep "AmazonServiceException" *.jar _

hellodk (0 votes)

Add the following to the file hadoop/etc/hadoop/core-site.xml:

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>***</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>***</value>
</property>
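One caveat, not part of the original answer: the fs.s3.* names above configure the older s3:// connector; the S3A filesystem named in the error message reads its own credential keys. An s3a-style sketch for the same file would be:

    <property>
      <name>fs.s3a.access.key</name>
      <value>***</value>
    </property>
    <property>
      <name>fs.s3a.secret.key</name>
      <value>***</value>
    </property>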

Search for the AWS jars inside the Hadoop installation directory; for a Mac install the directory is /usr/local/Cellar/hadoop/:

find . -type f -name "*aws*"

sudo cp hadoop/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar hadoop/share/hadoop/common/lib/
sudo cp hadoop/share/hadoop/tools/lib/hadoop-aws-2.7.5.jar hadoop/share/hadoop/common/lib/


Vishrant (0 votes)