
DataFrame.saveAsTable( "df")がテーブルを別のHDFSホストに保存するのはなぜですか?

I have configured Hive (1.13.1) with Spark (1.4.0). I can access all databases and tables from Hive, and my warehouse directory is hdfs://192.168.1.17:8020/user/hive/warehouse.

しかし、df.saveAsTable("df")関数を使用して、Spark-Shellを介して(マスターを使用して)データフレームをHiveに保存しようとすると、このエラーが発生しました。

15/07/03 14:48:59 INFO audit: ugi=user  ip=unknown-ip-addr  cmd=get_database: default
15/07/03 14:48:59 INFO HiveMetaStore: 0: get_table : db=default tbl=df
15/07/03 14:48:59 INFO audit: ugi=user  ip=unknown-ip-addr  cmd=get_table : db=default tbl=df
java.net.ConnectException: Call From bdiuser-Vostro-3800/127.0.1.1 to 192.168.1.19:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
    at org.apache.hadoop.ipc.Client.call(Client.java:1414)
    at org.apache.hadoop.ipc.Client.call(Client.java:1363)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
    at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:699)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1762)
    at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
    at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.run(commands.scala:78)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
    at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:939)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:939)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:332)
    at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:239)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
    at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:939)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:939)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:211)
    at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1517)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
    at $iwC$$iwC$$iwC.<init>(<console>:35)
    at $iwC$$iwC.<init>(<console>:37)
    at $iwC.<init>(<console>:39)
    at <init>(<console>:41)
    at .<init>(<console>:45)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
    at org.apache.hadoop.ipc.Client.call(Client.java:1381)
    ... 86 more

When this error occurred, I found that the program was trying to save the table on a different host for the HDFS connection (192.168.1.19 instead of 192.168.1.17).

I also tried the spark-shell on another worker, but got the same error.
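
The mismatch between 192.168.1.17 (the warehouse) and 192.168.1.19 (the host in the trace) can be checked from the same spark-shell. A small diagnostic sketch, assuming the default sc and sqlContext (a HiveContext) that spark-shell provides in Spark 1.4; the property names are standard Hadoop/Hive keys:

// Which NameNode does Spark's Hadoop configuration resolve to?
println(sc.hadoopConfiguration.get("fs.defaultFS"))  // "fs.default.name" on older Hadoop configs

// Which warehouse directory does the Hive side report?
sqlContext.sql("SET hive.metastore.warehouse.dir").collect().foreach(println)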

6
Kaushal

With saveAsTable, the default location that Spark saves to is controlled by the Hive metastore (based on the docs). Another option is to use saveAsParquetFile and specify the path, and then later register that path with your Hive metastore, OR to use the new DataFrameWriter interface and specify the path option: write.format(source).mode(mode).options(options).saveAsTable(tableName).
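
A hedged sketch of both options with the Spark 1.4-era API; the HDFS path, table name, and column schema below are illustrative assumptions, not values from the question:

import org.apache.spark.sql.SaveMode

// Option 1: write Parquet to an explicit HDFS path, then register that path
// with the metastore (saveAsParquetFile is the 1.4-era call, deprecated later
// in favor of df.write.parquet(...)).
df.saveAsParquetFile("hdfs://192.168.1.17:8020/user/hive/warehouse/df")
sqlContext.sql(
  "CREATE EXTERNAL TABLE df (id INT, value STRING) STORED AS PARQUET " +  // schema is assumed
  "LOCATION 'hdfs://192.168.1.17:8020/user/hive/warehouse/df'")

// Option 2: the new DataFrameWriter interface with an explicit path option.
df.write
  .format("parquet")
  .mode(SaveMode.Overwrite)
  .option("path", "hdfs://192.168.1.17:8020/user/hive/warehouse/df")
  .saveAsTable("df")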

11
Holden

Please find the example below:

val options = Map("path" -> hiveTablePath)
result.write.format("orc").partitionBy("partitiondate").options(options).mode(SaveMode.Append).saveAsTable(hiveTable)

I have explained this in a bit more detail on my blog.

21
Deepika Khera

You can write a Spark DataFrame to an existing Spark table.

Please find the example below:

df.write.mode("overwrite").saveAsTable("database.tableName")