Spark: Iterating over an HDFS directory

Question

I have a directory of directories on HDFS, and I want to iterate over the directories. Is there an easy way to do this in Spark using the SparkContext object?

Mike Park · Accepted Answer

You can use _org.apache.hadoop.fs.FileSystem_. Specifically, _FileSystem.listFiles([path], true)_, where the second argument requests a recursive listing.

And with Spark:

_FileSystem.get(sc.hadoopConfiguration).listFiles(..., true)_

Edit

It is good practice to get the FileSystem that is associated with the Path's scheme, rather than the default one:

_path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)_
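One practical detail when calling listFiles from PySpark: it returns a Hadoop RemoteIterator, which exposes hasNext() and next() but is not a java.util.Iterator, so it is safer not to rely on Py4J making it directly iterable in a Python for loop. A minimal sketch of a wrapper generator follows. The name iter_remote is hypothetical, and the commented usage assumes a live SparkContext named sc and a placeholder path.

```python
def iter_remote(remote_iterator):
    """Yield items from any object exposing hasNext()/next(),
    such as the RemoteIterator returned by FileSystem.listFiles."""
    while remote_iterator.hasNext():
        yield remote_iterator.next()


# Usage sketch (assumes a live SparkContext `sc`; not runnable without Spark):
# hadoop = sc._jvm.org.apache.hadoop
# path = hadoop.fs.Path("/some/dir")  # placeholder path
# fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
# for status in iter_remote(fs.listFiles(path, True)):
#     print(status.getPath())
```

Because it is a generator, entries are pulled from the iterator one at a time instead of being collected into a list first, which may matter for very large directory trees.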

Tagar · Answer

Here is the PySpark version, in case anyone is interested:

    hadoop = sc._jvm.org.apache.hadoop
    fs = hadoop.fs.FileSystem
    conf = hadoop.conf.Configuration()
    path = hadoop.fs.Path('/hivewarehouse/disc_mrt.db/unified_fact/')
    # listStatus returns the direct children of the path (non-recursive)
    for f in fs.get(conf).listStatus(path):
        print(f.getPath())

In this particular case, I get a list of all the files that make up the disc_mrt.unified_fact Hive table.

Other methods of the FileStatus object, such as getLen() for the file size, are described here:

Class FileStatus
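As a rough illustration of those FileStatus methods, the sketch below collects the path, size, and directory flag of each entry, and then keeps only the subdirectories, which is what the original question asks for. The helper names describe and subdirectories are hypothetical. The code relies only on getPath(), getLen(), and isDirectory(), and assumes the statuses come from a call such as fs.get(conf).listStatus(path).

```python
def describe(statuses):
    """Return (path, size_in_bytes, is_directory) tuples for
    FileStatus-like objects, e.g. the result of listStatus."""
    rows = []
    for status in statuses:
        rows.append((
            str(status.getPath().toString()),  # full path as a Python string
            status.getLen(),                   # file length in bytes
            status.isDirectory(),              # True for directories
        ))
    return rows


def subdirectories(statuses):
    """Keep only the paths of entries that are directories."""
    return [path for path, _, is_dir in describe(statuses) if is_dir]


# Usage sketch (assumes `fs`, `conf`, and `path` from a live PySpark session):
# for directory in subdirectories(fs.get(conf).listStatus(path)):
#     print(directory)
```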

ozw1z5rd · Answer

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Print the direct children of hdfs:///tmp
    FileSystem.get(sc.hadoopConfiguration)
      .listStatus(new Path("hdfs:///tmp"))
      .foreach(x => println(x.getPath))

This worked for me.

Spark version: 1.5.0-cdh5.5.2

Vincent Claes · Answer

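This did the job for me:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Connect to a specific HDFS nameservice, then list /tmp/
    FileSystem.get(new URI("hdfs://HAservice:9000"), sc.hadoopConfiguration)
      .listStatus(new Path("/tmp/"))
      .foreach(x => println(x.getPath))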

Mithril · Answer

@Tagar's answer did not say how to connect to a remote HDFS, but this one does:

    URI = sc._gateway.jvm.java.net.URI
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

    # Pass the remote namenode URI explicitly
    fs = FileSystem.get(URI("hdfs://somehost:8020"), Configuration())
    status = fs.listStatus(Path('/some_dir/yet_another_one_dir/'))
    for fileStatus in status:
        print(fileStatus.getPath())

Nitin · Answer

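You can also try globStatus, which accepts a glob pattern in the path:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    // `url` is assumed to be a String defined elsewhere; it may contain a glob pattern
    val listStatus = FileSystem.get(new URI(url), sc.hadoopConfiguration)
      .globStatus(new Path(url))
    for (urlStatus <- listStatus) {
      println("urlStatus get Path:" + urlStatus.getPath())
    }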
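One caveat, as far as I understand the Hadoop FileSystem documentation: globStatus returns null when the path contains no glob and does not exist, and an empty array when a glob matches nothing. From PySpark a null result arrives as None, so a small guard avoids iterating over it. The sketch below is a hedged Python equivalent. The helper name glob_paths is hypothetical, and the commented usage assumes a live SparkContext named sc and a placeholder pattern.

```python
def glob_paths(fs, pattern):
    """Return the paths matched by `pattern` as strings,
    treating a null (None) result from globStatus as no matches."""
    statuses = fs.globStatus(pattern)
    if statuses is None:
        return []
    return [str(status.getPath().toString()) for status in statuses]


# Usage sketch (assumes a live SparkContext `sc`; not runnable without Spark):
# hadoop = sc._jvm.org.apache.hadoop
# pattern = hadoop.fs.Path("/some_dir/*")  # placeholder glob pattern
# fs = pattern.getFileSystem(sc._jsc.hadoopConfiguration())
# for p in glob_paths(fs, pattern):
#     print(p)
```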