How do I decompress .gz files into a new directory in Hadoop?

Question

I have many .gz files in a folder in HDFS. I want to decompress all of these .gz files into a new folder in HDFS. How can I do this?

Manjunath Ballur · Accepted Answer

I can think of three different ways to achieve this.

Using the Linux command line

The following command worked for me.
```
hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt 
```
The gzip-compressed input file is /tmp/Links.txt.gz.
The decompressed output is written to /tmp/unzipped/Links.txt.
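
Because the question is about many files, the single-file pipeline can be wrapped in a loop. The following is a minimal sketch rather than a hardened script: the function name `decompress_folder` and its two arguments are hypothetical, it assumes the path is the eighth column of `hadoop fs -ls` output, and it does not handle paths that contain whitespace.

```shell
# Sketch: decompress every .gz file in one HDFS folder into another HDFS folder.
# decompress_folder and its arguments are illustrative names, not part of Hadoop.
decompress_folder() {
    local src_dir="$1"   # HDFS folder holding the .gz files
    local dst_dir="$2"   # HDFS folder that receives the decompressed files
    local gz_path name

    # Create the target folder if it does not exist yet.
    hadoop fs -mkdir -p "$dst_dir" || return 1

    # Assumption: `hadoop fs -ls` prints the path in column 8.
    # The filter keeps only paths ending in .gz and skips the "Found N items" line.
    hadoop fs -ls "$src_dir" | awk '$8 ~ /\.gz$/ {print $8}' |
    while read -r gz_path; do
        name=$(basename "$gz_path" .gz)
        # Same pipeline as the single-file command above, run once per file.
        # </dev/null keeps the command from consuming the loop's input.
        hadoop fs -cat "$gz_path" </dev/null | gzip -d | hadoop fs -put - "$dst_dir/$name"
    done
}
```

A call such as `decompress_folder /tmp/gzipped /tmp/unzipped` (hypothetical paths) would process the files one at a time. All data streams through the machine running the script, so this suits moderate volumes better than very large ones.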

Using a Java program

The book Hadoop: The Definitive Guide has a section on codecs. That section contains a program that decompresses a file using CompressionCodecFactory. I am reproducing that code as is:

    package com.myorg.hadooptests;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;

    public class FileDecompressor {
        public static void main(String[] args) throws Exception {
            String uri = args[0];
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            Path inputPath = new Path(uri);

            // Pick the codec from the file extension (for example, .gz).
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodec(inputPath);
            if (codec == null) {
                System.err.println("No codec found for " + uri);
                System.exit(1);
            }

            // The output path is the input path with the codec's extension removed.
            String outputUri = CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

            InputStream in = null;
            OutputStream out = null;
            try {
                in = codec.createInputStream(fs.open(inputPath));
                out = fs.create(new Path(outputUri));
                IOUtils.copyBytes(in, out, conf);
            } finally {
                IOUtils.closeStream(in);
                IOUtils.closeStream(out);
            }
        }
    }

This code takes the path of a .gz file as input.
You can run it as follows:

FileDecompressor <gzipped file name>

For example, when I ran it on my gzipped file:

FileDecompressor /tmp/Links.txt.gz

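I got the decompressed file at: /tmp/Links.txt

The program stores the decompressed file in the same folder as the input. To write to a new folder, you need to modify the code to take two input parameters: <input file path> and <output folder>.

Once you have this program working, you can write a shell/Perl/Python script to call it for each input file.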
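
As a rough sketch of such a driver, the shell function below calls the unmodified program for each file and then moves the result into the output folder with `hadoop fs -mv`, which sidesteps the two-parameter change. The function name `decompress_each` and the jar path passed to it are hypothetical; the class name is the one from the code above, and the sketch assumes every input ends in .gz.

```shell
# Sketch: run FileDecompressor on each given HDFS file, then move the output.
# decompress_each is an illustrative name; the jar path is supplied by the caller.
decompress_each() {
    local jar="$1"       # jar containing com.myorg.hadooptests.FileDecompressor
    local dst_dir="$2"   # HDFS folder that receives the decompressed files
    local gz_path
    shift 2              # the remaining arguments are the .gz files in HDFS

    hadoop fs -mkdir -p "$dst_dir" || return 1

    for gz_path in "$@"; do
        # The program writes its output next to the input, without the .gz suffix.
        hadoop jar "$jar" com.myorg.hadooptests.FileDecompressor "$gz_path" || return 1
        # Move that output into the target folder.
        hadoop fs -mv "${gz_path%.gz}" "$dst_dir/" || return 1
    done
}
```

For example, `decompress_each filedecompressor.jar /tmp/unzipped /tmp/Links.txt.gz` would leave the result at /tmp/unzipped/Links.txt, assuming a jar of that name has been built.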

Using a Pig script

You can write a simple Pig script to achieve this.

I wrote the following script, which works:
```
A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
STORE A INTO '/tmp/tmp_unzipped/' USING PigStorage();
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
```
When you run this script, the decompressed contents are stored in a temporary folder, /tmp/tmp_unzipped. This folder will contain:
```
/tmp/tmp_unzipped/_SUCCESS
/tmp/tmp_unzipped/part-m-00000
```
part-m-00000 contains the decompressed file.

So the script renames it explicitly and finally deletes the /tmp/tmp_unzipped folder, using the following commands:
```
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
```
So if you use this Pig script, you only need to take care of parameterizing the file names (Links.txt.gz and Links.txt).

Again, once you have this script working, you can write a shell/Perl/Python script to call this Pig script for each input file.
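
One possible way to parameterize the file names is Pig's `-param` option, which substitutes `$NAME` placeholders in the script text. The sketch below is an assumption-laden illustration, not the author's script: the function name `unzip_with_pig` and the parameter names `INPUT`, `TMP_DIR`, and `OUTPUT` are made up here, and it assumes the job produces a single part-m-00000 file, as in the example above.

```shell
# Sketch: run the Pig script above with the file names passed as parameters.
# unzip_with_pig, INPUT, TMP_DIR and OUTPUT are illustrative names.
unzip_with_pig() {
    local input="$1"     # e.g. /tmp/Links.txt.gz
    local output="$2"    # e.g. /tmp/unzipped/Links.txt
    local tmp_dir="$3"   # e.g. /tmp/tmp_unzipped
    local script
    script=$(mktemp) || return 1

    # Quoted heredoc: the shell leaves $INPUT etc. alone so that Pig's own
    # parameter substitution can fill them in.
    cat > "$script" <<'EOF'
A = LOAD '$INPUT' USING PigStorage();
STORE A INTO '$TMP_DIR' USING PigStorage();
mv $TMP_DIR/part-m-00000 $OUTPUT
rm $TMP_DIR
EOF

    pig -param INPUT="$input" -param TMP_DIR="$tmp_dir" -param OUTPUT="$output" -f "$script"
    local status=$?
    rm -f "$script"
    return "$status"
}
```

A wrapper loop could then call `unzip_with_pig /tmp/Links.txt.gz /tmp/unzipped/Links.txt /tmp/tmp_unzipped` once per input file. Each call starts a separate Pig job, so this is likely to be slow for a large number of small files.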

Atais · Answer

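Bash solution

In my case, I did not want to pipe-decompress the files, because I was not sure about their contents. Instead, I wanted to make sure that all the files inside the zip files were extracted onto HDFS.

I wrote a simple bash script. The comments should give you a clue about what is going on, and a short description follows below.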

    #!/bin/bash

    workdir=/tmp/unziphdfs/
    mkdir -p "$workdir"
    # stop if the work directory is unusable, since rm -rf runs below
    cd "$workdir" || exit 1

    # get all zip files in a folder
    zips=$(hadoop fs -ls /yourpath/*.zip | awk '{print $8}')
    for hdfsfile in $zips
    do
        echo "$hdfsfile"

        # copy to temp folder to unpack
        hdfs dfs -copyToLocal "$hdfsfile" "$workdir"

        hdfsdir=$(dirname "$hdfsfile")
        zipname=$(basename "$hdfsfile")

        # unpack locally and remove
        unzip "$zipname"
        rm -rf "$zipname"

        # copy files back to hdfs
        files=$(ls "$workdir")
        for file in $files; do
            hdfs dfs -copyFromLocal "$file" "$hdfsdir"
            rm -rf "$file"
        done

        # optionally remove the zip file from hdfs?
        # hadoop fs -rm -skipTrash $hdfsfile
    done

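Description

1. Get all the *.zip files in an HDFS directory
2. One by one: copy the zip to a temporary directory (on the local filesystem)
3. Unzip
4. Copy all the extracted files to the HDFS directory of the zip file
5. Clean up

I also managed to make it work for a subdirectory structure with many zip files in each, using /mypath/*/*.zip.

Good luck :)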

tk421 · Answer

If you have compressed text files, hadoop fs -text supports gzip along with other common compression formats (snappy, lzo).

hadoop fs -text /tmp/a.gz | hadoop fs -put - /tmp/uncompressed_a

Durga Viswanath Gadiraju · Answer

You can do this using Hive (assuming it is text data).

    create external table source (t string) location '<directory_with_gz_files>';
    create external table target (t string) location '<target_dir>';
    insert into table target select * from source;

The data will be decompressed into a new set of files.
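
The Hive statements can also be run non-interactively with `hive -e`. The sketch below is only an illustration under some assumptions: the function name `decompress_with_hive` is hypothetical, the `SET hive.exec.compress.output=false;` line and the `DROP TABLE` cleanup are additions that the answer does not mention, and the paths are inserted into the query unescaped, so they must not contain quotes.

```shell
# Sketch: run the Hive statements for a given source and target folder.
# decompress_with_hive is an illustrative name, not a Hive feature.
decompress_with_hive() {
    local src_dir="$1"   # HDFS folder with the .gz text files
    local dst_dir="$2"   # HDFS folder for the decompressed output

    # Addition (assumption): make sure Hive does not compress the output files.
    # Dropping an external table removes only its metadata, not the files.
    hive -e "
SET hive.exec.compress.output=false;
CREATE EXTERNAL TABLE source (t STRING) LOCATION '$src_dir';
CREATE EXTERNAL TABLE target (t STRING) LOCATION '$dst_dir';
INSERT INTO TABLE target SELECT * FROM source;
DROP TABLE source;
DROP TABLE target;
"
}
```

A call would look like `decompress_with_hive /data/gz /data/plain` (hypothetical paths). Because each line is read as a single string column with Hive's default delimiters, this is only appropriate for plain text data, as the answer says.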

If you do not want the file names to change, and you have enough storage on the node where you are running the commands, you can do this:

    hadoop fs -get <your_source_directory> <directory_name>
    # This creates a directory where you run the hadoop command.
    # cd into it and gunzip all the files:
    cd <directory_name>
    gunzip *.gz
    cd ..
    hadoop fs -moveFromLocal <directory_name> <target_hdfs_path>
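
The same steps can be collected into one function. This is a minimal sketch assuming enough local disk space for the whole folder; the name `gunzip_via_local` and the use of a `mktemp` scratch directory are choices made for this illustration.

```shell
# Sketch: copy an HDFS folder to local disk, gunzip it, and move it back.
# gunzip_via_local is an illustrative name, not a Hadoop command.
gunzip_via_local() {
    local src_dir="$1"       # HDFS folder with the .gz files
    local target_path="$2"   # HDFS path that receives the decompressed folder
    local work_dir local_dir

    # Scratch directory on the local filesystem.
    work_dir=$(mktemp -d) || return 1
    local_dir="$work_dir/$(basename "$src_dir")"

    hadoop fs -get "$src_dir" "$local_dir" || return 1

    # gunzip replaces each x.gz with x, so the original names are preserved.
    find "$local_dir" -type f -name '*.gz' -exec gunzip {} + || return 1

    # moveFromLocal uploads the folder and removes the local copy.
    hadoop fs -moveFromLocal "$local_dir" "$target_path" || return 1
    rm -rf "$work_dir"
}
```

For example, `gunzip_via_local /data/gz /data/plain` (hypothetical paths) would leave the decompressed files under /data/plain with their original names minus the .gz suffix.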