Download and unzip a .zip file without writing to disk

Question

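I have managed to get my first Python script to work. It downloads a list of .zip files from a URL, then extracts the zip files and writes them to disk.

I am now at a loss as to how to achieve the next step.

My primary goal is to download and extract the zip file and pass the contents (CSV data) via a TCP stream. If I can get away with it, I would prefer not to write the zip or the extracted files to disk at all.

Here is my current script, which works but unfortunately has to write the files to disk.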

    import urllib, urllister
    import zipfile
    import urllib2
    import os
    import time
    import pickle

    # check for extraction directories existence
    if not os.path.isdir('downloaded'):
        os.makedirs('downloaded')

    if not os.path.isdir('extracted'):
        os.makedirs('extracted')

    # open logfile for downloaded data and save to local variable
    if os.path.isfile('downloaded.pickle'):
        downloadedLog = pickle.load(open('downloaded.pickle'))
    else:
        downloadedLog = {'key':'value'}

    # remove entries older than 5 days (to maintain speed)

    # path of Zip files
    zipFileURL = "http://www.thewebserver.com/that/contains/a/directory/of/Zip/files"

    # retrieve list of URLs from the webservers
    usock = urllib.urlopen(zipFileURL)
    parser = urllister.URLLister()
    parser.feed(usock.read())
    usock.close()
    parser.close()

    # only parse urls
    for url in parser.urls:
        if "PUBLIC_P5MIN" in url:

            # download the file
            downloadURL = zipFileURL + url
            outputFilename = "downloaded/" + url

            # check if file already exists on disk
            if url in downloadedLog or os.path.isfile(outputFilename):
                print "Skipping " + downloadURL
                continue

            print "Downloading ",downloadURL
            response = urllib2.urlopen(downloadURL)
            zippedData = response.read()

            # save data to disk
            print "Saving to ",outputFilename
            output = open(outputFilename,'wb')
            output.write(zippedData)
            output.close()

            # extract the data
            zfobj = zipfile.ZipFile(outputFilename)
            for name in zfobj.namelist():
                uncompressed = zfobj.read(name)

                # save uncompressed data to disk
                outputFilename = "extracted/" + name
                print "Saving extracted file to ",outputFilename
                output = open(outputFilename,'wb')
                output.write(uncompressed)
                output.close()

                # send data via tcp stream

                # file successfully downloaded and extracted store into local log and filesystem log
                downloadedLog[url] = time.time();
                pickle.dump(downloadedLog, open('downloaded.pickle', "wb" ))

senderle · Accepted Answer

My suggestion would be to use a StringIO object. It emulates a file but resides in memory, so you could do something like this:

    # get_Zip_data() gets a Zip archive containing 'foo.txt', reading 'hey, foo'

    import zipfile
    from StringIO import StringIO

    zipdata = StringIO()
    zipdata.write(get_Zip_data())
    myzipfile = zipfile.ZipFile(zipdata)
    foofile = myzipfile.open('foo.txt')
    print foofile.read()

    # output: "hey, foo"

Or more simply (apologies to Vishal):

    myzipfile = zipfile.ZipFile(StringIO(get_Zip_data()))
    for name in myzipfile.namelist():
        [ ... ]

In Python 3, use BytesIO instead of StringIO.
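As a rough Python 3 illustration of this idea applied to the question's goal, the sketch below reads every archive member from memory and writes it to a connected socket. The helper names (make_demo_zip, iter_members, send_members) are hypothetical, the demo archive stands in for the real download, and sending the raw bytes with no framing between files is an assumption about what the receiving end expects.

```python
import io
import socket
import zipfile


def make_demo_zip():
    """Build a tiny archive in memory (stand-in for the downloaded bytes)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("foo.txt", "hey, foo")
    return buf.getvalue()


def iter_members(zip_bytes):
    """Yield (name, content) for every entry in the archive, all in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            yield name, zf.read(name)


def send_members(zip_bytes, sock):
    """Write each member's raw bytes to a connected socket; return bytes sent."""
    total = 0
    for _name, content in iter_members(zip_bytes):
        sock.sendall(content)
        total += len(content)
    return total


if __name__ == "__main__":
    # socketpair() returns two already connected sockets, handy for a local demo.
    left, right = socket.socketpair()
    send_members(make_demo_zip(), left)
    left.close()
    print(right.recv(1024).decode("utf-8"))
    right.close()
```

In a real script the bytes would come from the HTTP response (for example urllib.request.urlopen(...).read()) and the socket from socket.create_connection((host, port)), where host and port are whatever the receiving service uses.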

Vishal · Answer

Below is a code snippet I used to fetch a zipped csv file, please have a look:

Python 2:

    from StringIO import StringIO
    from zipfile import ZipFile
    from urllib import urlopen

    resp = urlopen("http://www.test.com/file.Zip")
    zipfile = ZipFile(StringIO(resp.read()))
    for line in zipfile.open(file).readlines():
        print line

Python 3:

    from io import BytesIO
    from zipfile import ZipFile
    from urllib.request import urlopen
    # or: requests.get(url).content

    resp = urlopen("http://www.test.com/file.Zip")
    zipfile = ZipFile(BytesIO(resp.read()))
    for line in zipfile.open(file).readlines():
        print(line.decode('utf-8'))

Here, file is a string: the name of a file inside the archive. To get the actual string that you want to pass, you can use zipfile.namelist(). For instance,

    resp = urlopen('http://mlg.ucd.ie/files/datasets/bbc.Zip')
    zipfile = ZipFile(BytesIO(resp.read()))
    zipfile.namelist()
    # ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']

Zubo · Answer

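This is an updated Python 3 version of Vishal's excellent answer, which used Python 2. Some of the changes may already have been mentioned elsewhere.

    from io import BytesIO
    from zipfile import ZipFile
    import urllib.request

    url = urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/loc162txt.Zip")

    with ZipFile(BytesIO(url.read())) as my_Zip_file:
        for contained_file in my_Zip_file.namelist():
            # with open(("unzipped_and_read_" + contained_file + ".file"), "wb") as output:
            for line in my_Zip_file.open(contained_file).readlines():
                print(line)
                # output.write(line)

Necessary changes:

- Python 3 has no StringIO module (it moved to io.StringIO). The code uses io.BytesIO instead, because it handles a byte stream.
- urlopen is now urllib.request.urlopen.

Note:

In Python 3, the printed output lines will look like this: _b'some text'_. This is expected, since they are not strings; remember, we are reading a byte stream. Have a look at Dan04's excellent answer.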
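If text rather than bytes is wanted, one option is to wrap the archive member in io.TextIOWrapper so each line is decoded while it is read. The sketch below is only an illustration: read_text_lines is a hypothetical helper name, and UTF-8 is an assumed encoding that has to match the actual file.

```python
import io
import zipfile


def read_text_lines(zip_bytes, member, encoding="utf-8"):
    """Return the lines of one archive member as str, without touching disk."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(member) as raw:
            # TextIOWrapper decodes the byte stream as it is iterated.
            text = io.TextIOWrapper(raw, encoding=encoding)
            return [line.rstrip("\r\n") for line in text]


if __name__ == "__main__":
    # Build a small archive in memory so the example runs without a download.
    demo = io.BytesIO()
    with zipfile.ZipFile(demo, "w") as zf:
        zf.writestr("data.csv", "a,b\n1,2\n")
    for line in read_text_lines(demo.getvalue(), "data.csv"):
        print(line)
```

Calling line.decode('utf-8') on each bytes line, as in the Python 3 snippet earlier in the thread, gives the same text; the wrapper just keeps the decoding in one place.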

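A few minor changes I made:

- I use _with ... as_ instead of _zipfile = ..._, following the docs.
- The script now uses namelist() to cycle through all the files in the zip and print their contents.
- I moved the creation of the ZipFile object into the with statement, although I am not sure whether that is better.
- In response to NumenorForLife's comment, I added (and commented out) an option to write the byte stream to a file, one per file in the zip. It adds "unzipped_and_read_" to the beginning of the file name and a ".file" extension (I prefer not to use ".txt" for files that contain byte strings). The indentation of the code will, of course, need to be adjusted if you want to use it.
  - Be careful here: because we have a byte string, we use binary mode, hence "wb". I have a feeling that writing binary opens a can of worms anyway...
- I am using an example file, the UN/LOCODE text archive.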

What I didn't do:

NumenorForLife asked about saving the zip to disk. I am not sure what he meant by that. Downloading the zip file? That is a different task; see Oleh Prypin's excellent answer.

Here is a way:

    import urllib.request
    import shutil

    # The response is a byte stream, so the output file is opened in binary mode.
    with urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/2015-2_UNLOCODE_SecretariatNotes.pdf") as response, open("downloaded_file.pdf", 'wb') as out_file:
        shutil.copyfileobj(response, out_file)

ninjagecko · Answer

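Write to a temporary file that resides in RAM

It turns out the tempfile module ( http://docs.python.org/library/tempfile.html ) has just the thing:

tempfile.SpooledTemporaryFile([max_size=0[, mode='w+b'[, bufsize=-1[, suffix=''[, prefix='tmp'[, dir=None]]]]]])

This function operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file's fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().

The resulting file has one additional method, rollover(), which causes the file to roll over to an on-disk file regardless of its size.

The returned object is a file-like object whose _file attribute is either a StringIO object or a true file object, depending on whether rollover() has been called. This file-like object can be used in a with statement, just like a normal file.

New in version 2.6.

Or, if you are lazy and have a tmpfs-mounted /tmp on Linux, you can just make a file there, but you have to delete it yourself and deal with naming.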
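A minimal Python 3 sketch of how this could be used for extracted data: the member is copied into a SpooledTemporaryFile, which stays in memory unless it grows past max_size. The name spool_member and the 1 MiB default are assumptions made for illustration. In Python 3 with the default binary mode, the in-memory _file is an io.BytesIO rather than a StringIO.

```python
import io
import shutil
import tempfile
import zipfile


def spool_member(zip_bytes, member, max_size=1024 * 1024):
    """Copy one archive member into a SpooledTemporaryFile and rewind it."""
    spool = tempfile.SpooledTemporaryFile(max_size=max_size)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(member) as src:
            shutil.copyfileobj(src, spool)
    spool.seek(0)  # ready for reading by the caller
    return spool


if __name__ == "__main__":
    # Build a small archive in memory so the example runs without a download.
    demo = io.BytesIO()
    with zipfile.ZipFile(demo, "w") as zf:
        zf.writestr("data.csv", "a,b\n1,2\n")
    with spool_member(demo.getvalue(), "data.csv") as spooled:
        # _file is the attribute described in the quoted documentation.
        print("in memory:", isinstance(spooled._file, io.BytesIO))
        print(spooled.read())
```

The caller is responsible for closing the returned object, which also removes the on-disk file if a rollover happened.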

lababidi · Answer

I'd like to add my Python 3 answer for completeness:

    from io import BytesIO
    from zipfile import ZipFile
    import requests

    def get_Zip(file_url):
        url = requests.get(file_url)
        zipfile = ZipFile(BytesIO(url.content))
        Zip_names = zipfile.namelist()
        # a single-file archive returns one file object, otherwise a list of them
        if len(Zip_names) == 1:
            file_name = Zip_names.pop()
            extracted_file = zipfile.open(file_name)
            return extracted_file
        return [zipfile.open(file_name) for file_name in Zip_names]
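One possible variation, sketched here under the assumption that the archive is small enough to hold fully in memory, returns the same type regardless of how many files the archive contains: a dict that maps each file name to its bytes. unzip_to_dict is a hypothetical name, and the function takes the already downloaded bytes (for example requests.get(file_url).content) so the sketch itself needs no network access.

```python
import io
import zipfile


def unzip_to_dict(zip_bytes):
    """Map each file name in the archive to its uncompressed bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {
            info.filename: zf.read(info)
            for info in zf.infolist()
            if not info.is_dir()  # skip directory entries
        }


if __name__ == "__main__":
    # Build a small archive in memory so the example runs without a download.
    demo = io.BytesIO()
    with zipfile.ZipFile(demo, "w") as zf:
        zf.writestr("a.csv", "x,y\n1,2\n")
        zf.writestr("b.txt", "notes")
    for name, content in unzip_to_dict(demo.getvalue()).items():
        print(name, len(content))
```

Because the contents are read before the ZipFile is closed, the caller does not have to keep the archive object alive, at the cost of loading every member at once.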

Akson · Answer

Adding on to the other answers, using requests:

    # download from web
    import requests
    url = 'http://mlg.ucd.ie/files/datasets/bbc.Zip'
    content = requests.get(url)

    # unzip the content
    from io import BytesIO
    from zipfile import ZipFile
    f = ZipFile(BytesIO(content.content))
    print(f.namelist())

    # outputs ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']

Use help(f) to see more of the available methods, for example extractall(), which extracts the contents of the zip file so that they can later be opened with _with open_.
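Note that extractall() does write to disk. If that is acceptable for a short time, a sketch like the following keeps the files in a temporary directory that is deleted afterwards. extract_to_temp_dir and the demo file name notes.txt are hypothetical, chosen only for illustration.

```python
import io
import os
import tempfile
import zipfile


def extract_to_temp_dir(zip_bytes):
    """Extract an in-memory archive into a fresh temporary directory."""
    tmp = tempfile.TemporaryDirectory()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        zf.extractall(tmp.name)
    return tmp  # use as a context manager; the directory is removed on exit


if __name__ == "__main__":
    # Build a small archive in memory so the example runs without a download.
    demo = io.BytesIO()
    with zipfile.ZipFile(demo, "w") as zf:
        zf.writestr("notes.txt", "hello")
    with extract_to_temp_dir(demo.getvalue()) as path:
        with open(os.path.join(path, "notes.txt"), "rb") as fh:
            print(fh.read())
```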

plowman · Answer

It wasn't obvious from Vishal's answer what the file name is supposed to be when there is no file on disk. I have adapted his answer so that it works without modification for most needs.

    from StringIO import StringIO
    from zipfile import ZipFile
    from urllib import urlopen

    def unzip_string(zipped_string):
        # concatenate the contents of every file in the archive
        unzipped_string = ''
        zipfile = ZipFile(StringIO(zipped_string))
        for name in zipfile.namelist():
            unzipped_string += zipfile.open(name).read()
        return unzipped_string
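The function above is Python 2. A Python 3 counterpart might look like the sketch below, where the input and the result are bytes rather than str; unzip_bytes is a hypothetical name chosen to reflect that.

```python
from io import BytesIO
from zipfile import ZipFile


def unzip_bytes(zipped_bytes):
    """Concatenate the contents of every file in an in-memory archive."""
    with ZipFile(BytesIO(zipped_bytes)) as archive:
        return b"".join(archive.read(name) for name in archive.namelist())


if __name__ == "__main__":
    # Build a small archive in memory so the example runs without a download.
    demo = BytesIO()
    with ZipFile(demo, "w") as zf:
        zf.writestr("a.txt", "one")
        zf.writestr("b.txt", "two")
    print(unzip_bytes(demo.getvalue()))
```

The result can be decoded with .decode() once the encoding of the files is known.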

Martien Lubberink · Answer

Vishal's example, however great, is confusing when it comes to the file name, and I do not see the merit of redefining 'zipfile'.

Here is my example, which downloads a zip that contains several files, one of which is a csv file that I subsequently read into a pandas DataFrame:

    from StringIO import StringIO
    from zipfile import ZipFile
    from urllib import urlopen
    import pandas

    url = urlopen("https://www.federalreserve.gov/apps/mdrm/pdf/MDRM.Zip")
    zf = ZipFile(StringIO(url.read()))
    for item in zf.namelist():
        print("File in Zip: "+ item)
    # find the first matching csv file in the Zip:
    match = [s for s in zf.namelist() if ".csv" in s][0]
    # the first line of the file contains a string - that line shall be ignored, hence skiprows
    df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])

(Note: I use Python 2.7.13.)

This is the exact solution that worked for me. I tweaked it slightly for Python 3 by removing StringIO and using the io library instead.

Python 3 version

    from io import BytesIO
    from zipfile import ZipFile
    import pandas
    import requests

    url = "https://www.nseindia.com/content/indices/mcwb_jun19.Zip"
    content = requests.get(url)
    zf = ZipFile(BytesIO(content.content))

    for item in zf.namelist():
        print("File in Zip: "+ item)

    # find the first matching csv file in the Zip:
    match = [s for s in zf.namelist() if ".csv" in s][0]
    # the first line of the file contains a string - that line shall be ignored, hence skiprows
    df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])
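Where pandas is not available, roughly the same steps can be sketched with the standard csv module. This is not a drop-in replacement for read_csv: first_csv_rows is a hypothetical helper, UTF-8 is an assumed encoding, and the result is a plain list of rows (including the header row that follows the skipped line).

```python
import csv
import io
import zipfile


def first_csv_rows(zip_bytes, skip_rows=1, encoding="utf-8"):
    """Parse the first .csv member of an in-memory archive into a list of rows."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        matches = [n for n in zf.namelist() if n.lower().endswith(".csv")]
        if not matches:
            raise ValueError("archive contains no .csv file")
        with zf.open(matches[0]) as raw:
            # newline="" lets the csv module handle line endings itself.
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            reader = csv.reader(text)
            for _ in range(skip_rows):
                next(reader, None)  # drop leading non-data lines
            return list(reader)


if __name__ == "__main__":
    # Build a small archive in memory so the example runs without a download.
    demo = io.BytesIO()
    with zipfile.ZipFile(demo, "w") as zf:
        zf.writestr("readme.txt", "not a csv")
        zf.writestr("data.csv", "title line\nx,y\n1,2\n")
    for row in first_csv_rows(demo.getvalue()):
        print(row)
```

Matching on the ".csv" suffix instead of a substring avoids picking up names such as "notes.csv.bak", which the substring test in the snippets above would also match.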