Lazy method for reading a big file in Python?

Question

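I have a very big 4GB file, and when I try to read it my computer hangs. So I want to read it piece by piece and, after processing each piece, store the processed piece in another file and read the next piece.

Is there any method to yield these pieces?

I would love to have a lazy method.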

nosklo · Accepted Answer

To write a lazy function, just use yield:

    def read_in_chunks(file_object, chunk_size=1024):
        """Lazy function (generator) to read a file piece by piece.
        Default chunk size: 1k."""
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data


    f = open('really_big_file.dat')
    for piece in read_in_chunks(f):
        process_data(piece)

Another option is to use iter and a helper function:

    f = open('really_big_file.dat')

    def read1k():
        return f.read(1024)

    for piece in iter(read1k, ''):
        process_data(piece)

If the file is line-based, the file object is already a lazy generator of lines:

    for line in open('really_big_file.dat'):
        process_data(line)
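The question also asks to save each processed piece to another file before reading the next one. A minimal sketch of that full loop follows, assuming binary data; `process_data` and `convert` are hypothetical names, and the upper-casing is only a stand-in for real processing.

```python
def read_in_chunks(file_object, chunk_size=1024):
    """Yield successive chunks from file_object until it is exhausted."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


def process_data(piece):
    # Hypothetical placeholder for the real work: upper-case the bytes.
    return piece.upper()


def convert(src_path, dst_path, chunk_size=64 * 1024):
    """Stream src_path through process_data into dst_path, chunk by chunk."""
    # Both files are closed automatically, even if processing raises.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for piece in read_in_chunks(src, chunk_size):
            dst.write(process_data(piece))
```

Only one chunk is held in memory at a time, so memory use is governed by `chunk_size` rather than by the size of the input file.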

unbeknown · Answer

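If your computer, OS and Python are 64-bit, then you can use the mmap module to map the contents of the file into memory and access it with indices and slices. Here is an example from the documentation: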

    import mmap

    # assumes hello.txt already exists and contains b"Hello Python!\n"
    with open("hello.txt", "r+b") as f:
        # memory-map the file, size 0 means whole file
        mm = mmap.mmap(f.fileno(), 0)
        # read content via standard file methods
        print(mm.readline())  # prints b"Hello Python!\n"
        # read content via slice notation
        print(mm[:5])  # prints b"Hello"
        # update content using slice notation;
        # note that new content must have same size
        mm[6:] = b" world!\n"
        # ... and read again using standard file methods
        mm.seek(0)
        print(mm.readline())  # prints b"Hello world!\n"
        # close the map
        mm.close()

If your computer, OS or Python is 32-bit, then mmap-ing large files can reserve large parts of your address space and starve your program of memory.
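As a further illustration of the mmap approach, here is a small sketch that searches a file for a byte pattern without reading it into a Python object first. The function name `count_occurrences` is hypothetical; it assumes a non-empty file, since mapping an empty file raises an error.

```python
import mmap


def count_occurrences(path, needle):
    """Count non-overlapping occurrences of the bytes `needle` in a file."""
    if not needle:
        raise ValueError("needle must not be empty")
    count = 0
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the map read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1:
                count += 1
                # Resume the search just past the match we found.
                pos = mm.find(needle, pos + len(needle))
    return count
```

The operating system pages the file in on demand, so the search touches the data without the program holding all of it in memory at once.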

Anshul · Answer

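file.readlines() takes an optional size argument. It is a hint: reading stops once the total size of the lines returned so far reaches roughly that many bytes (or characters), so it approximately bounds how many lines each call returns.

    bigfile = open('bigfilename', 'r')
    tmp_lines = bigfile.readlines(BUF_SIZE)
    while tmp_lines:
        process([line for line in tmp_lines])
        tmp_lines = bigfile.readlines(BUF_SIZE)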
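The same loop can be wrapped in a generator so that the caller just iterates over batches. This is a sketch only; `batches_of_lines` is a hypothetical name, and the exact number of lines per batch depends on the line lengths and on the hint.

```python
import io


def batches_of_lines(file_object, size_hint):
    """Yield lists of whole lines, each list roughly size_hint in total size."""
    while True:
        lines = file_object.readlines(size_hint)
        if not lines:  # an empty list means end of file
            break
        yield lines


# In-memory stand-in for a large file.
sample = io.StringIO("aa\nbb\ncc\ndd\nee\n")
for batch in batches_of_lines(sample, 5):
    print(batch)
```

Every batch contains complete lines only, so no line is ever split across two batches.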

user48678 · Answer

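There are already many good answers, but I recently ran into a similar problem and the solution I needed is not listed here, so I figured I could complement this thread.

80% of the time, I need to read files line by line. Then, as suggested in this answer, you want to use the file object itself as a lazy generator:

    with open('big.csv') as f:
        for line in f:
            process(line)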

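However, I recently ran into a very big (almost) single-line csv, where the row separator was in fact not '\n' but '|'.

Reading line by line was not an option, but I still needed to process it row by row.
Converting '|' to '\n' before processing was also out of the question, because some of the fields of this csv contained '\n' (free text user input).
Using the csv library was also ruled out because, at least in early versions of the lib, it is hardcoded to read the input line by line.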

I came up with the following snippet:

    def rows(f, chunksize=1024, sep='|'):
        """
        Read a file where the row separator is '|' lazily.

        Usage:

        >>> with open('big.csv') as f:
        >>>     for r in rows(f):
        >>>         process(r)
        """
        incomplete_row = None
        while True:
            chunk = f.read(chunksize)
            if not chunk:  # End of file
                if incomplete_row is not None:
                    yield incomplete_row
                break
            # Split the chunk as long as possible
            while True:
                i = chunk.find(sep)
                if i == -1:
                    break
                # If there is an incomplete row waiting to be yielded,
                # prepend it and set it back to None
                if incomplete_row is not None:
                    yield incomplete_row + chunk[:i]
                    incomplete_row = None
                else:
                    yield chunk[:i]
                chunk = chunk[i+1:]
            # If the chunk contained no separator, it needs to be appended to
            # the current incomplete row.
            if incomplete_row is not None:
                incomplete_row += chunk
            else:
                incomplete_row = chunk

I have tested it successfully on big files and with different chunk sizes (I even tried a chunk size of 1 byte, just to make sure the algorithm is not size dependent).
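For comparison, here is a more compact variant of the same idea together with a chunk-size check like the one described above. This is a sketch and not the snippet from this answer: `split_rows` is a hypothetical name, and unlike the snippet it drops an empty row at the very end of the input.

```python
import io


def split_rows(file_object, sep="|", chunk_size=1024):
    """Lazily yield sep-delimited rows, buffering only the unfinished tail."""
    tail = ""
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        # Everything before the last separator is complete;
        # the final piece may continue in the next chunk.
        *complete, tail = (tail + chunk).split(sep)
        yield from complete
    if tail:
        yield tail


data = "a,b|c\nd|e"
for size in (1, 2, 3, 7, 1024):
    # The result must not depend on where the chunk boundaries fall.
    assert list(split_rows(io.StringIO(data), chunk_size=size)) == data.split("|")
```

Rows containing '\n' pass through untouched, because only `sep` is treated as a row boundary.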

myroslav · Answer

    f = ...  # file-like object, i.e. supporting read(size) function and
             # returning empty string '' when there is nothing to read

    def chunked(file, chunk_size):
        return iter(lambda: file.read(chunk_size), '')

    for data in chunked(f, 65536):
        # process the data
        ...

UPDATE: The approach is best explained in https://stackoverflow.com/a/4566523/38592
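One detail worth noting with the two-argument form of iter: the sentinel has to match what read() returns at end of file, which is '' for text streams and b'' for binary ones. With the wrong type the loop never stops. The sketch below sidesteps the choice by asking the stream itself; it assumes that `read(0)` returns the stream's empty value, which holds for the standard io classes.

```python
import io
from functools import partial


def chunked(file_object, chunk_size):
    """Return an iterator of chunks that ends at end of file."""
    # Reading zero units yields the empty value: '' for text, b'' for binary.
    sentinel = file_object.read(0)
    return iter(partial(file_object.read, chunk_size), sentinel)


print(list(chunked(io.BytesIO(b"abcde"), 2)))   # [b'ab', b'cd', b'e']
print(list(chunked(io.StringIO("abcde"), 2)))   # ['ab', 'cd', 'e']
```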

TonyCoolZhu · Answer

I think we can write it like this:

    def read_file(path, block_size=1024):
        with open(path, 'rb') as f:
            while True:
                piece = f.read(block_size)
                if piece:
                    yield piece
                else:
                    return

    for piece in read_file(path):
        process_piece(piece)

bruce · Answer

Refer to Python's official documentation: https://docs.python.org/zh-cn/3/library/functions.html?#iter

Maybe this method is more Pythonic:

    from functools import partial

    """A file object returned by open() is an iterator with a
    read method which can specify the current read's block size"""
    with open('mydata.db', 'r') as f_in:
        part_read = partial(f_in.read, 1024*1024)
        # the file is opened in text mode, so the sentinel is '' (not b'')
        iterator = iter(part_read, '')
        for index, block in enumerate(iterator, start=1):
            block = process_block(block)  # process block data
            with open(f'{index}.txt', 'w') as f_out:
                f_out.write(block)

SilentGhost · Answer

I'm in a somewhat similar situation. It's not clear whether you know the chunk size in bytes; I usually don't, but the number of records (lines) that is required is known:

    def get_line():
        with open('4gb_file') as file:
            for i in file:
                yield i

    lines_required = 100
    gen = get_line()
    chunk = [i for i, j in zip(gen, range(lines_required))]

UPDATE: Thanks nosklo. Here's what I meant. The code above almost works, except that it loses a line 'between' chunks.

    chunk = [next(gen) for i in range(lines_required)]

This does the trick without losing any lines, but it doesn't look very nice.
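A possible alternative for taking a fixed number of lines at a time is itertools.islice. It consumes exactly the requested number of items, so no line is lost between chunks, and a short final chunk simply comes back shorter instead of raising StopIteration. The name `line_batches` below is hypothetical.

```python
import io
from itertools import islice


def line_batches(file_object, lines_per_batch):
    """Yield lists of up to lines_per_batch lines without skipping any."""
    while True:
        batch = list(islice(file_object, lines_per_batch))
        if not batch:  # nothing left to read
            break
        yield batch


# In-memory stand-in for a large file.
sample = io.StringIO("1\n2\n3\n4\n5\n")
for batch in line_batches(sample, 2):
    print(batch)
```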

sinzi · Answer

I'm not allowed to comment due to low reputation, but SilentGhost's solution should be much easier with file.readlines([sizehint]).

Python file methods

Edit: SilentGhost is right, but this should be better than:

    s = ""
    for i in range(100):
        s += next(file)

crizCraig · Answer

To process line by line, this is an elegant solution:

    def stream_lines(file_name):
        file = open(file_name)
        while True:
            line = file.readline()
            if not line:
                file.close()
                break
            yield line

As long as there are no blank lines.