Read data from a csv file chunk by chunk, reverse it, and copy it to a new csv file

Question

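Assume I'm dealing with a very large csv file, so I can only read the data into memory chunk by chunk. The expected flow of events is as follows:

1) Read a chunk of data (e.g. 10 rows) from the csv using pandas.

2) Reverse the order of the data.

3) Copy each row, in reverse, to a new csv file. So each chunk (10 rows) is written to the csv from the beginning in reversed order.

In the end the csv file should be in reversed order, and this should be done without loading the entire file into memory, on Windows OS.

I am trying to do time series forecasting, for which I need the data ordered from oldest to latest (oldest entry in the first row). I can't load the entire file into memory, so I'm looking for a way to do it one chunk at a time, if that's possible.

The dataset I tried is train.csv from the Rossmann dataset on Kaggle. You can get it from this github repo.

In my attempt, the rows are not copied to the new csv file properly.

My code is shown below: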

    import pandas as pd
    import csv

    def reverse():
        fields = ["Store","DayOfWeek","Date","Sales","Customers","Open","Promo","StateHoliday",
                  "SchoolHoliday"]
        with open('processed_train.csv', mode='a') as stock_file:
            writer = csv.writer(stock_file,delimiter=',', quotechar='"',
                                quoting=csv.QUOTE_MINIMAL)
            writer.writerow(fields)

        for chunk in pd.read_csv("train.csv", chunksize=10):
            store_data = chunk.reindex(index=chunk.index[::-1])
            append_data_csv(store_data)

    def append_data_csv(store_data):
        with open('processed_train.csv', mode='a') as store_file:
            writer = csv.writer(store_file,delimiter=',', quotechar='"',
                                quoting=csv.QUOTE_MINIMAL)
            for index, row in store_data.iterrows():
                print(row)
                writer.writerow([row['Store'],row['DayOfWeek'],row['Date'],row['Sales'],
                                 row['Customers'],row['Open'],row['Promo'],
                                 row['StateHoliday'],row['SchoolHoliday']])

    reverse()

Thank you in advance.
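One likely reason the attempt above does not produce a fully reversed file: each chunk is reversed internally, but the chunks are still appended in their original order. The sketch below illustrates this with a plain list standing in for the csv rows; the function names are hypothetical and chosen only for this illustration.

```python
def reverse_within_chunks(rows, n):
    """Reverse each chunk of n rows, but keep the chunks in their original order."""
    out = []
    for start in range(0, len(rows), n):
        out.extend(rows[start:start + n][::-1])
    return out


def reverse_chunks_and_order(rows, n):
    """Reverse each chunk of n rows, then emit the chunks from last to first."""
    chunks = [rows[start:start + n][::-1] for start in range(0, len(rows), n)]
    out = []
    for chunk in reversed(chunks):
        out.extend(chunk)
    return out


rows = [1, 2, 3, 4, 5, 6, 7]
print(reverse_within_chunks(rows, 3))     # [3, 2, 1, 6, 5, 4, 7]
print(reverse_chunks_and_order(rows, 3))  # [7, 6, 5, 4, 3, 2, 1]
```

Only the second variant yields a fully reversed sequence, which is why the chunk-based answers below keep the reversed chunks around and then write them out from last to first.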

gustavovelascoh · Answer

Using bash, you can tail the whole file except the first line, reverse it, and store it with this:

tail -n +2 train.csv | tac > train_rev.csv

If you want to keep the header in the reversed file, write the header first and then append the reversed content:

head -1 train.csv > train_rev.csv; tail -n +2 train.csv | tac >> train_rev.csv
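Since the question mentions Windows, where tail and tac may not be available, here is a rough Python sketch of the same idea. It is a different technique from the bash pipeline: it scans the file once to record the byte offset of every line, then seeks to those offsets in reverse. The function name reverse_csv_by_offsets and its keep_header parameter are assumptions made for this illustration. It keeps one 8-byte offset per line in memory rather than the file contents, and it assumes no field contains an embedded newline.

```python
from array import array


def reverse_csv_by_offsets(src, dst, keep_header=True):
    """Write the lines of src to dst in reverse order (hypothetical helper)."""
    offsets = array('Q')  # byte offset at which each line starts
    with open(src, 'rb') as fin:
        # First pass: record where every line begins.
        pos = 0
        for line in fin:
            offsets.append(pos)
            pos += len(line)

        with open(dst, 'wb') as fout:
            first_data_line = 0
            if keep_header and offsets:
                fin.seek(0)
                fout.write(fin.readline())  # copy the header unchanged
                first_data_line = 1

            # Second pass: visit the remaining lines from last to first.
            for offset in reversed(offsets[first_data_line:]):
                fin.seek(offset)
                line = fin.readline()
                if not line.endswith(b'\n'):
                    line += b'\n'  # the last line of src may lack a newline
                fout.write(line)
```

Calling reverse_csv_by_offsets('train.csv', 'train_rev.csv') would then play the role of the second command above. One seek per line is slower than tac, so treat this as a portability fallback rather than a tuned implementation.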

Mark Warburton · Answer

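This does exactly what you're requesting, but without Pandas. It reads intest.csv line by line (as opposed to reading the whole file into RAM). It does most of the processing using the file system, via a series of chunk files that are aggregated at the end into the outtest.csv file. By changing maxLines you can trade off the number of chunk files produced against the RAM consumed (higher numbers consume more RAM but produce fewer chunk files). If you want to keep the CSV header as the first line, set keepHeader to True; if set to False, it reverses the entire file, including the first line.

For kicks, I ran this on an old Raspberry Pi using a 128GB flash drive on a 6MB csv test file, and I thought something had gone wrong because it returned almost immediately, so it's fast even on slower hardware. It imports only one standard python library function (remove), so it's very portable. One advantage of this code is that it does not reposition any file pointers. One limitation is that it will not work on CSV files that have newlines inside the data. For that use case, pandas would be the best solution for reading the chunks.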
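To make the embedded-newline limitation concrete, here is a small illustration (the sample data is made up). Any approach that treats every physical line as one record will split a quoted field containing a newline, whereas a CSV parser such as the standard csv module keeps it as a single record.

```python
import csv
import io

# Three records (header plus two rows); the first row has a quoted field
# that contains a newline.
data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

physical_lines = data.splitlines()
records = list(csv.reader(io.StringIO(data, newline='')))

print(len(physical_lines))  # 4
print(len(records))         # 3
print(records[1])           # ['1', 'first line\nsecond line']
```

Reversing the four physical lines would put the two halves of the quoted field in the wrong order, which is why a line-based reversal is only safe when no field contains a newline.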

    from os import remove

    def writechunk(fileCounter, reverseString):
        outFile = 'tmpfile' + str(fileCounter) + '.csv'
        with open(outFile, 'w') as outfp:
            outfp.write(reverseString)
        return

    def main():
        inFile = 'intest.csv'
        outFile = 'outtest.csv'
        # This is our chunk expressed in lines
        maxLines = 10
        # Is there a header line we want to keep at the top of the output file?
        keepHeader = True

        fileCounter = 0
        lineCounter = 0
        with open(inFile) as infp:
            reverseString = ''
            line = infp.readline()
            if (line and keepHeader):
                headerLine = line
                line = infp.readline()
            while (line):
                lineCounter += 1
                reverseString = line + reverseString
                if (lineCounter == maxLines):
                    fileCounter += 1
                    lineCounter = 0
                    writechunk(fileCounter, reverseString)
                    reverseString = ''
                line = infp.readline()
        # Write any leftovers to a chunk file
        if (lineCounter != 0):
            fileCounter += 1
            writechunk(fileCounter,reverseString)
        # Read the chunk files backwards and append each to the outFile
        with open(outFile, 'w') as outfp:
            if (keepHeader):
                outfp.write(headerLine)
            while (fileCounter > 0):
                chunkFile = 'tmpfile' + str(fileCounter) + '.csv'
                with open(chunkFile, 'r') as infp:
                    outfp.write(infp.read())
                remove(chunkFile)
                fileCounter -= 1

    if __name__ == '__main__':
        main()

jpp · Answer

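If you have sufficient hard disk space, you can read in chunks, reverse them, and store them. Then pick up the stored chunks in reverse order and write them to a new csv file.

Below is an example with Pandas which also uses pickle (for performance efficiency) and gzip (for storage efficiency).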

    import pandas as pd, numpy as np

    # create a dataframe for demonstration purposes
    df = pd.DataFrame(np.arange(5*9).reshape((-1, 5)))
    df.to_csv('file.csv', index=False)

    # number of rows we want to chunk by
    n = 3

    # iterate chunks, output to pickle files
    for idx, chunk in enumerate(pd.read_csv('file.csv', chunksize=n)):
        chunk.iloc[::-1].to_pickle(f'file_pkl_{idx:03}.pkl.gzip', compression='gzip')

    # open file in append mode and write chunks in reverse
    # idx stores the index of the last pickle file written
    with open('out.csv', 'a') as fout:
        for i in range(idx, -1, -1):
            chunk_pkl = pd.read_pickle(f'file_pkl_{i:03}.pkl.gzip', compression='gzip')
            chunk_pkl.to_csv(fout, index=False, header=False if i!=idx else True)

    # read new file to check results
    df_new = pd.read_csv('out.csv')

    print(df_new)

        0   1   2   3   4
    0  40  41  42  43  44
    1  35  36  37  38  39
    2  30  31  32  33  34
    3  25  26  27  28  29
    4  20  21  22  23  24
    5  15  16  17  18  19
    6  10  11  12  13  14
    7   5   6   7   8   9
    8   0   1   2   3   4

BernardL · Answer

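I would not recommend using pandas for parsing or streaming any files, as you are only introducing additional overhead. The best way to do it is to read the file from the bottom up. Now, a big part of this code actually comes from here: it takes a file and returns its lines in reverse through a generator, which I believe is what you want.

All I did was test it with your file train.csv from the provided link and write the results to a new file.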

    import os

    def reverse_readline(filename, buf_size=8192):
        """a generator that returns the lines of a file in reverse order"""
        with open(filename) as fh:
            segment = None
            offset = 0
            fh.seek(0, os.SEEK_END)
            file_size = remaining_size = fh.tell()
            while remaining_size > 0:
                offset = min(file_size, offset + buf_size)
                fh.seek(file_size - offset)
                buffer = fh.read(min(remaining_size, buf_size))
                remaining_size -= buf_size
                lines = buffer.split('\n')
                # the first line of the buffer is probably not a complete line so
                # we'll save it and append it to the last line of the next buffer
                # we read
                if segment is not None:
                    # if the previous chunk starts right from the beginning of line
                    # do not concat the segment to the last line of new chunk
                    # instead, yield the segment first
                    if buffer[-1] != '\n':
                        lines[-1] += segment
                    else:
                        yield segment
                segment = lines[0]
                for index in range(len(lines) - 1, 0, -1):
                    if lines[index]:
                        yield lines[index]
            # Don't yield None if the file was empty
            if segment is not None:
                yield segment

    reverse_gen = reverse_readline('train.csv')

    with open('rev_train.csv','w') as f:
        for row in reverse_gen:
            f.write('{}\n'.format(row))

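It basically reads the file in reverse until it finds a newline, then yields a line, working from the bottom of the file to the top. A pretty interesting way of doing it.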
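Note that reversing every line this way moves the header row to the bottom of the output. If the header should stay on top, one possible adjustment is sketched below. The helper write_reversed_keeping_header is hypothetical: it reads the header separately, writes it first, and then drops the final line produced by the reversed iterator (which is that same header). It assumes reversed_lines yields lines without trailing newlines, bottom to top, as a bottom-up reader like the one in this answer does.

```python
def write_reversed_keeping_header(src, dst, reversed_lines):
    """Write the header of src first, then every reversed line except the last."""
    with open(src) as fin:
        header = fin.readline().rstrip('\n')

    with open(dst, 'w') as fout:
        fout.write(header + '\n')
        previous = None
        for line in reversed_lines:
            # Write with a one-line delay so the final item (the header,
            # which comes last when reading bottom-up) is never written.
            if previous is not None:
                fout.write(previous + '\n')
            previous = line
```

With a bottom-up line generator this could be called as write_reversed_keeping_header('train.csv', 'rev_train.csv', reverse_readline('train.csv')); any other iterator that yields the lines in reverse would work the same way.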