Can I import a CSV file and automatically infer the delimiter?

Question

I want to import two kinds of CSV files: some use ";" as the delimiter and others use ",". So far I have been switching between the following two lines:

reader=csv.reader(f,delimiter=';')

or

reader=csv.reader(f,delimiter=',')

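Is it possible to leave the delimiter unspecified and have the program check which delimiter is the right one?

The solutions below (Blender and sharth) seem to work well for comma-separated files (generated with LibreOffice) but not for semicolon-separated files (generated with MS Office). Here are the first lines of one semicolon-separated file: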

ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document

rom · Accepted Answer

To solve this problem, I wrote a function that reads the first line of the file (the header) and detects the delimiter.

def detectDelimiter(csvFile):
    with open(csvFile, 'r') as myCsvfile:
        header = myCsvfile.readline()
        # a semicolon anywhere in the header wins over a comma
        if header.find(";") != -1:
            return ";"
        if header.find(",") != -1:
            return ","
    # default delimiter (MS Office export)
    return ";"

Bill Lynch · Answer

The csv module seems to recommend using the csv sniffer for this problem.

They give the following example, which I have adapted to your case.

import csv

with open('example.csv', 'rb') as csvfile:  # python 3: 'r', newline=""
    # guess the dialect from the first 1024 bytes, allowing only ; and ,
    dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=";,")
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    # ... process CSV file contents here ...

Let's try it out.

[9:13am][wlynch@watermelon /tmp] cat example
#!/usr/bin/env python
import csv

def parse(filename):
    with open(filename, 'rb') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)

        for line in reader:
            print line

def main():
    print 'Comma Version:'
    parse('comma_separated.csv')

    print
    print 'Semicolon Version:'
    parse('semicolon_separated.csv')

    print
    print 'An example from the question (kingdom.csv)'
    parse('kingdom.csv')

if __name__ == '__main__':
    main()

Sample input:

[9:13am][wlynch@watermelon /tmp] cat comma_separated.csv
test,box,foo
round,the,bend
[9:13am][wlynch@watermelon /tmp] cat semicolon_separated.csv
round;the;bend
who;are;you
[9:22am][wlynch@watermelon /tmp] cat kingdom.csv
ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document

And if we run the example program:

[9:14am][wlynch@watermelon /tmp] ./example
Comma Version:
['test', 'box', 'foo']
['round', 'the', 'bend']

Semicolon Version:
['round', 'the', 'bend']
['who', 'are', 'you']

An example from the question (kingdom.csv)
['ReleveAnnee', 'ReleveMois', 'NoOrdre', 'TitreRMC', 'AdopCSRegleVote', 'AdopCSAbs', 'AdoptCSContre', 'NoCELEX', 'ProposAnnee', 'ProposChrono', 'ProposOrigine', 'NoUniqueAnnee', 'NoUniqueType', 'NoUniqueChrono', 'PropoSplittee', 'Suite2LecturePE', 'Council PATH', 'Notes']
['1999', '1', '1', '1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC', 'U', '', '', '31999D0083', '1998', '577', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']
['1999', '1', '2', '1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes', 'U', '', '', '31999D0081', '1998', '184', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']

It is also worth noting which version of Python I am using.

[9:20am][wlynch@watermelon /tmp] python -V
Python 2.7.2
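For Python 3, the same idea might look roughly like the sketch below. The helper name open_reader and the fallback to the comma-based csv.excel dialect when the sniffer gives up are assumptions made for illustration; they are not part of the csv documentation example.

```python
import csv
import io


def open_reader(stream, candidates=";,", sample_size=1024):
    """Return a csv.reader for a text stream, sniffing the delimiter first.

    `open_reader` is a hypothetical helper, not part of the csv module.
    """
    sample = stream.read(sample_size)
    stream.seek(0)
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    except csv.Error:
        # Assumption: when the sniffer cannot decide, fall back to the
        # default "excel" dialect, which uses a comma.
        dialect = csv.excel
    return csv.reader(stream, dialect)


# io.StringIO stands in for a file opened with open(path, newline="")
rows = list(open_reader(io.StringIO("round;the;bend\nwho;are;you\n")))
for row in rows:
    print(row)
```

With a real file, pass the object returned by open(path, newline="") instead of the StringIO stand-in.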

Andrew Basile · Answer

Given a project that deals with both , (comma) and | (vertical bar) delimited CSV files, which are well formed, I tried the following (as given at https://docs.python.org/2/library/csv.html#csv.Sniffer ):

dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=',|')

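However, on a |-delimited file, this raised a "Could not determine delimiter" exception. It seemed reasonable to guess that the sniff heuristic works best when every line has the same number of delimiters (not counting any that are enclosed in quotes). So, instead of reading the first 1024 bytes of the file, I tried reading the first two lines in their entirety: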

temp_lines = csvfile.readline() + '\n' + csvfile.readline()
dialect = csv.Sniffer().sniff(temp_lines, delimiters=',|')

So far, this is working well for me.
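The whole-lines idea can be wrapped in a small helper. This is only a Python 3 sketch: the name sniff_from_lines, the n_lines parameter and the sample data are made up for illustration.

```python
import csv
import io
from itertools import islice


def sniff_from_lines(stream, candidates=",|", n_lines=2):
    """Sniff the dialect from the first n_lines complete lines of a text stream.

    Hypothetical helper: it feeds the sniffer whole lines instead of a
    fixed-size chunk that may end in the middle of a record.
    """
    # Iterating a text stream yields lines with their newline still attached.
    sample = "".join(islice(stream, n_lines))
    stream.seek(0)
    return csv.Sniffer().sniff(sample, delimiters=candidates)


# Made-up |-delimited data whose quoted fields contain commas
data = io.StringIO('id|name|note\n1|"Smith, John"|ok\n2|Jane|"a, b, c"\n')
dialect = sniff_from_lines(data)
rows = list(csv.reader(data, dialect))
print(dialect.delimiter)
print(rows)
```

Reading a fixed number of lines keeps the sample small for large files while still giving the sniffer complete records to compare.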

Vladir Parrado Cruz · Answer

If you are using DictReader, you can do the following:

#!/usr/bin/env python
import csv

def parse(filename):
    # Python 3: open in text mode; the sniffer needs str, not bytes
    with open(filename, 'r', newline='') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.DictReader(csvfile, dialect=dialect)

        for line in reader:
            print(line['ReleveAnnee'])

I used this with Python 3.5 and it worked this way.

twalberg · Answer

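I don't think there can be a perfectly general solution to this (one of the reasons I might use , as a delimiter is that some of my data fields need to be able to include ;...). A simple heuristic would be to read the first line (or more), count how many , and ; characters it contains (possibly ignoring those inside quotes, if whatever creates your .csv files quotes entries properly and consistently), and guess that the more frequent of the two is the right delimiter.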
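A minimal sketch of that counting heuristic in Python 3 follows. The function name guess_delimiter, the default candidates and the tie-breaking rule (the first candidate wins) are my own assumptions, and the quote handling is deliberately naive: it just toggles on every quote character.

```python
def guess_delimiter(line, candidates=(",", ";"), quotechar='"'):
    """Guess the delimiter by counting candidate characters outside quotes.

    Hypothetical helper: on a tie (including no candidates found at all),
    the first entry in `candidates` is returned.
    """
    counts = dict.fromkeys(candidates, 0)
    in_quotes = False
    for ch in line:
        if ch == quotechar:
            # Naive toggle; assumes quotes are balanced and used consistently
            in_quotes = not in_quotes
        elif not in_quotes and ch in counts:
            counts[ch] += 1
    return max(candidates, key=counts.get)


print(guess_delimiter('a;b;"x,y,z,w";c'))  # commas inside quotes are ignored
print(guess_delimiter("test,box,foo"))
```

As noted above, this is only a guess: an unquoted field containing many commas in a semicolon-delimited file can still tip the count the wrong way.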