Reading and writing CSV files containing Unicode with Python 2.7

Question

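I am new to Python, and I have a question about how to use Python to read and write CSV files. My file contains text in German, French, and other languages. With my code, the file can be read correctly in Python, but when I write it to a new CSV file, the Unicode characters turn into strange characters.

The data looks like this: [image: sample of the input data]

And my code is: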

    import csv

    f=open('xxx.csv','rb')
    reader=csv.reader(f)

    wt=open('lll.csv','wb')
    writer=csv.writer(wt,quoting=csv.QUOTE_ALL)

    wt.close()
    f.close()

And the result looks like this: [image: output file with garbled characters]

Could you please tell me what I should do to solve the problem? Thank you very much!

Oz123 · Answer

Another alternative:

Use the code from the unicodecsv package...

https://pypi.python.org/pypi/unicodecsv/

    >>> import unicodecsv as csv
    >>> from io import BytesIO
    >>> f = BytesIO()
    >>> w = csv.writer(f, encoding='utf-8')
    >>> _ = w.writerow((u'é', u'ñ'))
    >>> _ = f.seek(0)
    >>> r = csv.reader(f, encoding='utf-8')
    >>> next(r) == [u'é', u'ñ']
    True

This module is API compatible with the standard library csv module.

dawg · Answer

Make sure you encode and decode as appropriate.

This example round-trips some sample text through a UTF-8 encoded CSV file and back out again:

    # -*- coding: utf-8 -*-
    import csv

    tests={'German': [u'Straße',u'auslösen',u'zerstören'],
           'French': [u'français',u'américaine',u'épais'],
           'Chinese': [u'中國的',u'英語',u'美國人']}

    # Write: the Python 2 csv module wants byte strings, so encode each cell.
    with open('/tmp/utf.csv','w') as fout:
        writer=csv.writer(fout)
        writer.writerows([tests.keys()])
        for row in zip(*tests.values()):
            row=[s.encode('utf-8') for s in row]
            writer.writerows([row])

    # Read: decode each cell back to unicode before formatting it.
    with open('/tmp/utf.csv','r') as fin:
        reader=csv.reader(fin)
        for row in reader:
            temp=list(row)
            fmt=u'{:<15}'*len(temp)
            print fmt.format(*[s.decode('utf-8') for s in temp])

Prints:

    German         Chinese        French
    Straße         中國的            français
    auslösen       英語             américaine
    zerstören      美國人            épais

Mark Tolonen · Answer

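There is an example at the end of the csv module documentation that demonstrates how to deal with Unicode. The code below is copied directly from that example. Note that the strings read or written will be Unicode strings. Don't pass a byte string to UnicodeWriter.writerows, for example.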

    import csv,codecs,cStringIO

    class UTF8Recoder:
        def __init__(self, f, encoding):
            self.reader = codecs.getreader(encoding)(f)
        def __iter__(self):
            return self
        def next(self):
            return self.reader.next().encode("utf-8")

    class UnicodeReader:
        def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
            f = UTF8Recoder(f, encoding)
            self.reader = csv.reader(f, dialect=dialect, **kwds)
        def next(self):
            '''next() -> unicode
            This function reads and returns the next line as a Unicode string.
            '''
            row = self.reader.next()
            return [unicode(s, "utf-8") for s in row]
        def __iter__(self):
            return self

    class UnicodeWriter:
        def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
            self.queue = cStringIO.StringIO()
            self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
            self.stream = f
            self.encoder = codecs.getincrementalencoder(encoding)()
        def writerow(self, row):
            '''writerow(unicode) -> None
            This function takes a Unicode string and encodes it to the output.
            '''
            self.writer.writerow([s.encode("utf-8") for s in row])
            data = self.queue.getvalue()
            data = data.decode("utf-8")
            data = self.encoder.encode(data)
            self.stream.write(data)
            self.queue.truncate(0)

        def writerows(self, rows):
            for row in rows:
                self.writerow(row)

    with open('xxx.csv','rb') as fin, open('lll.csv','wb') as fout:
        reader = UnicodeReader(fin)
        writer = UnicodeWriter(fout,quoting=csv.QUOTE_ALL)
        for line in reader:
            writer.writerow(line)

Input (UTF-8 encoded):

    American,美国人
    French,法国人
    German,德国人

Output:

    "American","美国人"
    "French","法国人"
    "German","德国人"

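As a small illustration of the "utf-8-sig" encoding that the reader and writer classes above use by default, here is a Python 3 sketch of how the byte order mark (BOM) behaves. The sample strings are made up for the illustration, and the snippet is not part of the documentation example.

```python
import codecs

line = u"American,美国人\r\n"

# Encoding with "utf-8-sig" puts the UTF-8 BOM in front of the data.
encoded = line.encode("utf-8-sig")
starts_with_bom = encoded.startswith(codecs.BOM_UTF8)

# Decoding with "utf-8-sig" drops the BOM again; plain "utf-8" keeps it
# as a leading U+FEFF character in the first cell.
decoded_sig = encoded.decode("utf-8-sig")
decoded_plain = encoded.decode("utf-8")

# An incremental encoder emits the BOM only on its first call, so a writer
# that encodes row by row puts a single BOM at the start of the file.
encoder = codecs.getincrementalencoder("utf-8-sig")()
first = encoder.encode(u"a\r\n")
second = encoder.encode(u"b\r\n")
```

This is presumably why the writer keeps one incremental encoder for the whole file rather than calling encode("utf-8-sig") on every row.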
weaming · Answer

This happens because str in Python 2 is actually bytes. So if you want to write unicode to a CSV file, you must first encode the unicode to str using UTF-8.

    def py2_unicode_to_str(u):
        # unicode only exists in Python 2
        assert isinstance(u, unicode)
        return u.encode('utf-8')

Use class csv.DictWriter(csvfile, fieldnames, restval='', extrasaction='raise', dialect='excel', *args, **kwds):

py2
- The csvfile: open(fp, 'w')
- Pass keys and values as bytes encoded with UTF-8
  - writer.writerow({py2_unicode_to_str(k): py2_unicode_to_str(v) for k,v in row.items()})
py3
- The csvfile: open(fp, 'w')
- Pass a normal dict containing str as the row to writer.writerow(row)

Finally, the code:

    # -*- coding: utf-8 -*-
    import csv
    import sys

    is_py2 = sys.version_info[0] == 2

    def py2_unicode_to_str(u):
        # unicode only exists in Python 2
        assert isinstance(u, unicode)
        return u.encode('utf-8')

    with open('file.csv', 'w') as f:
        if is_py2:
            data = {u'Python中国': u'Python中国', u'Python中国2': u'Python中国2'}

            # just one more line to handle this
            data = {py2_unicode_to_str(k): py2_unicode_to_str(v) for k, v in data.items()}

            # data is a single row, so its keys are the field names
            fields = list(data)
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writerow(data)
        else:
            data = {'Python中国': 'Python中国', 'Python中国2': 'Python中国2'}

            # data is a single row, so its keys are the field names
            fields = list(data)
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writerow(data)

Conclusion

In Python 3, just use the (Unicode) str.

In Python 2, use unicode to handle text, and use str when I/O occurs.
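For comparison, a minimal Python 3 sketch of the full round trip might look like the following. It assumes Python 3 only; the scratch path and the sample rows are invented for the illustration and do not come from the question.

```python
import csv
import os
import tempfile

rows = [["German", "Straße"], ["French", "français"], ["Chinese", "美国人"]]

# Hypothetical scratch location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "unicode_demo.csv")

# Python 3: open the file in text mode, name the encoding explicitly, and
# pass newline='' so the csv module controls the line endings itself.
with open(path, "w", encoding="utf-8", newline="") as fout:
    csv.writer(fout, quoting=csv.QUOTE_ALL).writerows(rows)

# Reading mirrors writing: the cells come back as ordinary str objects.
with open(path, "r", encoding="utf-8", newline="") as fin:
    read_back = list(csv.reader(fin))
```

Naming the encoding explicitly avoids depending on the platform default, which is not UTF-8 everywhere.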

Joe S · Answer

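I couldn't comment on Mark's answer above, but I made one modification that fixes the error raised when the data in a cell is not unicode, e.g. float or int data. I replaced one line in the UnicodeWriter class with "self.writer.writerow([s.encode("utf-8") if type(s)==types.UnicodeType else s for s in row])" so that it became: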

    class UnicodeWriter:
        def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
            self.queue = cStringIO.StringIO()
            self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
            self.stream = f
            self.encoder = codecs.getincrementalencoder(encoding)()
        def writerow(self, row):
            '''writerow(unicode) -> None
            This function takes a Unicode string and encodes it to the output.
            '''
            # Only unicode cells are encoded; ints, floats, etc. pass through.
            self.writer.writerow([s.encode("utf-8") if type(s)==types.UnicodeType else s for s in row])
            data = self.queue.getvalue()
            data = data.decode("utf-8")
            data = self.encoder.encode(data)
            self.stream.write(data)
            self.queue.truncate(0)

        def writerows(self, rows):
            for row in rows:
                self.writerow(row)

You will also need to "import types".
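The same per-cell check can be pulled out into a small helper. This is only a sketch: encode_cell and text_type are hypothetical names that do not appear in the answers above, and it uses isinstance rather than comparing against types.UnicodeType so that it also runs on Python 3.

```python
import sys

# The text type is unicode on Python 2 and str on Python 3. Only the branch
# that applies is evaluated, so the name unicode is never looked up on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def encode_cell(value):
    """Encode text cells to UTF-8 bytes; return any other value unchanged."""
    if isinstance(value, text_type):
        return value.encode("utf-8")
    return value


encoded_row = [encode_cell(cell) for cell in [u"Stra\xdfe", 3, 2.5]]
```

Numbers pass through untouched, and csv.writer converts them to text on its own when it writes the row.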

tozCSS · Answer

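I had the very same issue. The answer is that you are already doing it right. It is a problem with MS Excel. Try opening the file with another editor and you will see that the encoding already worked. To make MS Excel happy, move from UTF-8 to UTF-16. This should work: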

    import codecs, csv, StringIO

    class UnicodeWriter:
        def __init__(self, f, dialect=csv.excel_tab, encoding="utf-16", **kwds):
            # Redirect output to a queue
            self.queue = StringIO.StringIO()
            self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
            self.stream = f

            # Force BOM
            if encoding=="utf-16":
                f.write(codecs.BOM_UTF16)

            self.encoding = encoding

        def writerow(self, row):
            # Modified from original: now using unicode(s) to deal with e.g. ints
            self.writer.writerow([unicode(s).encode("utf-8") for s in row])
            # Fetch UTF-8 output from the queue ...
            data = self.queue.getvalue()
            data = data.decode("utf-8")
            # ... and reencode it into the target encoding
            data = data.encode(self.encoding)

            # strip BOM
            if self.encoding == "utf-16":
                data = data[2:]

            # write to the target stream
            self.stream.write(data)
            # empty queue
            self.queue.truncate(0)

        def writerows(self, rows):
            for row in rows:
                self.writerow(row)
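The "strip BOM" step above is needed because a one-shot encode with "utf-16" prepends a BOM on every call. The following Python 3 sketch illustrates that, and shows that a text-mode file opened with encoding="utf-16" writes the BOM only once. The scratch path and sample rows are assumptions made for the illustration.

```python
import codecs
import csv
import os
import tempfile

# A one-shot encode with "utf-16" starts with the native-order BOM every time,
# which is why a writer that encodes row by row has to strip it.
chunk = u"a\tb\r\n".encode("utf-16")
chunk_has_bom = chunk.startswith(codecs.BOM_UTF16)

# Hypothetical scratch location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "excel_demo.csv")

# Python 3: a text-mode file with encoding="utf-16" emits one BOM at the start.
with open(path, "w", encoding="utf-16", newline="") as fout:
    writer = csv.writer(fout, dialect=csv.excel_tab)
    writer.writerows([["German", "Straße"], ["Chinese", "美国人"]])

with open(path, "rb") as fin:
    raw = fin.read()
```

Decoding raw with "utf-16" yields the tab-separated text with the BOM removed.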