What is the difference between dtype and converters in pandas.read_csv?

Question

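The pandas function read_csv() reads a .csv file. Its documentation is here.

According to the documentation, we know:

dtype : Type name or dict of column -> type, default None. Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (Unsupported with engine='python')

and

converters : dict, default None. Dict of functions for converting values in certain columns. Keys can either be integers or column labels

When using this function, I can call either pandas.read_csv('file', dtype=object) or pandas.read_csv('file', converters=object). The name converters makes it obvious that the data type will be converted, but what happens in the case of dtype?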

EdChum · Accepted Answer

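The semantic difference is that dtype lets you specify how to treat the values, for example as a numeric or a string type.

converters lets you parse the input data with a conversion function and turn it into the desired dtype, for example parsing a string value into a datetime or some other desired dtype.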
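A minimal sketch of that distinction, assuming a small in-memory CSV. The column names `code` and `price` and the helper `parse_price` are made up for illustration and are not part of the question. dtype only declares the target type, while a converter runs arbitrary Python on each raw cell string.

```python
import io
import pandas as pd

# Hypothetical sample data, not taken from the question.
csv_text = "code,price\n007,$1.50\n042,$12.00\n"

def parse_price(raw):
    # A converter receives each cell as a raw string and returns the parsed value.
    return float(raw.lstrip("$"))

# dtype: declare how the values should be treated (keep 'code' as strings).
by_dtype = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})

# converters: run a function on every cell of 'price'.
by_converter = pd.read_csv(io.StringIO(csv_text), converters={"price": parse_price})

print(by_dtype["code"].tolist())
print(by_converter["price"].tolist())
```

No dtype could express the `$` stripping, which is the kind of case where a converter is needed.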

Here we see that pandas tries to sniff the types:

    In [2]:
    import io
    import pandas as pd

    t = """int,float,date,str
    001,3.31,2015/01/01,005"""
    df = pd.read_csv(io.StringIO(t))
    df.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1 entries, 0 to 0
    Data columns (total 4 columns):
    int      1 non-null int64
    float    1 non-null float64
    date     1 non-null object
    str      1 non-null int64
    dtypes: float64(1), int64(2), object(1)
    memory usage: 40.0+ bytes

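From the above you can see that 001 and 005 are treated as int64, while the date string is left as a string (object dtype).

If we say everything is object, then essentially everything is str: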

    In [3]:
    pd.read_csv(io.StringIO(t), dtype=object).info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1 entries, 0 to 0
    Data columns (total 4 columns):
    int      1 non-null object
    float    1 non-null object
    date     1 non-null object
    str      1 non-null object
    dtypes: object(4)
    memory usage: 40.0+ bytes
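To make the practical effect concrete, here is a small sketch that reuses the same sample string `t` and inspects the actual cell values rather than the summary from info(). The variable names `inferred` and `as_object` are only illustrative.

```python
import io
import pandas as pd

t = """int,float,date,str
001,3.31,2015/01/01,005"""

inferred = pd.read_csv(io.StringIO(t))
as_object = pd.read_csv(io.StringIO(t), dtype=object)

# With type inference, '001' is parsed as the integer 1, so the leading zeros are lost.
print(inferred.loc[0, "int"])

# With dtype=object, the raw text is kept as a Python str.
print(as_object.loc[0, "int"], type(as_object.loc[0, "int"]))
```

This is why dtype=object (or dtype=str) is a common choice for identifier-like columns such as zip codes or account numbers.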

Here we force the int column to str and use parse_dates so that the date column is parsed by the date parser:

    In [6]:
    pd.read_csv(io.StringIO(t), dtype={'int':'object'}, parse_dates=['date']).info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1 entries, 0 to 0
    Data columns (total 4 columns):
    int      1 non-null object
    float    1 non-null float64
    date     1 non-null datetime64[ns]
    str      1 non-null int64
    dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
    memory usage: 40.0+ bytes

Similarly, we could have passed the to_datetime function to convert the dates:

    In [5]:
    pd.read_csv(io.StringIO(t), converters={'date':pd.to_datetime}).info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1 entries, 0 to 0
    Data columns (total 4 columns):
    int      1 non-null int64
    float    1 non-null float64
    date     1 non-null datetime64[ns]
    str      1 non-null int64
    dtypes: datetime64[ns](1), float64(1), int64(2)
    memory usage: 40.0 bytes
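One related case is what happens when both dtype and converters are given for the same column. The sketch below rests on the pandas documentation's note that converters are applied instead of dtype conversion, and on my understanding that pandas emits a warning in that situation. The zero-padding lambda is purely illustrative.

```python
import io
import warnings
import pandas as pd

t = """int,float,date,str
001,3.31,2015/01/01,005"""

with warnings.catch_warnings():
    # pandas is expected to warn that only the converter will be used.
    warnings.simplefilter("ignore")
    df = pd.read_csv(
        io.StringIO(t),
        dtype={"str": "float64"},                  # ignored for this column
        converters={"str": lambda s: s.zfill(5)},  # hypothetical converter
    )

# The converter's result is kept as a string, not cast to float64.
print(df.loc[0, "str"])
```

In practice it is simpler to pick one mechanism per column: dtype when a plain cast is enough, converters when the raw text needs custom parsing.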