Pandas pd.cut() - binning a datetime column / Series

Question

I'm trying to do some binning with pd.cut(), but my case is fairly involved.

A colleague sends me multiple files, each with a report date range such as:

_ '03-16-2017 to 03-22-2017' '03-23-2017 to 03-29-2017' '03-30-2017 to 04-05-2017' _

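These are all combined into a single DataFrame, and the range is stored in a column named df['Filedate'], so every record from a file carries the correct file date.

The last day of each range is the cutoff point, so I created a new column df['Filedate_bin'] that holds the last day as a string: 3/22/2017, 3/29/2017, 4/05/2017.

Then I created a list: Filedate_bin_list = df.Filedate_bin.unique(). As a result, I have a unique list of cutoff dates, as strings, that I would like to use as bins.

When I import other data into the DataFrame, there is a column of transaction dates: 3/28/2017, 3/29/2017, 3/30/2017, 4/1/2017, 4/2/2017, and so on. Assigning them to bins is proving difficult. I tried: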

_df['bin'] = pd.cut(df.Processed_date, Filedate_bin_list) _

I received _TypeError: unsupported operand type for -: 'str' and 'str'_

I went back and tried converting Filedate_bin to datetime with format='%m/%d/%Y', and got:

TypeError: Cannot cast ufunc less input from dtype('<m8[ns]') to dtype ('<m8') with casting rule 'same_kind'.

Is there a better way to bin my processed_date(s) into text bins?

I'm trying to tie a processed date such as 3/27/2017 to '03-23-2017 to 03-29-2017'.

MaxU · Accepted Answer

UPDATE: starting from Pandas v0.20.1 (May 5, 2017), _pd.cut_ and _pd.qcut_ support datetime64 and timedelta64 dtypes (GH14714, GH14798).

Thanks @lighthouse65 for checking this!
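As a rough sketch of what the update allows, the question's data could be binned directly on datetime values, assuming pandas 0.20.1 or later. The sample rows, the extra lower edge 03-15-2017 (one day before the first range starts), and the use of the file date strings as labels are assumptions made for illustration, not part of the original post.

```python
import pandas as pd

# File date ranges from the question; they double as the bin labels.
file_dates = [
    "03-16-2017 to 03-22-2017",
    "03-23-2017 to 03-29-2017",
    "03-30-2017 to 04-05-2017",
]

# Bin edges: an assumed lower edge (the day before the first range starts)
# followed by the cutoff date of each range.
edges = pd.to_datetime(
    ["03-15-2017", "03-22-2017", "03-29-2017", "04-05-2017"],
    format="%m-%d-%Y",
)

# Hypothetical transaction dates, parsed from strings into datetime64.
df = pd.DataFrame({
    "Processed_date": pd.to_datetime(
        ["3/28/2017", "3/29/2017", "3/30/2017", "4/1/2017", "4/2/2017"],
        format="%m/%d/%Y",
    )
})

# Intervals are right-closed by default, so a cutoff date falls in its own range.
df["bin"] = pd.cut(df["Processed_date"], bins=edges, labels=file_dates)
print(df)
```

Both the values and the edges are parsed to datetime first, so neither of the two TypeErrors from the question should come up with this setup.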

Old answer:

Consider this approach:

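_import numpy as np
import pandas as pd
# sample data: 15 consecutive days
df = pd.DataFrame(pd.date_range('2000-01-02', freq='1D', periods=15), columns=['Date'])
# bin edges, 3 days apart
bins_dt = pd.date_range('2000-01-01', freq='3D', periods=6)
bins_str = bins_dt.astype(str).values
# one '(left, right]' label per pair of adjacent edges
labels = ['({}, {}]'.format(bins_str[i-1], bins_str[i]) for i in range(1, len(bins_str))]
# cut on integer epoch seconds instead of datetimes
df['cat'] = pd.cut(df.Date.astype(np.int64)//10**9,
                   bins=bins_dt.astype(np.int64)//10**9,
                   labels=labels) _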

Result:

_In [59]: df
Out[59]:
         Date                       cat
0  2000-01-02  (2000-01-01, 2000-01-04]
1  2000-01-03  (2000-01-01, 2000-01-04]
2  2000-01-04  (2000-01-01, 2000-01-04]
3  2000-01-05  (2000-01-04, 2000-01-07]
4  2000-01-06  (2000-01-04, 2000-01-07]
5  2000-01-07  (2000-01-04, 2000-01-07]
6  2000-01-08  (2000-01-07, 2000-01-10]
7  2000-01-09  (2000-01-07, 2000-01-10]
8  2000-01-10  (2000-01-07, 2000-01-10]
9  2000-01-11  (2000-01-10, 2000-01-13]
10 2000-01-12  (2000-01-10, 2000-01-13]
11 2000-01-13  (2000-01-10, 2000-01-13]
12 2000-01-14  (2000-01-13, 2000-01-16]
13 2000-01-15  (2000-01-13, 2000-01-16]
14 2000-01-16  (2000-01-13, 2000-01-16]
In [60]: df.dtypes
Out[60]:
Date    datetime64[ns]
cat           category
dtype: object _

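Explanation:

_df.Date.astype(np.int64)//10**9_ converts the datetime values to UNIX epoch timestamps. The cast to int64 yields nanoseconds since _1970-01-01 00:00:00_, and the integer division by 10**9 reduces that to seconds: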

_In [65]: df.Date.astype(np.int64)//10**9
Out[65]:
0     946771200
1     946857600
2     946944000
3     947030400
4     947116800
5     947203200
6     947289600
7     947376000
8     947462400
9     947548800
10    947635200
11    947721600
12    947808000
13    947894400
14    947980800
Name: Date, dtype: int64 _
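The conversion can be cross-checked with a small sketch like the one below. It swaps in a different technique from the answer's cast: it subtracts the epoch and divides by a one-second Timedelta, which does not depend on the column's internal resolution, and then converts back with _pd.to_datetime(..., unit='s')_. The variable names are illustrative only.

```python
import pandas as pd

# First three dates of the answer's sample data.
dates = pd.Series(pd.date_range("2000-01-02", freq="1D", periods=3))

# Seconds since the UNIX epoch, computed without relying on the
# underlying integer representation of datetime64.
epoch_seconds = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
print(epoch_seconds.tolist())

# Converting the integers back should reproduce the original dates.
round_trip = pd.to_datetime(epoch_seconds, unit="s")
print((round_trip == dates).all())
```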

The same applies to the bins:

_In [66]: bins_dt.astype(np.int64)//10**9
Out[66]: Int64Index([946684800, 946944000, 947203200, 947462400, 947721600, 947980800], dtype='int64') _

Labels:

_In [67]: labels
Out[67]:
['(2000-01-01, 2000-01-04]',
 '(2000-01-04, 2000-01-07]',
 '(2000-01-07, 2000-01-10]',
 '(2000-01-10, 2000-01-13]',
 '(2000-01-13, 2000-01-16]'] _
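For readers who find the index arithmetic in the label list hard to follow, here is an equivalent sketch that pairs each edge with its successor using _zip_. Using _strftime_ in place of _astype(str)_ is a substitution made here so that the date format is explicit; it is not what the answer's code does.

```python
import pandas as pd

# Same bin edges as in the answer: six edges, three days apart.
bins_dt = pd.date_range("2000-01-01", freq="3D", periods=6)

# Format each edge as YYYY-MM-DD text.
bins_str = bins_dt.strftime("%Y-%m-%d")

# N edges give N - 1 intervals: pair every edge with the one that follows it.
labels = ["({}, {}]".format(lo, hi) for lo, hi in zip(bins_str[:-1], bins_str[1:])]
print(labels)
```

The "(" and "]" in each label mirror the default behaviour of _pd.cut_, where each interval excludes its left edge and includes its right edge.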