Efficiently replace values in one column with values from another column in a Pandas DataFrame

Question

I have a Pandas DataFrame like this:

       col1  col2  col3
    1   0.2   0.3   0.3
    2   0.2   0.3   0.3
    3   0     0.4   0.4
    4   0     0     0.3
    5   0     0     0
    6   0.1   0.4   0.4

I want to replace the values in col1 with the values from the second column (col2), but only where col1 is equal to 0. After that, for any zeros that remain, I want to do the same again with the third column (col3). The desired result is:

       col1  col2  col3
    1   0.2   0.3   0.3
    2   0.2   0.3   0.3
    3   0.4   0.4   0.4
    4   0.3   0     0.3
    5   0     0     0
    6   0.1   0.4   0.4

I did this with the pd.replace function, but it seems too slow:

    df.col1.replace(0,df.col2,inplace=True)
    df.col1.replace(0,df.col3,inplace=True)

Is there a faster way to do this, using some other function instead of pd.replace?

root · Accepted Answer

np.where is faster. It follows the same pattern you used with replace:

    df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
    df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])

However, a nested np.where is slightly faster:

    df['col1'] = np.where(df['col1'] == 0,
                          np.where(df['col2'] == 0, df['col3'], df['col2']),
                          df['col1'])
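For readers who want to run the nested version end to end, here is a minimal self-contained sketch. The sample frame is rebuilt by hand from the table in the question, so treat the construction of `df` as an assumption about how the original data was created.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question's table (index 1..6).
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)

# Where col1 is 0, fall back to col2; where col2 is also 0, fall back to col3.
df["col1"] = np.where(
    df["col1"] == 0,
    np.where(df["col2"] == 0, df["col3"], df["col2"]),
    df["col1"],
)

print(df)
```

Row 5 stays at 0 because all three columns are 0 there, which matches the desired result in the question.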

Timings

Using the following setup to create a larger sample DataFrame and the timing functions:

    df = pd.concat([df]*10**4, ignore_index=True)

    def root_nested(df):
        df['col1'] = np.where(df['col1'] == 0, np.where(df['col2'] == 0, df['col3'], df['col2']), df['col1'])
        return df

    def root_split(df):
        df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
        df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
        return df

    def pir2(df):
        df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
        return df

    def pir2_2(df):
        slc = (df.values != 0).argmax(axis=1)
        return df.values[np.arange(slc.shape[0]), slc]

    def andrew(df):
        df.col1[df.col1 == 0] = df.col2
        df.col1[df.col1 == 0] = df.col3
        return df

    def pablo(df):
        df['col1'] = df['col1'].replace(0,df['col2'])
        df['col1'] = df['col1'].replace(0,df['col3'])
        return df

I get the following timings:

    %timeit root_nested(df.copy())
    100 loops, best of 3: 2.25 ms per loop

    %timeit root_split(df.copy())
    100 loops, best of 3: 2.62 ms per loop

    %timeit pir2(df.copy())
    100 loops, best of 3: 6.25 ms per loop

    %timeit pir2_2(df.copy())
    1 loop, best of 3: 2.4 ms per loop

    %timeit andrew(df.copy())
    100 loops, best of 3: 8.55 ms per loop

I tried timing your method (pablo), but it ran for several minutes without finishing. For comparison, timing your method on just the 6-row example DataFrame (far smaller than the one tested above) took 12.8 ms.
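The `%timeit` lines are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard-library `timeit` module. This is only an illustration for the two np.where variants: the names `base` and `big` are introduced here, and absolute numbers will differ by machine and by pandas/NumPy version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Six-row sample from the question, repeated 10**4 times as in the setup.
base = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)
big = pd.concat([base] * 10**4, ignore_index=True)


def root_nested(df):
    df["col1"] = np.where(
        df["col1"] == 0,
        np.where(df["col2"] == 0, df["col3"], df["col2"]),
        df["col1"],
    )
    return df


def root_split(df):
    df["col1"] = np.where(df["col1"] == 0, df["col2"], df["col1"])
    df["col1"] = np.where(df["col1"] == 0, df["col3"], df["col1"])
    return df


# Sanity check: both variants should produce the same col1.
same = root_nested(big.copy())["col1"].equals(root_split(big.copy())["col1"])
print("results match:", same)

# Best of 3 repeats, 10 calls each; each call works on a fresh copy.
for func in (root_nested, root_split):
    best = min(timeit.repeat(lambda: func(big.copy()), number=10, repeat=3))
    print(f"{func.__name__}: {best / 10 * 1e3:.2f} ms per call")
```

As in the original timings, the cost of `df.copy()` is included in every measurement, so the figures are best read as relative rather than absolute.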

Andrew · Answer

I'm not sure whether it's faster, but you're right that you can slice the DataFrame to get the desired result.

    df.col1[df.col1 == 0] = df.col2
    df.col1[df.col1 == 0] = df.col3
    print(df)

Output:

       col1  col2  col3
    0   0.2   0.3   0.3
    1   0.2   0.3   0.3
    2   0.4   0.4   0.4
    3   0.3   0.0   0.3
    4   0.0   0.0   0.0
    5   0.1   0.4   0.4

Alternatively, if you want something more concise (again, I don't know whether it's faster), you can combine what you did with what I did.

    df.col1[df.col1 == 0] = df.col2.replace(0, df.col3)
    print(df)

Output:

       col1  col2  col3
    0   0.2   0.3   0.3
    1   0.2   0.3   0.3
    2   0.4   0.4   0.4
    3   0.3   0.0   0.3
    4   0.0   0.0   0.0
    5   0.1   0.4   0.4
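A side note that is not part of the original answer: `df.col1[mask] = ...` is chained assignment, which pandas may flag with a SettingWithCopyWarning and which is not expected to update `df` when Copy-on-Write is enabled. As a sketch of the same two steps written with `.loc`, assuming the sample frame from the question:

```python
import pandas as pd

# Sample data rebuilt from the question (default 0..5 index, as in the output above).
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

# Step 1: rows where col1 is 0 take their value from col2.
mask = df["col1"] == 0
df.loc[mask, "col1"] = df.loc[mask, "col2"]

# Step 2: rows that are still 0 take their value from col3.
mask = df["col1"] == 0
df.loc[mask, "col1"] = df.loc[mask, "col3"]

print(df)
```

The mask is recomputed before the second step so that only the zeros left over after the first step are touched.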

piRSquared · Answer

An approach using pd.DataFrame.where and pd.DataFrame.bfill:

    df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
    df
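To make the one-liner easier to follow, here is the same chain broken into steps. This is only an unpacking of the expression above; the intermediate names `masked` and `filled` are introduced for illustration, and the sample frame is rebuilt from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

# 1. Keep non-zero cells, turn every zero (in any column) into NaN.
masked = df.where(df.ne(0), np.nan)

# 2. Back-fill along each row, so a NaN takes the next non-NaN value to its right.
filled = masked.bfill(axis=1)

# 3. col1 now holds the first non-zero value of each row; rows that were
#    all zeros are still NaN, so put the 0 back.
df["col1"] = filled["col1"].fillna(0)

print(df)
```

Because zeros are masked in every column, this relies on 0 meaning "missing" throughout the frame, which holds for the example data.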

Another approach using np.argmax:

    def pir2(df):
        slc = (df.values != 0).argmax(axis=1)
        return df.values[np.arange(slc.shape[0]), slc]

I know there is a better way to slice with numpy. I just can't think of it at the moment.
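One candidate for that slicing step is `np.take_along_axis`, which picks one element per row given a column position per row. This is offered as a possible alternative, not necessarily what the author had in mind, and no claim is made here about whether it is faster; the names `values`, `first_nonzero`, and `picked` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

values = df.to_numpy()

# Column position of the first non-zero entry in each row.
# For a row of all zeros argmax returns 0, so that row keeps its 0.
first_nonzero = (values != 0).argmax(axis=1)

# Pick one value per row at that column position.
picked = np.take_along_axis(values, first_nonzero[:, None], axis=1).ravel()

df["col1"] = picked
print(df)
```

Like the function above, this returns the first non-zero value in each row, so it depends on the columns being ordered col1, col2, col3.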