pandas: How can I make apply on a DataFrame faster?

Question

Consider this pandas example, where I calculate column C by multiplying A with B and a float when a certain condition is met, using apply with a lambda function:

    import pandas as pd

    df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                       'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})

    df['C'] = df.apply(lambda x: x.A if x.B > 5 else 0.1*x.A*x.B, axis=1)

The expected result is:

       A  B    C
    0  1  9  1.0
    1  2  8  2.0
    2  3  7  3.0
    3  4  6  4.0
    4  5  5  2.5
    5  6  4  2.4
    6  7  3  2.1
    7  8  2  1.6
    8  9  1  0.9

The problem is that this code is slow, and I need to run this operation on a dataframe with around 56 million rows.

The %timeit result of the lambda operation above is:

1000 loops, best of 3: 1.63 ms per loop

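Judging by the calculation time and the memory usage when doing this on my large dataframe, I presume that this operation uses intermediate Series while doing the calculations.

I tried to formulate it in different ways, including using temporary columns, but every alternative solution I came up with was even slower.

Is there a way to get the result I need in a different and faster way, e.g. by using numpy?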

Divakar · Accepted Answer

For performance, you may be better off working with NumPy arrays and using np.where:

    import numpy as np

    a = df.values  # Assuming you have two columns A and B
    df['C'] = np.where(a[:,1]>5, a[:,0], 0.1*a[:,0]*a[:,1])
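As a quick sanity check, the sketch below compares the np.where approach with the row-wise apply from the question on the same nine-row frame. The names `expected`, `result` and `same` are illustrative only, and `to_numpy()` assumes pandas 0.24 or later.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})

# Row-wise version from the question: one Python-level call per row.
expected = df.apply(lambda x: x.A if x.B > 5 else 0.1 * x.A * x.B, axis=1)

# Vectorized version: both branches are computed for the whole column,
# then np.where selects element-wise according to the condition.
a = df[['A', 'B']].to_numpy()
result = np.where(a[:, 1] > 5, a[:, 0], 0.1 * a[:, 0] * a[:, 1])

# Compare with a tolerance, since the values are floating point.
same = np.allclose(result, expected.to_numpy())
print(same)
```

Note that np.where evaluates both candidate arrays in full before selecting, so it allocates temporaries of the full column length.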

Runtime test

    def numpy_based(df):
        a = df.values  # Assuming you have two columns A and B
        df['C'] = np.where(a[:,1]>5, a[:,0], 0.1*a[:,0]*a[:,1])

Timings:

    In [271]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

    In [272]: %timeit numpy_based(df)
    1000 loops, best of 3: 380 µs per loop

    In [273]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

    In [274]: %timeit df['C'] = df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
    100 loops, best of 3: 3.39 ms per loop

    In [275]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

    In [276]: %timeit df['C'] = np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
    1000 loops, best of 3: 1.12 ms per loop

    In [277]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

    In [278]: %timeit df['C'] = np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
    1000 loops, best of 3: 1.19 ms per loop

A closer look

Let's take a closer look at NumPy's number-crunching capability and compare it with pandas:

    # Extract out as array (it's a view, so not really expensive
    # .. as compared to the later computations themselves)
    In [291]: a = df.values

    In [296]: %timeit df.values
    10000 loops, best of 3: 107 µs per loop
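The extraction above indexes the array by position, so it relies on the frame holding exactly the columns A and B in that order. A variant that selects each column by name might look like the sketch below. The extra `label` column is hypothetical and only there to show that other columns do not interfere, and `to_numpy()` assumes pandas 0.24 or later.

```python
import numpy as np
import pandas as pd

# Column order differs from the question and an extra column is present.
df = pd.DataFrame({'B': [9, 8, 7, 6, 5, 4, 3, 2, 1],
                   'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'label': list('abcdefghi')})  # hypothetical extra column

# Pull each column out by name instead of by position in df.values.
A = df['A'].to_numpy()
B = df['B'].to_numpy()

df['C'] = np.where(B > 5, A, 0.1 * A * B)
print(df)
```

Selecting by name also avoids converting a mixed-dtype frame into a single object array, which is what `df.values` would produce once a string column is present.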

Case #1: Work with a NumPy array and use numpy.where:

    In [292]: %timeit np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
    10000 loops, best of 3: 86.5 µs per loop

Again, assigning the result to a new column, df['C'], is not very expensive either:

    In [300]: %timeit df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
    1000 loops, best of 3: 323 µs per loop

Case #2: Work with the pandas dataframe and use its .where method (no NumPy):

    In [293]: %timeit df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
    100 loops, best of 3: 3.4 ms per loop

Case #3: Work with the pandas dataframe (no NumPy array), but use numpy.where:

    In [294]: %timeit np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
    1000 loops, best of 3: 764 µs per loop

Case #4: Work with the pandas dataframe again (no NumPy array), but use numpy.where:

    In [295]: %timeit np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
    1000 loops, best of 3: 830 µs per loop
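The four cases above can also be reproduced outside IPython with the standard `timeit` module. The following is a rough sketch only: the function names, the seed, and the repeat counts are arbitrary choices, and absolute timings will differ by machine and library version, so none are shown here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
df = pd.DataFrame(rng.integers(0, 9, size=(10000, 2)), columns=['A', 'B'])


def case1(df):
    # NumPy array + numpy.where
    a = df[['A', 'B']].to_numpy()
    return np.where(a[:, 1] > 5, a[:, 0], 0.1 * a[:, 0] * a[:, 1])


def case2(df):
    # pandas only: Series.where
    other = df[['A', 'B']].prod(axis=1).mul(.1)
    return df.A.where(df.B.gt(5), other).to_numpy()


def case3(df):
    # pandas Series passed to numpy.where, arithmetic operators
    return np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])


def case4(df):
    # pandas Series passed to numpy.where, .mul methods
    return np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))


cases = {'case1': case1, 'case2': case2, 'case3': case3, 'case4': case4}

# All four variants should agree up to floating-point rounding.
results = {name: fn(df) for name, fn in cases.items()}

# Best-of-3 average time per call, in seconds.
n = 10
timings = {name: min(timeit.repeat(lambda fn=fn: fn(df), number=n, repeat=3)) / n
           for name, fn in cases.items()}

for name, seconds in timings.items():
    print(f'{name}: {seconds * 1e6:.0f} µs per call')
```

Taking the minimum over repeats mirrors what `%timeit` reports as "best of 3".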

piRSquared · Answer

Pure pandas
Using pd.Series.where

    df['C'] = df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))

       A  B    C
    0  1  9  1.0
    1  2  8  2.0
    2  3  7  3.0
    3  4  6  4.0
    4  5  5  2.5
    5  6  4  2.4
    6  7  3  2.1
    7  8  2  1.6
    8  9  1  0.9
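For readers comparing this with the numpy.where answers: `Series.where(cond, other)` keeps the original value where `cond` is True and takes `other` elsewhere, while `np.where(cond, x, y)` takes both branches as explicit arguments. A small illustration on a made-up Series (the values and names below are not from the question):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4])
cond = s > 2

# Series.where: keep s where cond is True, use -1 elsewhere.
# The result is a Series that retains the original index.
kept = s.where(cond, -1)

# np.where: pick from s where cond is True, else -1.
# The result is a plain NumPy array without an index.
picked = np.where(cond, s, -1)

print(kept.tolist())    # [-1, -1, 3, 4]
print(picked.tolist())  # [-1, -1, 3, 4]
```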

IanS · Answer

Using numpy.where:

    import numpy

    df['C'] = numpy.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])

jezrael · Answer

Use:

    df['C'] = np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
    print (df)
       A  B    C
    0  1  9  1.0
    1  2  8  2.0
    2  3  7  3.0
    3  4  6  4.0
    4  5  5  2.5
    5  6  4  2.4
    6  7  3  2.1
    7  8  2  1.6
    8  9  1  0.9