Deprecated rolling window option in OLS from Pandas to Statsmodels

Question

As the title suggests, where has the rolling function option of the ols command in Pandas migrated to in statsmodels? I can't seem to find it. Pandas tells me doom is in the works:

    FutureWarning: The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html
      model = pd.ols(y=series_1, x=mmmm, window=50)

In fact, if you do something like:

    import statsmodels.api as sm

    model = sm.OLS(series_1, mmmm, window=50).fit()
    print(model.summary())

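you get results (the window argument does not stop the code from running), but you only get the parameters of the regression run on the entire period, not the series of parameters for each of the rolling periods it is supposed to work on.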

Brad Solomon · Accepted Answer

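I created an ols module designed to mimic pandas' deprecated MovingOLS; it is here.

It has three core classes:

OLS: static (single-window) ordinary least-squares regression. The output is NumPy arrays.
RollingOLS: rolling (multi-window) ordinary least-squares regression. The output is higher-dimension NumPy arrays.
PandasRollingOLS: wraps the results of RollingOLS in pandas Series and DataFrames. Designed to mimic the look of the deprecated pandas module.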

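Note that the module is part of a package (which I'm currently in the process of uploading to PyPi) and it requires one inter-package import.

The first two classes above are implemented entirely in NumPy and primarily use matrix algebra. RollingOLS also makes extensive use of broadcasting. The attributes largely mimic statsmodels' OLS RegressionResultsWrapper.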
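To illustrate the general idea of a rolling regression built from matrix algebra and batched NumPy operations, here is a minimal sketch. It is not the module's actual implementation: the function name `rolling_ols_coefs` is hypothetical, and it assumes `X` already contains a constant column if an intercept is wanted.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_ols_coefs(y, X, window):
    """Hypothetical helper: OLS coefficients for every rolling window.

    y has shape (n,), X has shape (n, k).
    Returns an array of shape (n - window + 1, k), one row per window.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    # sliding_window_view appends the window axis last: (n_windows, k, window).
    # Transpose so that each window is an ordinary (window, k) design matrix.
    Xw = sliding_window_view(X, window, axis=0).transpose(0, 2, 1)
    yw = sliding_window_view(y, window)          # (n_windows, window)
    Xt = Xw.transpose(0, 2, 1)                   # (n_windows, k, window)
    XtX = Xt @ Xw                                # (n_windows, k, k)
    Xty = Xt @ yw[..., None]                     # (n_windows, k, 1)
    # Solve the normal equations for all windows in one batched call.
    return np.linalg.solve(XtX, Xty)[..., 0]
```

Every window is solved in a single `np.linalg.solve` call, so no Python-level loop over windows is needed.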

An example:

    import urllib.parse
    import pandas as pd
    from pyfinance.ols import PandasRollingOLS

    # You can also do this with pandas-datareader; here's the hard way
    url = "https://fred.stlouisfed.org/graph/fredgraph.csv"

    syms = {
        "TWEXBMTH": "usd",
        "T10Y2YM": "term_spread",
        "GOLDAMGBD228NLBM": "gold",
    }

    params = {
        "fq": "Monthly,Monthly,Monthly",
        "id": ",".join(syms.keys()),
        "cosd": "2000-01-01",
        "coed": "2019-02-01",
    }

    data = pd.read_csv(
        url + "?" + urllib.parse.urlencode(params, safe=","),
        na_values={"."},
        parse_dates=["DATE"],
        index_col=0
    ).pct_change().dropna().rename(columns=syms)
    print(data.head())
    #                  usd  term_spread      gold
    # DATE
    # 2000-02-01  0.012580    -1.409091  0.057152
    # 2000-03-01 -0.000113     2.000000 -0.047034
    # 2000-04-01  0.005634     0.518519 -0.023520
    # 2000-05-01  0.022017    -0.097561 -0.016675
    # 2000-06-01 -0.010116     0.027027  0.036599

    y = data.usd
    x = data.drop('usd', axis=1)

    window = 12  # months
    model = PandasRollingOLS(y=y, x=x, window=window)

    print(model.beta.head())  # Coefficients excluding the intercept
    #             term_spread      gold
    # DATE
    # 2001-01-01     0.000033 -0.054261
    # 2001-02-01     0.000277 -0.188556
    # 2001-03-01     0.002432 -0.294865
    # 2001-04-01     0.002796 -0.334880
    # 2001-05-01     0.002448 -0.241902

    print(model.fstat.head())
    # DATE
    # 2001-01-01    0.136991
    # 2001-02-01    1.233794
    # 2001-03-01    3.053000
    # 2001-04-01    3.997486
    # 2001-05-01    3.855118
    # Name: fstat, dtype: float64

    print(model.rsq.head())  # R-squared
    # DATE
    # 2001-01-01    0.029543
    # 2001-02-01    0.215179
    # 2001-03-01    0.404210
    # 2001-04-01    0.470432
    # 2001-05-01    0.461408
    # Name: rsq, dtype: float64

citynorman · Answer

Rolling beta with sklearn

    import pandas as pd
    from sklearn import linear_model

    def rolling_beta(X, y, idx, window=255):
        # X and y are 2-D arrays of shape (n, 1); idx holds the matching dates
        assert len(X) == len(y)

        out_dates = []
        out_beta = []

        model_ols = linear_model.LinearRegression()

        for iStart in range(0, len(X) - window):
            iEnd = iStart + window

            # fit on rows iStart..iEnd-1
            model_ols.fit(X[iStart:iEnd], y[iStart:iEnd])

            # store output, labelled with the date right after the window
            out_dates.append(idx[iEnd])
            out_beta.append(model_ols.coef_[0][0])

        return pd.DataFrame({'beta': out_beta}, index=out_dates)


    # df_rtn_stocks is a DataFrame of returns with 'NDX' and 'CRM' columns (not defined here)
    df_beta = rolling_beta(
        df_rtn_stocks['NDX'].values.reshape(-1, 1),
        df_rtn_stocks['CRM'].values.reshape(-1, 1),
        df_rtn_stocks.index.values,
        255,
    )

Pythonic · Answer

For completeness, I'm adding a speedier numpy-only solution that limits the calculations to the regression coefficients and the final estimate.

Numpy rolling regression function

    import numpy as np
    import pandas as pd


    def rolling_regression(y, x, window=60):
        """
        y and x must be pandas.Series
        """
        # === Clean-up ===
        x = x.dropna()
        y = y.dropna()
        # === Trim acc to shortest ===
        if x.index.size > y.index.size:
            x = x[y.index]
        else:
            y = y[x.index]
        # === Verify enough space ===
        if x.index.size < window:
            return None
        else:
            # === Add a constant if needed ===
            X = x.to_frame()
            X['c'] = 1
            # === Loop... this can be improved ===
            estimate_data = []
            for i in range(window, x.index.size + 1):
                X_slice = X.values[i-window:i, :]  # always index in np as opposed to pandas, much faster
                y_slice = y.values[i-window:i]
                coeff = np.dot(np.dot(np.linalg.inv(np.dot(X_slice.T, X_slice)), X_slice.T), y_slice)
                # fitted value at the last observation of the current window
                estimate_data.append(coeff[0] * x.values[i-1] + coeff[1])
            # === Assemble ===
            estimate = pd.Series(data=estimate_data, index=x.index[window-1:])
            return estimate

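Some notes

For use in specific cases where only the final estimate of the regression is needed, x.rolling(window=60).apply(my_ols) appears to be somewhat slow.

As a reminder, the coefficients of a regression can be calculated as a matrix product, as you can read on wikipedia's least squares page. This approach via numpy's matrix multiplication can speed up the process somewhat compared with using the ols in statsmodels. This product is expressed in the line starting with coeff = ...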
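The matrix product in question is the normal-equations solution, beta = (X'X)^(-1) X'y. As a small check, under the assumption of synthetic data generated here purely for illustration, the sketch below computes it three ways and shows that they agree. The `np.linalg.solve` and `np.linalg.lstsq` variants are alternatives swapped in for comparison, not part of the answer's function; they avoid forming an explicit inverse and tend to behave better when X'X is close to singular.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 0.5 * x + 2 + noise
rng = np.random.default_rng(42)
x = rng.normal(size=60)
y = 0.5 * x + 2.0 + rng.normal(scale=0.1, size=60)

# Design matrix with the regressor first and the constant second,
# matching the column order used in rolling_regression.
X = np.column_stack([x, np.ones_like(x)])

# 1. Explicit inverse, as in the coeff = ... line
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# 2. Same normal equations, solved without forming the inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# 3. NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal, beta_solve, beta_lstsq)
```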