OLS using statsmodels.formula.api versus statsmodels.api

Question

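Can anyone explain the difference between ols in statsmodels.formula.api and OLS in statsmodels.api?

Using the Advertising data from the ISLR text, I ran an OLS with both and got different results. I then compared them with scikit-learn's LinearRegression.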

```
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Raw string so the backslashes in the Windows path are not read as escapes
df = pd.read_csv(r"C:\...\Advertising.csv")
x1 = df.loc[:, ['TV']]
y1 = df.loc[:, ['Sales']]

print("Statsmodel.Formula.Api Method")
model1 = smf.ols(formula='Sales ~ TV', data=df).fit()
print(model1.params)

print("\nStatsmodel.Api Method")
model2 = sm.OLS(y1, x1)
results = model2.fit()
print(results.params)

print("\nSci-Kit Learn Method")
model3 = LinearRegression()
model3.fit(x1, y1)
print(model3.coef_)
print(model3.intercept_)
```

The output is as follows:

```
Statsmodel.Formula.Api Method
Intercept    7.032594
TV           0.047537
dtype: float64

Statsmodel.Api Method
TV    0.08325
dtype: float64

Sci-Kit Learn Method
[[ 0.04753664]]
[ 7.03259355]
```

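The statsmodels.api method returns a different parameter for TV than the statsmodels.formula.api and scikit-learn methods do.

What kind of OLS algorithm is statsmodels.api running? Does anyone have a link to documentation that could help answer this question?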

stellasia · Accepted Answer

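The difference is due to the presence or absence of an intercept.

In statsmodels.formula.api, as in the R approach, a constant is automatically added to your data and an intercept is fitted.
In statsmodels.api, you have to add the constant yourself (see the statsmodels documentation). Try using add_constant from statsmodels.api: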
```
x1 = sm.add_constant(x1) 
```

Brad Solomon · Answer

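I came across this issue today and wanted to elaborate on @stellasia's answer, because the statsmodels documentation is perhaps a bit ambiguous.

Unless you are using an actual R-style string formula when instantiating OLS, you need to add a constant (literally a column of 1s) under both statsmodels.formula.api and plain statsmodels.api. @Chetan is using R-style formatting here (formula='Sales ~ TV'), so he will not run into this subtlety, but for people with some Python knowledge and no R background this could be very confusing.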

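Furthermore, it doesn't matter whether you specify the hasconst parameter when building the model. (Which is kind of silly.) In other words, unless you are using an R-style string formula, hasconst is ignored as far as the fitted coefficients go, even though the documentation says it is supposed to

"[Indicate] whether the RHS includes a user-supplied constant"

because, in the footnotes,

"No constant is added by the model unless you are using formulas."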

The example below shows that both .formula.api and .api require a user-added column vector of 1s if you are not using an R-style string formula.

```
import numpy as np
import statsmodels.api as sm

# Generate some relational data
np.random.seed(123)
nobs = 25
x = np.random.random((nobs, 2))
x_with_ones = sm.add_constant(x, prepend=False)  # ones column goes last
beta = [.1, .5, 1]
e = np.random.random(nobs)
y = np.dot(x_with_ones, beta) + e
```

Now throw x and y into Excel and run Data > Data Analysis > Regression, making sure "Constant is zero" is unchecked. You'll get the following coefficients:

```
Intercept      1.497761024
X Variable 1   0.012073045
X Variable 2   0.623936056
```

Now try running this regression on x, not x_with_ones, in either statsmodels.formula.api or statsmodels.api, with hasconst set to None, True, or False. You'll see that in each of those six scenarios no intercept is returned. (There are only two parameters.)

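```
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Note: smf.OLS (uppercase) is the plain array-based class, not the
# lowercase formula function smf.ols. It was importable from
# statsmodels.formula.api in the version used here; newer releases
# may not expose it.
print('smf models')
print('-' * 10)
for hc in [None, True, False]:
    model = smf.OLS(endog=y, exog=x, hasconst=hc).fit()
    print(model.params)

# smf models
# ----------
# [ 1.46852293  1.8558273 ]
# [ 1.46852293  1.8558273 ]
# [ 1.46852293  1.8558273 ]
```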

Now run things correctly, with a column vector of 1.0s added to x. You can use smf here, but it's really not necessary if you're not using formulas.

```
print('sm models')
print('-' * 10)
for hc in [None, True, False]:
    model = sm.OLS(endog=y, exog=x_with_ones, hasconst=hc).fit()
    print(model.params)

# sm models
# ----------
# [ 0.01207304  0.62393606  1.49776102]
# [ 0.01207304  0.62393606  1.49776102]
# [ 0.01207304  0.62393606  1.49776102]
```
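As a cross-check that does not depend on statsmodels at all, the same comparison can be sketched with plain NumPy least squares. This is an illustration only: `np.column_stack` stands in for `sm.add_constant(x, prepend=False)`, and the helper `rss` is a hypothetical name introduced here. It shows that fitting on x alone yields two coefficients, while fitting on x plus a column of ones yields a third coefficient, the intercept.

```python
import numpy as np

# Same data-generating recipe as in the answer.
np.random.seed(123)
nobs = 25
x = np.random.random((nobs, 2))
x_with_ones = np.column_stack([x, np.ones(nobs)])  # ones column appended last
beta = [.1, .5, 1]
e = np.random.random(nobs)
y = np.dot(x_with_ones, beta) + e

# Least squares on x alone: two coefficients, fit forced through the origin.
coef_no_const, *_ = np.linalg.lstsq(x, y, rcond=None)

# Least squares on x plus the ones column: the last coefficient is the intercept.
coef_const, *_ = np.linalg.lstsq(x_with_ones, y, rcond=None)


def rss(design, coef):
    """Residual sum of squares for a given design matrix and coefficients."""
    resid = y - design @ coef
    return float(resid @ resid)


print(coef_no_const)
print(coef_const)
print(rss(x, coef_no_const), rss(x_with_ones, coef_const))
```

Because the model with the ones column nests the model without it, its residual sum of squares can never be larger. Dropping the constant is therefore a modeling restriction, not a different algorithm.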