
python sklearn multiple linear regression display r-squared

I want to compute a multiple linear regression equation and see the adjusted R-squared. With the score function I can see the R-squared, but it is not adjusted.

import pandas as pd #import the pandas module
import numpy as np
df = pd.read_csv('/Users/jeangelj/Documents/training/linexdata.csv', sep=',')
df
       AverageNumberofTickets   NumberofEmployees   ValueofContract Industry
   0              1                    51                  25750    Retail
   1              9                    68                  25000    Services
   2             20                    67                  40000    Services
   3              1                   124                  35000    Retail
   4              8                   124                  25000    Manufacturing
   5             30                   134                  50000    Services
   6             20                   157                  48000    Retail
   7              8                   190                  32000    Retail
   8             20                   205                  70000    Retail
   9             50                   230                  75000    Manufacturing
  10             35                   265                  50000    Manufacturing
  11             65                   296                  75000    Services
  12             35                   336                  50000    Manufacturing
  13             60                   359                  75000    Manufacturing
  14             85                   403                  81000    Services
  15             40                   418                  60000    Retail
  16             75                   437                  53000    Services
  17             85                   451                  90000    Services
  18             65                   465                  70000    Retail
  19             95                   491                  100000   Services

from sklearn.linear_model import LinearRegression
model = LinearRegression()
X, y = df[['NumberofEmployees','ValueofContract']], df.AverageNumberofTickets
model.fit(X, y)
model.score(X, y)
>>0.87764337132340009

When I check manually, 0.87764 is the R-squared, whereas 0.863248 is the adjusted R-squared.
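
For reference, the manual check presumably uses the standard adjusted R-squared formula (the working is not shown here, so this is an assumption), with n = 20 observations and p = 2 predictors taken from the data above:

```latex
\[
\bar{R}^2 \;=\; 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
          \;=\; 1 - (1 - 0.877643)\cdot\frac{19}{17}
          \;\approx\; 0.863248
\]
```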

12
jeangelj

There are many different ways to compute R^2 and adjusted R^2; a few of them are shown below (computed with the data provided).

from sklearn.linear_model import LinearRegression
model = LinearRegression()
X, y = df[['NumberofEmployees','ValueofContract']], df.AverageNumberofTickets
model.fit(X, y)

# compute with formulas from the theory
yhat = model.predict(X)
SS_Residual = sum((y-yhat)**2)
SS_Total = sum((y-np.mean(y))**2)
r_squared = 1 - (float(SS_Residual))/SS_Total
adjusted_r_squared = 1 - (1-r_squared)*(len(y)-1)/(len(y)-X.shape[1]-1)
print(r_squared, adjusted_r_squared)
# 0.877643371323 0.863248473832

# compute with sklearn linear_model; I could not find any function in the
# documentation that computes adjusted R-squared directly
print(model.score(X, y), 1 - (1-model.score(X, y))*(len(y)-1)/(len(y)-X.shape[1]-1))
# 0.877643371323 0.863248473832

# compute with statsmodels, by adding intercept manually
import statsmodels.api as sm
X1 = sm.add_constant(X)
result = sm.OLS(y, X1).fit()
# print(dir(result))
print(result.rsquared, result.rsquared_adj)
# 0.877643371323 0.863248473832

# compute with statsmodels, another way, using formula
# (imported as smf so it does not shadow statsmodels.api, imported as sm above)
import statsmodels.formula.api as smf
result = smf.ols(formula="AverageNumberofTickets ~ NumberofEmployees + ValueofContract", data=df).fit()
# print(result.summary())
print(result.rsquared, result.rsquared_adj)
# 0.877643371323 0.863248473832
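
Since the same adjustment expression is repeated in the snippets above, it may be convenient to wrap it in a small helper. The following is only a sketch: `adjusted_r2` is a hypothetical name, not a scikit-learn or statsmodels function, and it assumes the model includes an intercept and that `n_features` counts predictors only.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 from plain R^2 (hypothetical helper, not a library call).

    n_features counts the predictors only; the intercept is accounted
    for by the extra -1 in the denominator.
    """
    dof = n_samples - n_features - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# R^2 reported by model.score(X, y) in the question; 20 rows, 2 predictors
r2 = 0.87764337132340009
print(round(adjusted_r2(r2, n_samples=20, n_features=2), 6))  # 0.863248
```

With a fitted model, `adjusted_r2(model.score(X, y), *X.shape)` should give the same number, because `X.shape` is `(n_samples, n_features)`. The guard on the degrees of freedom avoids a division by zero or a sign flip when there are too few rows for the number of predictors.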
34
Sandipan Dey