Explicitly specifying test/train sets in GridSearchCV

Question

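I have a question about the cv parameter of sklearn's GridSearchCV.

I'm working with data that has a time component, so I don't think random shuffling within KFold cross-validation is sensible.

Instead, I'd like to explicitly specify the cutoffs for the training, validation, and test data within GridSearchCV. Can I do this?

To make the question clearer, here is how I would do it manually.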

    import numpy as np
    import pandas as pd
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge

    np.random.seed(444)

    index = pd.date_range('2014', periods=60, freq='M')
    X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)
    X = pd.DataFrame(X, index=index, columns=list('abc'))
    y = pd.Series(y, index=index, name='y')

    # Train on the first 35 samples, validate on the next 15, test on
    # the final 10.
    X_train, X_val, X_test = np.array_split(X, [35, 50])
    y_train, y_val, y_test = np.array_split(y, [35, 50])

    param_grid = {'alpha': np.linspace(0, 1, 11)}
    model = None
    best_param_ = None
    best_score_ = -np.inf

    # Manual implementation
    for alpha in param_grid['alpha']:
        ridge = Ridge(random_state=444, alpha=alpha).fit(X_train, y_train)
        score = ridge.score(X_val, y_val)
        if score > best_score_:
            best_score_ = score
            best_param_ = alpha
            model = ridge

    print('Optimal alpha parameter: {:0.2f}'.format(best_param_))
    print('Best score (on validation data): {:0.2f}'.format(best_score_))
    print('Test set score: {:.2f}'.format(model.score(X_test, y_test)))
    # Optimal alpha parameter: 1.00
    # Best score (on validation data): 0.64
    # Test set score: 0.22

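The process here is:

For both X and y, I want a training set, a validation set, and a test set. The training set is the first 35 samples in the time series. The validation set is the next 15 samples. The test set is the final 10.
The train and validation sets are used to determine the optimal alpha parameter for Ridge regression. Here I test alphas of (0.0, 0.1, ..., 0.9, 1.0).
The test set is held out as unseen data for the "actual" testing.

Anyway... it seems like I'm looking to do something like this, but I'm not sure what to pass to cv here: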

    from sklearn.model_selection import GridSearchCV

    grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv= ???)
    grid_search.fit(...?)

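The docs, which I'm having trouble interpreting, specify:

cv : int, cross-validation generator or an iterable, optional

Determines the cross-validation splitting strategy. Possible inputs for cv are:

None, to use the default 3-fold cross validation,

integer, to specify the number of folds in a (Stratified)KFold,

An object to be used as a cross-validation generator.

An iterable yielding train, test splits.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.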
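The last option in the quoted docs, an iterable yielding train/test splits, can be read quite literally: a list holding one (train_indices, validation_indices) pair. The sketch below is only an illustration under assumptions: it uses plain NumPy arrays instead of the DataFrame from the question, and the names X_dev, y_dev and splits are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data shaped like the question's: 60 time-ordered rows, 3 features.
X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)

# Hold out the last 10 rows as the test set; GridSearchCV never sees them.
X_dev, y_dev = X[:50], y[:50]      # hypothetical names: train + validation rows
X_test, y_test = X[50:], y[50:]

# One (train, validation) pair of positional indices, relative to X_dev.
splits = [(np.arange(0, 35), np.arange(35, 50))]

param_grid = {'alpha': np.linspace(0, 1, 11)}
grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv=splits)
grid_search.fit(X_dev, y_dev)

print(grid_search.best_params_)
print(grid_search.n_splits_)  # number of splits taken from the iterable
```

The indices are positions within whatever is passed to fit(), so the held-out test rows have to be sliced off beforehand.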

Vivek Kumar · Accepted Answer

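As @MaxU said, it's better to let GridSearchCV handle the splits, but if you want to enforce the split as you have set it up in the question, you can use PredefinedSplit, which does exactly this.

So you need to make the following changes to your code.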

    # Here X_test, y_test is the untouched data.
    # Validation data (X_val, y_val) is currently inside X_train, which will be
    # split using PredefinedSplit inside GridSearchCV.
    X_train, X_test = np.array_split(X, [50])
    y_train, y_test = np.array_split(y, [50])

    # The indices which have the value -1 will be kept in train.
    train_indices = np.full((35,), -1, dtype=int)

    # The indices which have zero or positive values will be kept in test.
    test_indices = np.full((15,), 0, dtype=int)
    test_fold = np.append(train_indices, test_indices)

    print(test_fold)
    # OUTPUT: 35 entries of -1 followed by 15 entries of 0

    from sklearn.model_selection import PredefinedSplit
    ps = PredefinedSplit(test_fold)

    # Check how many splits will be done, based on test_fold
    ps.get_n_splits()
    # OUTPUT: 1

    for train_index, test_index in ps.split():
        print("TRAIN:", train_index, "TEST:", test_index)
    # OUTPUT: TRAIN holds indices 0 to 34, TEST holds indices 35 to 49

    # And now, send this `ps` to the cv param in GridSearchCV
    from sklearn.model_selection import GridSearchCV
    grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv=ps)

    # Here, send the X_train and y_train
    grid_search.fit(X_train, y_train)

The X_train, y_train sent to fit() will be split into train and test (val in your case) using the split we defined, so during the search Ridge is trained on the original data at indices [0:35] and tested on [35:50].
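One way to convince yourself of this is to compare the grid search's validation score with a Ridge fitted by hand on rows [0:35] and scored on rows [35:50]. The following is a minimal sketch, assuming plain NumPy arrays rather than the question's DataFrame; the names manual and manual_val_score are introduced only for this check.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, PredefinedSplit

X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)
X_train, y_train = X[:50], y[:50]   # training + validation rows
X_test, y_test = X[50:], y[50:]     # untouched test rows

# -1 = always kept in training; 0 = member of validation fold 0.
test_fold = np.r_[np.full(35, -1), np.full(15, 0)]
ps = PredefinedSplit(test_fold)

param_grid = {'alpha': np.linspace(0, 1, 11)}
grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv=ps)
grid_search.fit(X_train, y_train)

# Manual check: fit on rows 0..34 only, score on rows 35..49.
best_alpha = grid_search.best_params_['alpha']
manual = Ridge(random_state=444, alpha=best_alpha).fit(X_train[:35], y_train[:35])
manual_val_score = manual.score(X_train[35:], y_train[35:])

print('alpha:', best_alpha)
print('validation R^2 (grid search):', grid_search.best_score_)
print('validation R^2 (manual):     ', manual_val_score)

# With the default refit=True, best_estimator_ is refitted on all 50 rows
# passed to fit(), so this differs from scoring the 35-row manual model.
print('test R^2:', grid_search.score(X_test, y_test))
```

Note the refit step: unlike the manual loop in the question, the final estimator here has also seen the validation rows. Pass refit=False if you need the model trained on the first 35 rows only, and fit it yourself with the chosen alpha.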

Hope this clears up how it works.

Bert Kellerman · Answer

Have you tried TimeSeriesSplit?

It was made explicitly for splitting time series data.

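    from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

    # clf and param_grid stand for your own estimator and parameter grid
    tscv = TimeSeriesSplit(n_splits=3)
    grid_search = GridSearchCV(clf, param_grid, cv=tscv.split(X))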
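To see what this does with 60 samples like the ones in the question, it helps to print the folds. This is a small sketch with a placeholder array; the folds list is only there for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.zeros((60, 3))  # placeholder; only the number of rows matters here

tscv = TimeSeriesSplit(n_splits=3)
folds = []
for train_idx, val_idx in tscv.split(X):
    # Record first/last position of each block as plain ints.
    folds.append((int(train_idx[0]), int(train_idx[-1]),
                  int(val_idx[0]), int(val_idx[-1])))

for i, (t0, t1, v0, v1) in enumerate(folds):
    print(f'fold {i}: train {t0}..{t1}, validate {v0}..{v1}')
# fold 0: train 0..14, validate 15..29
# fold 1: train 0..29, validate 30..44
# fold 2: train 0..44, validate 45..59
```

Each fold validates on rows that come strictly after its training rows, with an expanding training window. This is not the single 35/15 cutoff from the question; for that exact split, PredefinedSplit is the closer match. The splitter object itself can also be passed directly, as in cv=tscv.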

rohan chikorde · Answer

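With time series data, K-fold is not the right approach: k-fold CV ignores the time ordering (and shuffles your data when shuffle=True), so models end up validated on observations that precede their training data and the pattern within the series is lost. Here is an approach: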

    import xgboost as xgb
    from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
    import numpy as np

    X = np.array([[4, 5, 6, 1, 0, 2], [3.1, 3.5, 1.0, 2.1, 8.3, 1.1]]).T
    y = np.array([1, 6, 7, 1, 2, 3])

    tscv = TimeSeriesSplit(n_splits=2)

    model = xgb.XGBRegressor()
    param_search = {'max_depth': [3, 5]}

    my_cv = tscv.split(X)
    gsearch = GridSearchCV(estimator=model, cv=my_cv, param_grid=param_search)
    gsearch.fit(X, y)

Reference: How do I use a TimeSeriesSplit with a GridSearchCV object to tune a model in scikit-learn?