How to implement SMOTE in cross-validation and GridSearchCV

Question

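I am relatively new to Python. Can you help me turn my SMOTE implementation into a proper pipeline? What I want is to apply over- and under-sampling to the training set of every k-fold iteration, so that the model is trained on a balanced data set and evaluated on the imbalanced held-out part. The problem is that when I do this, I can no longer use the familiar sklearn interface for evaluation and grid search.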

Is it possible to build something similar to model_selection.RandomizedSearchCV? My take on this:

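    import numpy as np
    import pandas as pd
    from imblearn.combine import SMOTEENN
    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import recall_score, roc_auc_score
    from sklearn.model_selection import StratifiedKFold

    df = pd.read_csv("Imbalanced_data.csv")  # Load the data set
    X = df.iloc[:, 0:64].values
    y = df.iloc[:, 64].values

    n_splits = 2
    n_measures = 2  # Recall and AUC
    kf = StratifiedKFold(n_splits=n_splits)  # Stratified so every fold keeps the class ratio
    clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
    scores = np.zeros((n_splits, n_measures))

    for fold, (train_index, test_index) in enumerate(kf.split(X, y)):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # Resample the training fold only; the test fold stays imbalanced
        sm = SMOTE(sampling_strategy="auto", k_neighbors=5)  # `ratio` in older imbalanced-learn
        smote_enn = SMOTEENN(smote=sm)
        # fit_resample was called fit_sample in older imbalanced-learn
        x_train_res, y_train_res = smote_enn.fit_resample(X_train, y_train)
        clf_rf.fit(x_train_res, y_train_res)
        y_pred = clf_rf.predict(X_test)  # predict takes X only, not y
        # Index the score matrix by fold number, with 0-based columns
        scores[fold, 0] = recall_score(y_test, y_pred)
        # AUC is computed from the positive-class probabilities (binary target assumed)
        scores[fold, 1] = roc_auc_score(y_test, clf_rf.predict_proba(X_test)[:, 1])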

Vivek Kumar · Accepted Answer

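You need to look at the pipeline object. imbalanced-learn has a Pipeline that extends the scikit-learn Pipeline so that it handles the fit_sample() and sample() methods of samplers (fit_sample() is called fit_resample() in later imbalanced-learn releases) in addition to scikit-learn's fit_predict(), fit_transform() and predict() methods.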

Have a look at this example from the imbalanced-learn documentation:

http://contrib.scikit-learn.org/imbalanced-learn/stable/auto_examples/pipeline/plot_pipeline_classification.html#sphx-glr-auto-examples-pipeline-plot-pipeline-classification-py

For your code, you would want to do this:

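    from imblearn.combine import SMOTEENN
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import make_pipeline, Pipeline
    from sklearn.ensemble import RandomForestClassifier

    sm = SMOTE(k_neighbors=5)  # the SMOTE instance from the question
    smote_enn = SMOTEENN(smote=sm)
    clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)

    pipeline = make_pipeline(smote_enn, clf_rf)
    # OR
    pipeline = Pipeline([('smote_enn', smote_enn), ('clf_rf', clf_rf)])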

Then you can pass this pipeline object to scikit-learn's GridSearchCV, RandomizedSearchCV or other cross-validation tools, just like any regular estimator.

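    from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

    kf = StratifiedKFold(n_splits=n_splits)
    # param_dist maps pipeline parameter names to the distributions or lists to sample from
    random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                                       n_iter=1000, cv=kf)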

Matti Lyra · Answer

This looks like it would fit the bill: http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html

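You would create your own transformer ( http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html ) that returns a balanced data set when fit is called (presumably on the split obtained from StratifiedKFold), but that, when predict is called, which is what happens for the test data, calls into SMOTE.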