Pipeline: multiple classifiers?

Question

I read the following example on Pipelines and GridSearchCV in Python: http://www.davidsbatista.net/blog/2017/04/01/document_classification/

Logistic Regression:

    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words=stop_words)),
        ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'))),
    ])
    parameters = {
        'tfidf__max_df': (0.25, 0.5, 0.75),
        'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
        "clf__estimator__C": [0.01, 0.1, 1],
        "clf__estimator__class_weight": ['balanced', None],
    }

SVM:

    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words=stop_words)),
        ('clf', OneVsRestClassifier(LinearSVC())),
    ])
    parameters = {
        'tfidf__max_df': (0.25, 0.5, 0.75),
        'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
        "clf__estimator__C": [0.01, 0.1, 1],
        "clf__estimator__class_weight": ['balanced', None],
    }

Is there a way to combine Logistic Regression and SVM into one Pipeline? Say I have a TfidfVectorizer and want to test it against multiple classifiers, with each one reporting its best model/parameters.

cgnorthcutt · Answer

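Here is a simple way to optimize over any set of classifiers and, for each classifier, any settings of its parameters.

Create a switcher class that works for any estimator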

    from sklearn.base import BaseEstimator
    from sklearn.linear_model import SGDClassifier  # needed for the default argument


    class ClfSwitcher(BaseEstimator):

        def __init__(
            self,
            estimator = SGDClassifier(),
        ):
            """
            A Custom BaseEstimator that can switch between classifiers.
            :param estimator: sklearn object - The classifier
            """
            self.estimator = estimator

        def fit(self, X, y=None, **kwargs):
            self.estimator.fit(X, y)
            return self

        def predict(self, X, y=None):
            return self.estimator.predict(X)

        def predict_proba(self, X):
            return self.estimator.predict_proba(X)

        def score(self, X, y):
            return self.estimator.score(X, y)

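Now you can pass in anything for the `estimator` parameter, and you can optimize any parameter of whichever estimator you pass in, as follows:

Performing hyper-parameter optimization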

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import GridSearchCV

    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', ClfSwitcher()),
    ])

    # A list of grids: each dict is searched separately, one per classifier.
    parameters = [
        {
            'clf__estimator': [SGDClassifier()],  # SVM if hinge loss / logreg if log loss
            'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
            'tfidf__stop_words': ['english', None],
            'clf__estimator__penalty': ('l2', 'elasticnet', 'l1'),
            'clf__estimator__max_iter': [50, 80],
            'clf__estimator__tol': [1e-4],
            # 'log_loss' was spelled 'log' before scikit-learn 1.1
            'clf__estimator__loss': ['hinge', 'log_loss', 'modified_huber'],
        },
        {
            'clf__estimator': [MultinomialNB()],
            'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
            'tfidf__stop_words': [None],
            'clf__estimator__alpha': (1e-2, 1e-3, 1e-1),
        },
    ]

    gscv = GridSearchCV(pipeline, parameters, cv=5, n_jobs=12, return_train_score=False, verbose=3)
    # train_data: raw text documents, train_labels: their labels (not defined here)
    gscv.fit(train_data, train_labels)
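Since the question asks for the best model/parameters per classifier, here is a minimal sketch of how one might read that out of `cv_results_` after such a search. The toy corpus, the reduced grids, the explicit `StratifiedKFold` and the `best_by_clf` dictionary are all assumptions made for illustration; they are not part of the answer above.

```python
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class ClfSwitcher(BaseEstimator):
    """Shortened copy of the switcher, so this sketch runs on its own."""

    def __init__(self, estimator=SGDClassifier()):
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def score(self, X, y):
        return self.estimator.score(X, y)


# Hypothetical toy data, only here to make the sketch executable.
docs = [
    "the cat sat on the mat", "dogs chase cats in the yard",
    "my cat purrs loudly", "the dog barked at the cat",
    "stocks fell sharply today", "the market rallied on earnings",
    "investors sold stocks and bonds", "bond market yields rose today",
]
labels = ["pets"] * 4 + ["finance"] * 4

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", ClfSwitcher())])
parameters = [
    {
        "clf__estimator": [SGDClassifier(random_state=0)],
        "clf__estimator__loss": ["hinge", "modified_huber"],
    },
    {
        "clf__estimator": [MultinomialNB()],
        "clf__estimator__alpha": [0.1, 1.0],
    },
]

# Stratify explicitly so that each tiny fold contains both classes.
gscv = GridSearchCV(pipeline, parameters, cv=StratifiedKFold(n_splits=2))
gscv.fit(docs, labels)

# Overall winner across all classifiers.
print(gscv.best_score_, gscv.best_params_)

# Best candidate per classifier type, taken from the per-candidate results.
best_by_clf = {}
for params, score in zip(gscv.cv_results_["params"],
                         gscv.cv_results_["mean_test_score"]):
    name = type(params["clf__estimator"]).__name__
    if name not in best_by_clf or score > best_by_clf[name][0]:
        best_by_clf[name] = (score, params)

for name, (score, params) in best_by_clf.items():
    print(name, round(score, 3), params)
```

`gscv.best_estimator_` holds only the overall winner refitted on all the data; the per-classifier entries in `best_by_clf` are parameter settings, so a runner-up would need to be refitted separately if you want to use it.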

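How to interpret `clf__estimator__loss`

`clf__estimator__loss` is interpreted as the `loss` parameter of whatever `estimator` is. In the topmost example `estimator = SGDClassifier()`, and `estimator` is itself a parameter of `clf`, which is a `ClfSwitcher` object.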
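The same name resolution can be tried by hand with `set_params` and `get_params`, which is roughly what the grid search does for every candidate. This is a small sketch under the assumption that the switcher is built with an explicit `SGDClassifier()` instance (so the shared default object is not mutated); the variable names are illustrative.

```python
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class ClfSwitcher(BaseEstimator):
    """Shortened copy of the switcher, so this sketch runs on its own."""

    def __init__(self, estimator=SGDClassifier()):
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)


pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", ClfSwitcher(estimator=SGDClassifier())),
])

# Each double underscore descends one level:
# step "clf" -> its "estimator" parameter -> that estimator's "loss".
pipeline.set_params(clf__estimator__loss="modified_huber")
inner = pipeline.named_steps["clf"].estimator
print(inner.loss)

# The nested names a grid may refer to are the keys of get_params().
params_with_sgd = pipeline.get_params()
print("clf__estimator__loss" in params_with_sgd)

# Swapping the estimator and tuning the new one in a single call:
# the estimator is replaced first, then its own parameter is set.
pipeline.set_params(clf__estimator=MultinomialNB(), clf__estimator__alpha=0.1)
print(pipeline.named_steps["clf"].estimator)
```

Because the valid nested names depend on the estimator currently plugged in, each classifier gets its own dictionary in the parameter list.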

David Batista · Answer

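Yes, you can do that by building a wrapper function. The idea is to pass it two dictionaries: the models and the parameters.

Then you iteratively call each model with all of its parameters to test, using GridSearchCV for this.

Check this example. It has some extra functionality so that, at the end, you can output a data frame with a summary of the different models/parameters and their performance scores.

EDIT: it is too much code to paste here; you can check a full working example here: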

http://www.davidsbatista.net/blog/2018/02/23/model_optimization/
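For orientation, the two-dictionary idea could look roughly like the sketch below. This is not the code from the linked post: the function name `search_all`, the dictionary keys, the toy corpus and the plain (non one-vs-rest) classifiers are all assumptions chosen to keep the example short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def search_all(models, params, X, y, cv=2):
    """Run one GridSearchCV per model and keep the best result of each."""
    summary = {}
    for name, model in models.items():
        pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", model)])
        search = GridSearchCV(pipeline, params[name], cv=cv)
        search.fit(X, y)
        summary[name] = {
            "best_score": search.best_score_,
            "best_params": search.best_params_,
            "best_estimator": search.best_estimator_,
        }
    return summary


# Two dictionaries sharing the same keys: the models and their grids.
models = {
    "logreg": LogisticRegression(),
    "svm": LinearSVC(),
}
params = {
    "logreg": {"clf__C": [0.01, 0.1, 1], "clf__class_weight": ["balanced", None]},
    "svm": {"clf__C": [0.01, 0.1, 1], "clf__class_weight": ["balanced", None]},
}

# Hypothetical toy data, only here to make the sketch executable.
docs = [
    "the cat sat on the mat", "dogs chase cats in the yard",
    "my cat purrs loudly", "the dog barked at the cat",
    "stocks fell sharply today", "the market rallied on earnings",
    "investors sold stocks and bonds", "bond market yields rose today",
]
labels = ["pets"] * 4 + ["finance"] * 4

summary = search_all(models, params, docs, labels)
for name, result in summary.items():
    print(name, round(result["best_score"], 3), result["best_params"])
```

Compared with a single search over a switcher class, this runs one independent search per model, so every model ends up with its own refitted `best_estimator`.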
