
How do I add another feature (text length) to my current classification of words? Scikit-learn

I am using a bag of words to classify text. It works well, but I am wondering how to add a feature that is not a word.

Here is my sample code:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "new york is also called the big Apple",
                    "nyc is Nice",
                    "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.",
                    "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.",
                    "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.",
                    "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there.",])
y_train = [[0],[0],[0],[0],[1],[1],[1],[1]]

X_test = np.array(["it's a Nice day in nyc",
                   'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.'
                   ])   
target_names = ['Class 1', 'Class 2']

classifier = Pipeline([
    ('vectorizer', CountVectorizer(min_df=1,max_df=2)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
for item, labels in zip(X_test, predicted):
    print('%s => %s' % (item, ', '.join(target_names[x] for x in labels)))

Clearly, the texts about London tend to be far longer than the texts about New York. How would I go about adding the length of the text as a feature? Do I have to use another classification method and then combine the two predictions? Is there any way of doing it together with the bag of words? Some sample code would be great, as I am very new to machine learning and scikit-learn.

Asked by aaravam (score: 16)

As noted in the comments, this is a combination of a FunctionTransformer, a Pipeline, and a FeatureUnion.

import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import FunctionTransformer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "new york is also called the big Apple",
                    "nyc is Nice",
                    "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.",
                    "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.",
                    "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.",
                    "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there.",])
y_train = np.array([[0],[0],[0],[0],[1],[1],[1],[1]])

X_test = np.array(["it's a Nice day in nyc",
                   'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.'
                   ])   
target_names = ['Class 1', 'Class 2']


def get_text_length(x):
    # Return the character length of each document as an (n_samples, 1) column
    return np.array([len(t) for t in x]).reshape(-1, 1)

classifier = Pipeline([
    ('features', FeatureUnion([
        # Bag-of-words features weighted by tf-idf
        ('text', Pipeline([
            ('vectorizer', CountVectorizer(min_df=1, max_df=2)),
            ('tfidf', TfidfTransformer()),
        ])),
        # Text length as an additional numeric feature
        ('length', Pipeline([
            ('count', FunctionTransformer(get_text_length, validate=False)),
        ]))
    ])),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
print(predicted)

This adds the length of the text to the features used by the classifier.
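To check that the length column actually ends up in the feature matrix, you can transform the training texts with the fitted FeatureUnion and look at the shape. This is just a quick sanity check against the pipeline defined above (the step names 'features', 'text', and 'length' are the ones used there):

# Transform the training texts with the fitted feature union and inspect the result
combined = classifier.named_steps['features'].transform(X_train)
print(combined.shape)  # (8, vocabulary_size + 1): the extra column is the text length

Note that the raw character count is on a very different scale from the tf-idf values, and LinearSVC is sensitive to feature scale, so in practice you may want to add a scaler (for example a StandardScaler) to the 'length' sub-pipeline.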

Answered by Ken Syme (score: 10)

I assume the new feature you want to add is numeric. Here is my logic. First, transform the text into a sparse representation using something like TfidfTransformer. Then convert that sparse representation into a pandas DataFrame and add your new column, which I assume is numeric. Finally, you can convert the DataFrame back into a scipy sparse matrix using the sparse module or whatever module you find easiest to use. I assume your data is in a pandas DataFrame called dataset with a 'Text Column' and a 'Numeric Column'. Here is some code:

import pandas as pd

dataset = pd.DataFrame({'Text Column': ['Sample Text1', 'Sample Text2'], 'Numeric Column': [2, 1]})
dataset.head()

        Numeric Column   Text Column
0                   2    Sample Text1
1                   1    Sample Text2

from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from scipy import sparse

tv = TfidfVectorizer(min_df = 0.05, max_df = 0.5, stop_words = 'english')
X = tv.fit_transform(dataset['Text Column'])
vocab = tv.get_feature_names()

X1 = pd.DataFrame(X.toarray(), columns = vocab)
X1['Numeric Column'] = dataset['Numeric Column']


X_sparse = sparse.csr_matrix(X1.values)

Finally, you may want to run:

print(X_sparse.shape)
print(X.shape)

to check that the new column was added successfully. I hope this helps.
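If memory is a concern, converting the sparse tf-idf matrix to a dense array just to use a DataFrame can be expensive. A minimal alternative sketch, assuming the same dataset DataFrame and tf-idf matrix X as above, stacks the numeric column onto the sparse matrix directly with scipy.sparse.hstack:

from scipy import sparse

# Turn the numeric column into an (n_samples, 1) sparse column and append it to the tf-idf matrix
numeric_col = sparse.csr_matrix(dataset['Numeric Column'].values.reshape(-1, 1))
X_sparse = sparse.hstack([X, numeric_col], format='csr')
print(X_sparse.shape)  # one more column than X.shape

Either way, the resulting matrix can be fed straight to an estimator such as LinearSVC.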

Answered by Samuel Nde (score: 0)