Python: tf-idf-cosine: to find document similarity

Question

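I was following a tutorial that is available at Part 1 & Part 2. Unfortunately, the author did not have time for the final section, which involved using cosine similarity to actually find the distance between two documents. I followed the examples in the article with the help of the following link from stackoverflow. The code mentioned in that link is included here for convenience: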

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from nltk.corpus import stopwords
    import numpy as np
    import numpy.linalg as LA

    train_set = ["The sky is blue.", "The Sun is bright."]  # Documents
    test_set = ["The Sun in the sky is bright."]  # Query
    stopWords = stopwords.words('english')

    vectorizer = CountVectorizer(stop_words = stopWords)
    # print(vectorizer)
    transformer = TfidfTransformer()
    # print(transformer)

    trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
    testVectorizerArray = vectorizer.transform(test_set).toarray()
    print('Fit Vectorizer to train set', trainVectorizerArray)
    print('Transform Vectorizer to test set', testVectorizerArray)

    transformer.fit(trainVectorizerArray)
    print()
    print(transformer.transform(trainVectorizerArray).toarray())

    transformer.fit(testVectorizerArray)
    print()
    tfidf = transformer.transform(testVectorizerArray)
    print(tfidf.todense())

As a result of the above code, I have the following matrices:

    Fit Vectorizer to train set [[1 0 1 0]
     [0 1 0 1]]
    Transform Vectorizer to test set [[0 1 1 1]]

    [[ 0.70710678  0.          0.70710678  0.        ]
     [ 0.          0.70710678  0.          0.70710678]]

    [[ 0.          0.57735027  0.57735027  0.57735027]]

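I am not sure how to use this output to calculate cosine similarity. I know how to implement cosine similarity for two vectors of the same length, but here I am not sure how to identify the two vectors.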

add-semi-colons · Accepted Answer

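With the help of @excray's comment, I managed to figure out the answer. All we actually need to do is write a simple for loop that iterates over the two arrays representing the train data and the test data.

First, implement a simple lambda function to hold the formula for the cosine calculation:

    cosine_function = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)

Then just write a simple for loop to iterate over the vectors. The logic is: "for each vector in trainVectorizerArray, find the cosine similarity with the vector in testVectorizerArray."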

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from nltk.corpus import stopwords
    import numpy as np
    import numpy.linalg as LA

    train_set = ["The sky is blue.", "The Sun is bright."]  # Documents
    test_set = ["The Sun in the sky is bright."]  # Query
    stopWords = stopwords.words('english')

    vectorizer = CountVectorizer(stop_words = stopWords)
    # print(vectorizer)
    transformer = TfidfTransformer()
    # print(transformer)

    trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
    testVectorizerArray = vectorizer.transform(test_set).toarray()
    print('Fit Vectorizer to train set', trainVectorizerArray)
    print('Transform Vectorizer to test set', testVectorizerArray)

    # Cosine similarity of two 1-D vectors, rounded to three decimals
    cx = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)

    # Compare every training vector with every test vector
    for vector in trainVectorizerArray:
        print(vector)
        for testV in testVectorizerArray:
            print(testV)
            cosine = cx(vector, testV)
            print(cosine)

    transformer.fit(trainVectorizerArray)
    print()
    print(transformer.transform(trainVectorizerArray).toarray())

    transformer.fit(testVectorizerArray)
    print()
    tfidf = transformer.transform(testVectorizerArray)
    print(tfidf.todense())

Here is the output:

    Fit Vectorizer to train set [[1 0 1 0]
     [0 1 0 1]]
    Transform Vectorizer to test set [[0 1 1 1]]
    [1 0 1 0]
    [0 1 1 1]
    0.408
    [0 1 0 1]
    [0 1 1 1]
    0.816

    [[ 0.70710678  0.          0.70710678  0.        ]
     [ 0.          0.70710678  0.          0.70710678]]

    [[ 0.          0.57735027  0.57735027  0.57735027]]
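The two scores in the output, 0.408 and 0.816, can be checked by hand with nothing but NumPy. The snippet below is a minimal sketch in Python 3; the helper name `cosine_similarity_1d` is made up for illustration, and the count vectors are copied from the printed arrays.

```python
import numpy as np


def cosine_similarity_1d(a, b):
    """Cosine of the angle between two 1-D vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Count vectors copied from the printed output
train_vectors = [[1, 0, 1, 0], [0, 1, 0, 1]]
test_vector = [0, 1, 1, 1]

# 1 / (sqrt(2) * sqrt(3)) and 2 / (sqrt(2) * sqrt(3)), rounded to 3 decimals
scores = [round(cosine_similarity_1d(v, test_vector), 3) for v in train_vectors]
print(scores)
```

The second training document shares two terms with the query and the first shares only one, which is why its score is twice as large.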

ogrisel · Answer

First off, if you want to extract count features and apply TF-IDF normalization and row-wise Euclidean normalization, you can do it in one operation with TfidfVectorizer:

    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> from sklearn.datasets import fetch_20newsgroups
    >>> twenty = fetch_20newsgroups()
    >>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
    >>> tfidf
    <11314x130088 sparse matrix of type '<type 'numpy.float64'>'
        with 1787553 stored elements in Compressed Sparse Row format>
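As a rough check that TfidfVectorizer really combines the counting and TF-IDF steps, the sketch below compares it with CountVectorizer followed by TfidfTransformer on a tiny corpus. It assumes Python 3, a recent scikit-learn, and default parameters for all three classes; the three sentences are borrowed from the question and the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = [
    "The sky is blue.",
    "The Sun is bright.",
    "The Sun in the sky is bright.",
]

# Two steps: raw term counts, then TF-IDF weighting with L2 row normalization
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```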

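Now, to find the cosine similarities between one document (e.g. the first in the dataset) and all of the others, you just need to compute the dot products of the first vector with all of the others, because the tfidf vectors are already row-normalized. The scipy sparse matrix API is a bit weird (not as flexible as dense N-dimensional numpy arrays). To get the first vector, you need to slice the matrix row-wise to get a submatrix with a single row: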

    >>> tfidf[0:1]
    <1x130088 sparse matrix of type '<type 'numpy.float64'>'
        with 89 stored elements in Compressed Sparse Row format>
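The point about row normalization can be seen on a small corpus: when every row has unit Euclidean length, a plain dot product already equals the cosine similarity. The following is a minimal sketch assuming Python 3 and a recent scikit-learn and SciPy; the corpus and variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The sky is blue.",
    "The Sun is bright.",
    "The Sun in the sky is bright.",
]
tfidf = TfidfVectorizer().fit_transform(docs)

# Each row of the TF-IDF matrix has Euclidean norm 1 (default norm='l2')
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)

# Sparse dot product of the first row (kept 2-D by slicing) with every row
dots = (tfidf[0:1] @ tfidf.T).toarray().ravel()

# Reference values from scikit-learn's cosine_similarity
cosines = cosine_similarity(tfidf[0:1], tfidf).ravel()

print(row_norms)
print(np.allclose(dots, cosines))
```

Slicing with `tfidf[0:1]` keeps the result two-dimensional, which is what the pairwise functions expect.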

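scikit-learn already provides pairwise metrics (a.k.a. kernels in machine learning parlance) that work for both dense and sparse representations of vector collections. In this case we need a dot product, which is also known as the linear kernel: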

    >>> from sklearn.metrics.pairwise import linear_kernel
    >>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
    >>> cosine_similarities
    array([ 1.        ,  0.04405952,  0.11016969, ...,  0.04433602,
        0.04457106,  0.03293218])

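Hence, to find the most related documents, we can use argsort and some negative array slicing (the most related documents have the highest cosine similarity values, so they sit at the end of the sorted index array). Note that the slice [:-5:-1] returns the top four entries; use [:-6:-1] to get five: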

    >>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
    >>> related_docs_indices
    array([    0,   958, 10576,  3277])
    >>> cosine_similarities[related_docs_indices]
    array([ 1.        ,  0.54967926,  0.32902194,  0.2825788 ])

The first result is a sanity check: we find the query document itself as the most similar document, with a cosine similarity score of 1. It has the following text:

    >>> print(twenty.data[0])
    From: lerxst@wam.umd.edu (where's my thing)
    Subject: WHAT car is this!?
    Nntp-Posting-Host: rac3.wam.umd.edu
    Organization: University of Maryland, College Park
    Lines: 15

     I was wondering if anyone out there could enlighten me on this car I saw
    the other day. It was a 2-door sports car, looked to be from the late 60s/
    early 70s. It was called a Bricklin. The doors were really small. In addition,
    the front bumper was separate from the rest of the body. This is
    all I know. If anyone can tellme a model name, engine specs, years
    of production, where this car is made, history, or whatever info you
    have on this funky looking car, please e-mail.

    Thanks,
    - IL
       ---- brought to you by your neighborhood Lerxst ----

The second most similar document is a reply that quotes the original message and therefore shares many words with it:

    >>> print(twenty.data[958])
    From: rseymour@reed.edu (Robert Seymour)
    Subject: Re: WHAT car is this!?
    Article-I.D.: reed.1993Apr21.032905.29286
    Reply-To: rseymour@reed.edu
    Organization: Reed College, Portland, OR
    Lines: 26

    In article <1993Apr20.174246.14375@wam.umd.edu> lerxst@wam.umd.edu (where's my thing) writes:
    >
    >  I was wondering if anyone out there could enlighten me on this car I saw
    > the other day. It was a 2-door sports car, looked to be from the late 60s/
    > early 70s. It was called a Bricklin. The doors were really small. In addition,
    > the front bumper was separate from the rest of the body. This is
    > all I know. If anyone can tellme a model name, engine specs, years
    > of production, where this car is made, history, or whatever info you
    > have on this funky looking car, please e-mail.

    Bricklins were manufactured in the 70s with engines from Ford. They are rather
    odd looking with the encased front bumper. There aren't a lot of them around,
    but Hemmings (Motor News) ususally has ten or so listed. Basically, they are a
    performance Ford with new styling slapped on top.

    >    ---- brought to you by your neighborhood Lerxst ----

    Rush fan?

    --
    Robert Seymour                          rseymour@reed.edu
    Physics and Philosophy, Reed College    (NeXTmail accepted)
    Artificial Life Project                 Reed College
    Reed Solar Energy Project (SolTrain)    Portland, OR
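The argsort slicing used in this answer is easy to get off by one, so here is a small NumPy-only sketch of how it behaves. The similarity scores are invented for illustration, with index 0 playing the role of the query document.

```python
import numpy as np

# Made-up similarity scores for six documents; index 0 is the query itself
cosine_similarities = np.array([1.0, 0.04, 0.55, 0.11, 0.33, 0.28])

# argsort returns indices in ascending order of score
order = cosine_similarities.argsort()

# Walk backwards from the end and stop before position -5: four items
four_best = order[:-5:-1]

# A less error-prone spelling for the k best: reverse, then take the first k
k = 5
k_best = order[::-1][:k]

print(four_best)
print(k_best)
```

If the query is part of the indexed collection, the first hit is the query itself, so asking for k + 1 results and dropping the first one gives k genuine neighbours.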

Gunjan · Answer

I know this is an old post, but I tried the http://scikit-learn.sourceforge.net/stable/ package. The question was how to calculate cosine similarity with this package, and here is my code for that:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity


    def read_document(path):
        # Read the file as UTF-8, ignoring bytes that cannot be decoded
        with open(path, encoding="utf-8", errors="ignore") as f:
            return f.read()


    doc1 = read_document("/root/Myfolder/scoringDocuments/doc1")
    doc2 = read_document("/root/Myfolder/scoringDocuments/doc2")
    doc3 = read_document("/root/Myfolder/scoringDocuments/doc3")

    train_set = ["president of India", doc1, doc2, doc3]

    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set)  # finds the tfidf score with normalization

    # here the first element of tfidf_matrix_train is matched with other three elements
    print("cosine scores ==> ", cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train))

Here, suppose the query is the first element of train_set, and doc1, doc2 and doc3 are the documents I want to rank by cosine similarity. Then I can use this code.

Also, the tutorials provided in the question were very useful. Here are all the parts: part-I, part-II, part-III

The output will be as follows:

    [[ 1.          0.07102631  0.02731343  0.06348799]]

Here, 1 means that the query is matched with itself, and the other three values are the scores for matching the query against the respective documents.

Salvador Dali · Answer

Let me show you another tutorial written by me. It answers your question, but it also explains why we do some of the things. I also tried to keep it concise.

So you have list_of_documents, which is just an array of strings, and another document, which is just a string. You need to find the document in list_of_documents that is most similar to document.

Let's combine them together: documents = list_of_documents + [document]

Let's start with the dependencies. It will become clear why we use each of them.

    from nltk.corpus import stopwords
    import string
    from nltk.tokenize import wordpunct_tokenize as tokenize
    from nltk.stem.porter import PorterStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from scipy.spatial.distance import cosine

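One of the approaches that can be used is the bag-of-words approach, where we treat each word in the document independently of the others and just throw all of them together into one big bag. From one point of view, this loses a lot of information (such as how the words are connected), but from another point of view it keeps the model simple.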

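In English and in any other human language there are a lot of "useless" words like 'a', 'the' and 'in', which are so common that they do not carry much meaning. They are called stop words, and it is a good idea to remove them. Another thing one can notice is that words like 'analyze', 'analyzer' and 'analysis' are really similar. They have a common root and can all be converted to just one word. This process is called stemming, and there are different stemmers that differ in speed, aggressiveness and so on. So we transform each of the documents into a list of word stems without stop words. We also discard all the punctuation.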

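    porter = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    # Translation table that deletes every punctuation character
    remove_punctuation = str.maketrans('', '', string.punctuation)

    # For each document: strip punctuation, tokenize, drop stop words, stem
    modified_arr = [
        [porter.stem(i.lower())
         for i in tokenize(d.translate(remove_punctuation))
         if i.lower() not in stop_words]
        for d in documents
    ]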

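So how will this bag of words help us? Imagine we have 3 bags: [a, b, c], [a, c, a] and [b, c, d]. We can convert them to vectors in the basis [a, b, c, d]. So we end up with the vectors [1, 1, 1, 0], [2, 0, 1, 0] and [0, 1, 1, 1]. The same thing happens with our documents (only the vectors will be much longer). Now we see why we removed a lot of words and stemmed the others: to decrease the dimensions of the vectors. There is one more interesting observation here. Longer documents will have many more positive elements than shorter ones, which is why it is nice to normalize the vectors. This is called term frequency (TF). People also use additional information about how often the word is used in other documents, the inverse document frequency (IDF). Together they give the metric TF-IDF, which comes in a couple of flavors. This can be achieved with one line in sklearn :-)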

    modified_doc = [' '.join(i) for i in modified_arr]  # this is only to convert our list of lists to list of strings that vectorizer uses.
    tf_idf = TfidfVectorizer().fit_transform(modified_doc)
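The three-bag example from the explanation above can be reproduced in a few lines of plain Python. This is only a sketch of the counting step, with an illustrative helper name `to_count_vector`; it does not apply any TF-IDF weighting.

```python
from collections import Counter

bags = [["a", "b", "c"], ["a", "c", "a"], ["b", "c", "d"]]

# The basis is the sorted set of all words seen in any bag
basis = sorted({word for bag in bags for word in bag})


def to_count_vector(bag, basis):
    """Count how often each basis word occurs in the bag (illustrative helper)."""
    counts = Counter(bag)
    return [counts[word] for word in basis]


vectors = [to_count_vector(bag, basis) for bag in bags]

print(basis)
for vector in vectors:
    print(vector)
```

Every bag maps to a vector of the same length, which is what makes a cosine comparison between documents possible.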

Actually, the vectorizer lets you do a lot of things, such as removing stop words and lowercasing. I did them in a separate step only because sklearn does not have non-English stop words, while nltk does.
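For English text, the built-in options mentioned above may be enough on their own. The sketch below assumes Python 3 and a recent scikit-learn, and uses the `stop_words="english"` and `lowercase` parameters of TfidfVectorizer; the two sentences are borrowed from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The sky is blue.", "The Sun is bright."]

# Lowercase the text and drop scikit-learn's built-in English stop words
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
tfidf = vectorizer.fit_transform(docs)

# vocabulary_ maps each kept term to its column index
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(tfidf.shape)
```

Note that this does no stemming; that would still need a custom tokenizer or a separate preprocessing step like the one shown earlier in this answer.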

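So now we have all the vectors calculated. The last step is to find which one is most similar to the last one. There are various ways to achieve that. One of them is Euclidean distance, which is not so great for the reason discussed here. Another approach is cosine similarity. We iterate over all the documents and calculate the cosine distance between each document and the last one: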

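    l = len(documents) - 1  # index of the last row, i.e. the query document
    query = tf_idf[l].toarray().ravel()

    # (cosine distance, index) of the best match so far
    minimum = (float('inf'), None)
    for i in range(l):
        minimum = min((cosine(tf_idf[i].toarray().ravel(), query), i), minimum)
    print(minimum)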

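Now minimum holds the best matching document and its score. Because scipy's cosine returns a distance (1 minus the similarity), the smallest value marks the most similar document.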
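One commonly cited reason why Euclidean distance can mislead on raw counts is document length; whether that is the exact argument behind the link above is an assumption. The NumPy-only sketch below uses invented count vectors to compare both measures.

```python
import numpy as np

short_doc = np.array([1.0, 1.0, 0.0])   # word counts of a short document
long_doc = 10 * short_doc               # the same wording repeated ten times
other_doc = np.array([1.0, 0.0, 1.0])   # a different short document


def euclidean(a, b):
    """Straight-line distance between two count vectors."""
    return float(np.linalg.norm(a - b))


def cosine_sim(a, b):
    """Cosine of the angle between two count vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Euclidean distance says the unrelated short document is closer ...
print(euclidean(short_doc, long_doc), euclidean(short_doc, other_doc))

# ... while cosine similarity ignores length and sees identical direction
print(cosine_sim(short_doc, long_doc), cosine_sim(short_doc, other_doc))
```

Once vectors are normalized to unit length, as TfidfVectorizer does by default, the two measures rank neighbours in the same order, so the difference matters mostly for unnormalized vectors.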

Sam · Answer

This should help you.

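    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # train_set is assumed to be a list of document strings defined elsewhere,
    # with the query as its last element
    length = len(train_set)

    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(train_set)
    print(tfidf_matrix)

    # Compare the last document with every document in the matrix
    cosine = cosine_similarity(tfidf_matrix[length-1], tfidf_matrix)
    print(cosine)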

The output will be as follows:

    [[ 0.34949812  0.81649658  1.        ]]

Paul Ogier · Answer

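Here is a function that compares your test data against the training data, with the Tf-Idf transformer fitted on the training data. The advantage is that you can quickly pivot or group by to find the n closest elements, and that the calculations are done matrix-wise.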

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity


    def create_tokenizer_score(new_series, train_series, tokenizer):
        """
        return the tf idf score of each possible pairs of documents
        Args:
            new_series (pd.Series): new data (To compare against train data)
            train_series (pd.Series): train data (To fit the tf-idf transformer)
            tokenizer: vectorizer providing fit_transform and transform
        Returns:
            pd.DataFrame
        """
        train_tfidf = tokenizer.fit_transform(train_series)
        new_tfidf = tokenizer.transform(new_series)
        # Rows are new documents, columns are training documents
        X = pd.DataFrame(cosine_similarity(new_tfidf, train_tfidf), columns=train_series.index)
        X['ix_new'] = new_series.index
        # Long format: one row per (new document, training document) pair
        score = pd.melt(
            X,
            id_vars='ix_new',
            var_name='ix_train',
            value_name='score'
        )
        return score


    train_set = pd.Series(["The sky is blue.", "The Sun is bright."])
    test_set = pd.Series(["The Sun in the sky is bright."])
    tokenizer = TfidfVectorizer()  # initiate here your own tokenizer (TfidfVectorizer, CountVectorizer, with stopwords...)
    score = create_tokenizer_score(train_series=train_set, new_series=test_set, tokenizer=tokenizer)
    print(score)

The resulting score table:

       ix_new ix_train     score
    0       0        0  0.617034
    1       0        1  0.862012
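The "pivot or group by" step mentioned in this answer could look roughly like the sketch below. It assumes Python 3 with pandas and uses a hand-made score table with the same three columns as the function's output; the numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical long-format scores with the same columns as the function's output
score = pd.DataFrame({
    "ix_new":   [0, 0, 0, 1, 1, 1],
    "ix_train": [0, 1, 2, 0, 1, 2],
    "score":    [0.62, 0.86, 0.10, 0.05, 0.40, 0.91],
})

# Group by: keep the n best training documents for every new document
n = 2
top_n = (
    score.sort_values("score", ascending=False)
         .groupby("ix_new")
         .head(n)
         .sort_values(["ix_new", "score"], ascending=[True, False])
         .reset_index(drop=True)
)

# Pivot: one row per new document, one column per training document
wide = score.pivot(index="ix_new", columns="ix_train", values="score")

print(top_n)
print(wide)
```

The long format keeps the ranking logic independent of how many training documents there are, while the wide view is handy for eyeballing a small similarity matrix.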