How is TF-IDF calculated by the scikit-learn TfidfVectorizer?

Question

I am running the following code to convert a text matrix to a TF-IDF matrix.

from sklearn.feature_extraction.text import TfidfVectorizer

text = ['This is a string', 'This is another string', 'TFIDF computation calculation', 'TfIDF is the product of TF and IDF']
vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, stop_words='english', norm=None)
X = vectorizer.fit_transform(text)
X_vovab = vectorizer.get_feature_names()
X_mat = X.todense()
X_idf = vectorizer.idf_

I get the following output:

X_vovab =

[u'calculation', u'computation', u'idf', u'product', u'string', u'tf', u'tfidf']

and X_mat =

([[ 0.        , 0.        , 0.        , 0.        , 1.51082562, 0.        , 0.        ],
  [ 0.        , 0.        , 0.        , 0.        , 1.51082562, 0.        , 0.        ],
  [ 1.91629073, 1.91629073, 0.        , 0.        , 0.        , 0.        , 1.51082562],
  [ 0.        , 0.        , 1.91629073, 1.91629073, 0.        , 1.91629073, 1.51082562]])

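Now I don't understand how these scores are computed. My understanding is that for text[0], a score is computed only for 'string', and that score appears in the 5th column. But if TF-IDF is the product of the term frequency, which is 2, and the IDF, which is log(4/2), the result should be 1.39, not the 1.51 shown in the matrix. How is the TF-IDF score calculated in scikit-learn?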

Rabbit · Accepted Answer

TF-IDF is computed in several steps by scikit-learn's TfidfVectorizer, which inherits from CountVectorizer and uses TfidfTransformer internally.

To make this easier to follow, here is a summary of the steps it performs:

The tfs are calculated by CountVectorizer's fit_transform()
The idfs are calculated by TfidfTransformer's fit()
The tfidfs are calculated by TfidfTransformer's transform()
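The three steps above can be sketched by calling the two classes separately and comparing the result with TfidfVectorizer. This is only an illustration built on the question's data; the variable names (counter, transformer, two_step, one_step) are made up for the example, and it assumes scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

text = [
    "This is a string",
    "This is another string",
    "TFIDF computation calculation",
    "TfIDF is the product of TF and IDF",
]

# Step 1: raw term counts (the tfs).
counter = CountVectorizer(stop_words="english")
counts = counter.fit_transform(text)

# Step 2: fit() learns one idf weight per vocabulary term.
transformer = TfidfTransformer(norm=None)
transformer.fit(counts)

# Step 3: transform() scales the counts by the learned idf weights.
two_step = transformer.transform(counts).toarray()

# The single-class route used in the question, for comparison.
one_step = (
    TfidfVectorizer(stop_words="english", norm=None)
    .fit_transform(text)
    .toarray()
)

# Both routes should produce the same matrix.
print(np.allclose(two_step, one_step))
```

Splitting the work this way is mostly useful when the count matrix already exists, or when the idf weights need to be inspected on their own through transformer.idf_.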

You can check the scikit-learn source code for the details.

Back to your example. Here is the calculation performed for the tfidf weight of the 5th term of the vocabulary in the 1st document (X_mat[0,4]).

First, the tf of 'string' in the 1st document:

tf = 1

Second, the idf of 'string', with smoothing enabled (the default behavior):

df = 2
N = 4
idf = ln((N + 1) / (df + 1)) + 1 = ln(5 / 3) + 1 = 1.5108256238

And finally, the tfidf weight for (document 0, feature 4):

tfidf(0,4) = tf * idf = 1 * 1.5108256238 = 1.5108256238
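The same arithmetic can be repeated for every cell to rebuild the whole X_mat by hand. The sketch below uses only the standard library and hard-codes the term counts read off the question's vocabulary; it is a hand-rolled check, not scikit-learn's own implementation.

```python
import math

# Term counts per document, columns in the vocabulary order from the question:
# calculation, computation, idf, product, string, tf, tfidf
counts = [
    [0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0, 1],
    [0, 0, 1, 1, 0, 1, 1],
]

n_docs = len(counts)
n_terms = len(counts[0])

# df: number of documents in which each term appears at least once.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf: ln((N + 1) / (df + 1)) + 1
idf = [math.log((n_docs + 1) / (d + 1)) + 1 for d in df]

# Unnormalized tf-idf: tf * idf, element by element.
tfidf = [[tf * w for tf, w in zip(row, idf)] for row in counts]

for row in tfidf:
    print([round(v, 8) for v in row])
```

Terms that occur in a single document (df = 1) get ln(5 / 2) + 1, which is the 1.91629073 seen in the matrix, while 'string' and 'tfidf' (df = 2) get 1.51082562.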

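I noticed that you chose not to normalize the tfidf matrix. Keep in mind that normalizing it is a common and usually recommended approach, since most models expect the feature matrix (or design matrix) to be normalized.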

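By default, TfidfVectorizer L2-normalizes the output matrix as the final step of the calculation. Normalized here means that each row has unit Euclidean length, so, with non-negative counts, every weight lies between 0 and 1.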
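As a rough illustration of what that final step does to one row, here is the L2 normalization applied by hand to the third row of the question's X_mat. This is a standard-library sketch rather than scikit-learn's code, and it does not guard against an all-zero row, which would need special handling.

```python
import math

# Unnormalized tf-idf row for the third document in the question (X_mat[2]).
row = [1.91629073, 1.91629073, 0.0, 0.0, 0.0, 0.0, 1.51082562]

# L2 norm: square root of the sum of squared weights.
l2 = math.sqrt(sum(v * v for v in row))

# Divide each weight by the norm so the row has unit length.
normalized = [v / l2 for v in row]

print([round(v, 4) for v in normalized])
print(round(sum(v * v for v in normalized), 6))
```

After this step the absolute scale of the weights is gone and only their relative sizes within a document remain, which is why the question's norm=None setting was needed to see the raw tf * idf values.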

Christian Hirsch · Answer

The exact formula is given in the docs:

The actual formula used for tf-idf is tf * (idf + 1) = tf + tf * idf, instead of tf * idf

and

Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once.

That means 1.51082562 is obtained as 1.51082562 = 1 + ln((4 + 1) / (2 + 1))
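To see how much each of the two adjustments contributes, the variants can be compared side by side for 'string' (N = 4, df = 2). The idf helper below is hypothetical and written only for this comparison; it is not a scikit-learn function. As far as I know, the middle variant corresponds to what TfidfTransformer computes with smooth_idf=False.

```python
import math


def idf(n_docs, df, smooth=True, add_one=True):
    """Hypothetical helper (not part of scikit-learn) for comparing idf variants."""
    if smooth:
        # Pretend one extra document contains every term exactly once.
        n_docs, df = n_docs + 1, df + 1
    value = math.log(n_docs / df)
    return value + 1 if add_one else value


n_docs, df = 4, 2  # 'string' appears in 2 of the 4 documents

textbook = idf(n_docs, df, smooth=False, add_one=False)  # ln(4 / 2)
plus_one = idf(n_docs, df, smooth=False, add_one=True)   # ln(4 / 2) + 1
smoothed = idf(n_docs, df, smooth=True, add_one=True)    # ln(5 / 3) + 1

print(round(textbook, 4), round(plus_one, 4), round(smoothed, 8))
```

Only the last variant reproduces the 1.51082562 in the question's matrix; the plain ln(4 / 2) that the question started from is the first one.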

Poonam Agrawal · Answer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
print(corpus)

# Term frequencies (raw counts per document)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
print(vectorizer.get_feature_names_out())
z = X.toarray()
print(z)

# idf weights and the tf-idf matrix
vectorizer1 = TfidfVectorizer(min_df=1)
X1 = vectorizer1.fit_transform(corpus)
idf = vectorizer1.idf_
print(dict(zip(vectorizer1.get_feature_names_out(), idf)))  # idf per term
print(X1.toarray())  # tf-idf (L2-normalized, since norm is left at its default)

# Formula, using the numbers for 'string' from the question:
# df = 2
# N = 4
# idf = ln((N + 1) / (df + 1)) + 1 = ln(5 / 3) + 1 = 1.5108256238
# tfidf(0,4) = tf * idf = 1 * 1.5108256238 = 1.5108256238