Keras Tokenizer num_words doesn't seem to work

Question

>>> from keras.preprocessing.text import Tokenizer
>>> t = Tokenizer(num_words=3)
>>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
>>> t.fit_on_texts(l)
>>> t.word_index
{'fantastic': 6, 'like': 10, 'no': 8, 'this': 2, 'is': 3, 'there': 7, 'one': 11, 'other': 9, 'so': 5, 'world': 1, 'hello': 4}

I expected t.word_index to contain only the top 3 words. What am I doing wrong?

Marcin Możejko · Accepted Answer

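There is nothing wrong with what you are doing. word_index is computed the same way regardless of how many of the most frequent words you will use later (see here). When you call any of the transformation methods, the Tokenizer uses only the most frequent words, as limited by num_words, while still keeping counters for all words, even those that will clearly never be used.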
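The split between fitting and transforming can be mimicked in plain Python. This is a simplified sketch, not the Keras source: the helper names (`tokenize`, `fit`, `to_sequences`) are hypothetical, the filter characters are assumed to match the Tokenizer defaults, and the cut-off rule (keep only indices strictly below `num_words`) is an assumption about how the limit is applied.

```python
from collections import Counter

# Characters treated as separators (assumed to match the Tokenizer's default filters).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def tokenize(text):
    """Lowercase, turn filtered characters into spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def fit(texts):
    """Build a 1-based index over ALL words, most frequent first.

    Note that num_words plays no role here.
    """
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # sorted() is stable, so ties keep their first-seen order.
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {word: i for i, word in enumerate(ordered, start=1)}


def to_sequences(texts, word_index, num_words=None):
    """Convert texts to index lists, applying the num_words limit only now."""
    sequences = []
    for text in texts:
        seq = []
        for word in tokenize(text):
            i = word_index.get(word)
            if i is None:
                continue  # word never seen during fit
            if num_words and i >= num_words:
                continue  # assumed cut-off: drop indices >= num_words
            seq.append(i)
        sequences.append(seq)
    return sequences


texts = ["Hello, World! This is so&#$ fantastic!",
         "There is no other world like this one"]
word_index = fit(texts)
sequences = to_sequences(texts, word_index, num_words=3)

print(len(word_index))  # 11: the index still holds every word
print(sequences)        # [[1, 2], [1, 2]]: only 'world' and 'this' survive
```

The full index is kept, so the limit can be chosen or changed after fitting without recounting the corpus.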

farid khafizov · Answer

Limiting num_words to a small number (e.g. 3) has no effect on the outputs of fit_on_texts (such as word_index, word_counts, word_docs). It does affect texts_to_matrix: the resulting matrix will have num_words (3) columns.

>>> from keras.preprocessing.text import Tokenizer
>>> t = Tokenizer(num_words=3)
>>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
>>> t.fit_on_texts(l)
>>> print(t.word_index)
{'world': 1, 'this': 2, 'is': 3, 'hello': 4, 'so': 5, 'fantastic': 6, 'there': 7, 'no': 8, 'other': 9, 'like': 10, 'one': 11}
>>> t.texts_to_matrix(l, mode='count')
array([[0., 1., 1.],
       [0., 1., 1.]])
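If what you actually want is a dictionary restricted to the words that survive the limit, it can be derived from word_index after fitting. A minimal sketch, assuming the cut-off is "index strictly below num_words" (which is what the three-column matrix above, with column 0 unused, suggests). The name `top_words` is hypothetical, and the dictionary is copied from the output above so the snippet runs without Keras.

```python
# word_index as printed above; with a fitted Tokenizer you would use t.word_index.
word_index = {'world': 1, 'this': 2, 'is': 3, 'hello': 4, 'so': 5,
              'fantastic': 6, 'there': 7, 'no': 8, 'other': 9,
              'like': 10, 'one': 11}
num_words = 3

# Keep only the entries whose index falls below the limit (assumed cut-off rule).
top_words = {word: i for word, i in word_index.items() if i < num_words}

print(top_words)  # {'world': 1, 'this': 2}
```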