TensorFlow vocabulary processor

Question

I am following the WildML blog post on text classification with TensorFlow. I don't understand the purpose of max_document_length in this statement:

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

Also, how can I extract the vocabulary from vocab_processor?

Nitin · Accepted Answer

I figured out how to extract the vocabulary from the VocabularyProcessor object. This worked perfectly for me.

    import numpy as np
    from tensorflow.contrib import learn

    x_text = ['This is a cat', 'This must be boy', 'This is a a dog']
    max_document_length = max([len(x.split(" ")) for x in x_text])

    ## Create the VocabularyProcessor object, setting the max length of the documents.
    vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

    ## Transform the documents using the vocabulary.
    x = np.array(list(vocab_processor.fit_transform(x_text)))

    ## Extract the word:id mapping from the object.
    vocab_dict = vocab_processor.vocabulary_._mapping

    ## Sort the vocabulary dictionary on the basis of values (ids).
    ## Both statements perform the same task (the first also needs `import operator`).
    #sorted_vocab = sorted(vocab_dict.items(), key=operator.itemgetter(1))
    sorted_vocab = sorted(vocab_dict.items(), key=lambda x: x[1])

    ## Treat the ids as indexes into a list and create a list of words in ascending order of id.
    ## The word with id i goes at index i of the list.
    vocabulary = list(list(zip(*sorted_vocab))[0])

    print(vocabulary)
    print(x)
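The sort-by-id step can be tried without TensorFlow installed. The sketch below uses a hand-written dictionary as a stand-in for `vocab_processor.vocabulary_._mapping`; the words and ids are illustrative assumptions, not output captured from the library.

```python
# Hand-written stand-in for vocab_processor.vocabulary_._mapping (word -> id).
# The ids are illustrative assumptions, not output captured from TensorFlow.
vocab_dict = {'is': 2, '<UNK>': 0, 'cat': 4, 'This': 1, 'a': 3}

# Sort the (word, id) pairs by id, then keep only the words.
sorted_vocab = sorted(vocab_dict.items(), key=lambda item: item[1])
vocabulary = [word for word, _ in sorted_vocab]

# Because the ids are contiguous and start at 0, vocabulary[i] is the word with id i.
print(vocabulary)  # ['<UNK>', 'This', 'is', 'a', 'cat']
```

This only works as an index lookup when the ids have no gaps; if they did, a reversed dictionary (id -> word) would be the safer choice.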

Kirk Broadhurst · Answer

"I don't understand the purpose of max_document_length"

VocabularyProcessor maps your text documents into vectors, and these vectors need to be of a consistent length.

Your input data records may not (and probably won't) all be the same length. For example, if you are working with sentences for sentiment analysis, they will be of various lengths.

You provide this parameter to VocabularyProcessor so that it can adjust the length of the output vectors. According to the documentation,

max_document_length: Maximum length of documents. If documents are longer, they will be trimmed; if shorter, padded.

Check out the source code.

    def transform(self, raw_documents):
        """Transform documents to word-id matrix.

        Convert words to ids with vocabulary fitted with fit or the one
        provided in the constructor.

        Args:
            raw_documents: An iterable which yield either str or unicode.

        Yields:
            x: iterable, [n_samples, max_document_length]. Word-id matrix.
        """
        for tokens in self._tokenizer(raw_documents):
            word_ids = np.zeros(self.max_document_length, np.int64)
            for idx, token in enumerate(tokens):
                if idx >= self.max_document_length:
                    break
                word_ids[idx] = self.vocabulary_.get(token)
            yield word_ids

Note the line word_ids = np.zeros(self.max_document_length, np.int64): every output vector starts out as max_document_length zeros, and the loop stops once that many tokens have been written.

Each row in the raw_documents variable will be mapped to a vector of length max_document_length.
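The trimming and padding behaviour can be reproduced in a few lines of plain Python. This is a minimal sketch rather than the library's implementation: `to_fixed_length`, the toy `vocab` dictionary, and `pad_id` are hypothetical names, and it assumes id 0 serves both as padding and as the id for unknown words.

```python
def to_fixed_length(tokens, vocab, max_document_length, pad_id=0):
    """Map tokens to ids, trimming or zero-padding to max_document_length."""
    # Start from an all-padding vector of the target length.
    word_ids = [pad_id] * max_document_length
    for idx, token in enumerate(tokens):
        if idx >= max_document_length:
            break  # trim: tokens beyond the limit are dropped
        # Assumption: unknown words fall back to pad_id (0).
        word_ids[idx] = vocab.get(token, pad_id)
    return word_ids


# Toy vocabulary (hypothetical ids).
vocab = {'This': 1, 'is': 2, 'a': 3, 'cat': 4, 'dog': 5}

padded = to_fixed_length('This is a cat'.split(' '), vocab, 5)
trimmed = to_fixed_length('This is a a dog'.split(' '), vocab, 3)

print(padded)   # [1, 2, 3, 4, 0]
print(trimmed)  # [1, 2, 3]
```

A four-word sentence with a limit of 5 gets one trailing zero, while a five-word sentence with a limit of 3 loses its last two words, so every document ends up with the same length.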