How do I stem the words in a Python list?

Question

I have a Python list like the one below:

documents = ["Human machine interface for lab abc computer applications", "A survey of user opinion of computer system response time", "The EPS user interface management system", "System and human system engineering testing of EPS", "Relation of user perceived response time to error measurement", "The generation of random binary unordered trees", "The intersection graph of paths in trees", "Graph minors IV Widths of trees and well quasi ordering", "Graph minors A survey"]

Now I need to stem it (each word) and get another list. How can I do that?

Gareth Latty · Accepted Answer

from stemming.porter2 import stem

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Split each sentence on spaces and stem every word, keeping one inner list per sentence.
documents = [[stem(word) for word in sentence.split(" ")] for sentence in documents]

This uses a list comprehension to loop over each string in the main list and split it into a list of words. It then loops over that list, stems each word, and returns a new list of stemmed words.

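Note that I haven't tried this with stemming installed; I took the import from the comments and have never used the package myself. It does, however, show the basic concept of splitting the list into words. The result is a list of lists of words, which preserves the original separation between sentences.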
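To make the shape of that result concrete without installing anything, here is a minimal sketch. `toy_stem` is a hypothetical stand-in that only lowercases a word and strips one common suffix; it is not the Porter algorithm and is there purely so the example runs on its own.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercase, then strip one suffix."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


documents = [
    "The intersection graph of paths in trees",
    "Graph minors A survey",
]

# One inner list per sentence, so sentence boundaries are preserved.
stemmed = [[toy_stem(word) for word in sentence.split(" ")] for sentence in documents]

print(stemmed)
```

Each inner list corresponds to one of the original sentences, which is the separation mentioned above.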

If you don't want this separation, you can do:

documents = [stem(word) for sentence in documents for word in sentence.split(" ")]

This leaves you with one continuous list instead.

If you want to join the words back together at the end, you can do:

documents = [" ".join(sentence) for sentence in documents]

or, to do it in one line:

documents = [" ".join([stem(word) for word in sentence.split(" ")]) for sentence in documents]

when you want to keep the sentence structure, or

documents = " ".join(documents)

when you want to ignore it.
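The two joining options can be compared side by side in a short sketch. `fake_stem` is a hypothetical placeholder that only lowercases its input, so the example runs without the stemming package; a real stemmer would be swapped in at that point.

```python
def fake_stem(word):
    # Hypothetical placeholder; replace with a real stemmer such as stemming.porter2.stem.
    return word.lower()


documents = [
    "The EPS user interface management system",
    "Graph minors A survey",
]

nested = [[fake_stem(word) for word in sentence.split(" ")] for sentence in documents]

# Keep sentence structure: one processed string per original sentence.
per_sentence = [" ".join(words) for words in nested]

# Ignore sentence structure: a single string for the whole collection.
combined = " ".join(per_sentence)

print(per_sentence)
print(combined)
```

The first form is probably the one to keep if later steps work per document; the second is only useful when the collection is treated as a single text.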

Thomas · Answer

I recommend taking a look at NLTK (Natural Language ToolKit). It has a module, nltk.stem, which contains a variety of different stemmers.

See also this question.

cha0site · Answer

Alright. So, using the stemming package, you'd have something like this:

from itertools import chain
from stemming.porter2 import stem

def flatten(listOfLists):
    "Flatten one level of nesting"
    return list(chain.from_iterable(listOfLists))

def stemall(documents):
    # Stem every word of every line, then merge the per-line lists into one flat list.
    return flatten([[stem(word) for word in line.split(" ")] for line in documents])
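As a quick check of what flattening does here, the sketch below shows that `chain.from_iterable` over the nested result gives the same flat list as a single comprehension with two `for` clauses. `fake_stem` is a hypothetical placeholder that only lowercases, used so the snippet has no third-party dependency.

```python
from itertools import chain


def fake_stem(word):
    # Hypothetical placeholder for a real stemmer.
    return word.lower()


documents = [
    "Graph minors A survey",
    "The intersection graph of paths in trees",
]

nested = [[fake_stem(word) for word in line.split(" ")] for line in documents]

# Flatten one level of nesting with itertools.
flat_via_chain = list(chain.from_iterable(nested))

# The same result from one comprehension with two for-clauses.
flat_via_comprehension = [
    fake_stem(word) for line in documents for word in line.split(" ")
]

print(flat_via_chain == flat_via_comprehension)
```

Which form to use is mostly a matter of taste; the `flatten` helper can be reused on any list of lists, while the comprehension avoids building the intermediate nested list.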

Arash Hatami · Answer

You can use NLTK:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
# documents is the list of sentences from the question.
final = [[ps.stem(token) for token in sentence.split(" ")] for sentence in documents]

NLTK has many features that are useful for IR systems; check it out.

Thomas Decaux · Answer

You can use whoosh (http://whoosh.readthedocs.io/):

from whoosh.analysis import CharsetFilter, StemmingAnalyzer
from whoosh import fields
from whoosh.support.charset import accent_map

# Chain the stemming analyzer with a filter that folds accented characters.
my_analyzer = StemmingAnalyzer() | CharsetFilter(accent_map)
tokens = my_analyzer("hello you, comment ça va ?")
words = [token.text for token in tokens]
print(' '.join(words))

Ghazal · Answer

from nltk.stem import PorterStemmer

ps = PorterStemmer()
# words is assumed to be a flat list of words (renamed from list to avoid shadowing the builtin).
list_stem = [ps.stem(word) for word in words]

9113303 · Answer

For stemming, you can use PorterStemmer or LancasterStemmer.
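A minimal sketch for comparing the two, assuming NLTK is installed (`pip install nltk`). The sample sentences come from the question; the `stemmers` and `results` names are only for illustration. No expected output is shown because the two algorithms can produce different stems for the same word.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

documents = [
    "The generation of random binary unordered trees",
    "Graph minors A survey",
]

stemmers = {"porter": PorterStemmer(), "lancaster": LancasterStemmer()}

# One nested list of stems per stemmer, so the outputs can be compared side by side.
results = {
    name: [[stemmer.stem(word) for word in sentence.split()] for sentence in documents]
    for name, stemmer in stemmers.items()
}

for name, stemmed in results.items():
    print(name, stemmed)
```

Running both on a sample of your own data is probably the easiest way to decide which one suits the task.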