Need a Python module for stemming text documents

Question

I need a good Python module for stemming text documents during the preprocessing stage.

I found this one:

http://pypi.python.org/pypi/PyStemmer/1.0.1

However, I cannot find any documentation at that link.

If anyone knows where the documentation is, or can recommend other good stemming algorithms, please help.

ditkin · Accepted Answer

You may want to try NLTK.

>>> from nltk import PorterStemmer
>>> PorterStemmer().stem('complications')

0xF · Answer

All of the stemmers discussed here are algorithmic stemmers, so they can always produce unexpected results such as:

In [3]: from nltk.stem.porter import *
In [4]: stemmer = PorterStemmer()
In [5]: stemmer.stem('identified')
Out[5]: u'identifi'
In [6]: stemmer.stem('nonsensical')
Out[6]: u'nonsens'

To get the root words right, you need a dictionary-based stemmer such as the Hunspell stemmer. Here is a Python implementation: link. Sample code:

>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/myspell/en_US.dic', '/usr/share/myspell/en_US.aff')
>>> hobj.spell('spookie')
False
>>> hobj.suggest('spookie')
['spookier', 'spookiness', 'spooky', 'spook', 'spoonbill']
>>> hobj.spell('spooky')
True
>>> hobj.analyze('linked')
[' st:link fl:D']
>>> hobj.stem('linked')
['link']
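To make the difference between the two approaches concrete, here is a minimal pure-Python sketch. It is neither Porter nor Hunspell: the suffix rules and the tiny lexicon are invented for illustration, and the names rule_stem and dict_stem are hypothetical helpers, not part of any library.

```python
# Toy illustration only: neither function reproduces Porter or Hunspell.

# Hypothetical suffix rules, tried in order; the first match wins.
SUFFIX_RULES = [("ied", "i"), ("ical", ""), ("ing", ""), ("ed", ""), ("s", "")]

# Hypothetical lexicon mapping inflected forms to dictionary headwords.
LEXICON = {
    "identified": "identify",
    "nonsensical": "nonsense",
    "linked": "link",
}


def rule_stem(word):
    """Strip the first matching suffix without checking the result is a word."""
    for suffix, replacement in SUFFIX_RULES:
        # Require at least three characters to remain so short words are untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def dict_stem(word, lexicon=LEXICON):
    """Look the word up; fall back to the word itself when it is unknown."""
    return lexicon.get(word, word)


for w in ("identified", "nonsensical", "linked"):
    print(w, rule_stem(w), dict_stem(w))
```

The rule-based version can return fragments that are not words, while the lookup version returns real words but only for entries that exist in its dictionary.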

shiva · Answer

The stemming package for Python provides implementations of several stemming algorithms, including Porter, Porter2, Paice-Husk, and Lovins. http://pypi.python.org/pypi/stemming/1.

>>> from stemming.porter2 import stem
>>> stem("factionally")
'faction'

KenHBS · Answer

The gensim package for topic modeling ships with the Porter stemmer algorithm.

>>> from gensim import parsing
>>> parsing.stem_text("trying writing nonsense")
'try write nonsens'

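The Porter stemmer is the only stemming option implemented in gensim.

Side note: I would imagine that most text-mining modules have their own implementations of simple preprocessing steps such as Porter stemming, whitespace removal, and stop-word removal.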
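As a rough sketch of what such a preprocessing step might look like, the following uses only the standard library. The function name preprocess and the short STOP_WORDS set are assumptions made up for this example; the stem argument accepts any callable, so a stemmer from one of the libraries mentioned in this thread could be passed in.

```python
import re

# Tiny illustrative stop-word list; real libraries ship much longer ones.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def preprocess(text, stem=lambda w: w, stop_words=STOP_WORDS):
    """Lowercase, collapse whitespace, drop stop words, then stem each token."""
    # Collapse runs of whitespace, trim the ends, and lowercase.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    # Keep alphabetic tokens only; punctuation and digits are discarded.
    tokens = re.findall(r"[a-z]+", normalized)
    return [stem(tok) for tok in tokens if tok not in stop_words]


print(preprocess("  The   stemming of   documents is   fun.  "))
```

With the default identity stemmer the tokens pass through unchanged, which makes it easy to check the whitespace and stop-word handling before plugging in a real stemmer.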

Brice M. Dempsey · Answer

PyStemmer is a Python interface to the Snowball stemming library.

The documentation is here:
https://github.com/snowballstem/pystemmer/blob/master/docs/quickstart.txt
https://github.com/snowballstem/pystemmer/blob/master/docs/quickstart_python3.txt
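For a quick idea of how PyStemmer is used, here is a small sketch. It assumes the package is installed (it is imported as Stemmer) and that the "english" algorithm is available; the wrapper stem_tokens is a hypothetical helper written for this example, and it simply returns None when the package is missing.

```python
# Assumes PyStemmer is installed (pip install PyStemmer); the module is named Stemmer.
try:
    import Stemmer
except ImportError:  # keep the sketch runnable when the package is missing
    Stemmer = None


def stem_tokens(tokens, language="english"):
    """Stem a list of words with Snowball; return None if PyStemmer is unavailable."""
    if Stemmer is None:
        return None
    stemmer = Stemmer.Stemmer(language)
    # stemWords takes a list of words and returns a list of stems.
    return stemmer.stemWords(tokens)


print(stem_tokens(["cycling", "cyclist"]))
```

A single word can be stemmed with stemWord instead of stemWords, and Stemmer.algorithms() lists the languages the installed build supports.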