How to generate bi/tri-grams using spaCy/NLTK

Question

The input text is always a list of dish names, each consisting of 1 to 3 adjectives and a noun.

Input:

thai iced tea
spicy fried chicken
sweet chili pork
thai chicken curry

Output:

thai tea, iced tea
spicy chicken, fried chicken
sweet pork, chili pork
thai chicken, chicken curry, thai curry

Basically, I am trying to parse the sentence tree and generate bigrams by pairing each adjective with a noun.

I would like to achieve this with spaCy or NLTK.

Petr Matuska · Answer

I used spaCy 2.0 with the English model. The idea is to find the nouns and the "not-nouns" in the parsed input, then combine the not-nouns with the nouns to build the desired output.

Your input:

s = ["thai iced tea", "spicy fried chicken", "sweet chili pork", "thai chicken curry",]

spaCy solution:

import spacy

# spaCy 2.x shortcut name; spaCy 3+ needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')

def noun_notnoun(phrase):
    doc = nlp(phrase)  # create spaCy Doc object
    token_not_noun = []
    notnoun_noun_list = []
    noun = None

    for item in doc:
        if item.pos_ != "NOUN":  # separate nouns and not-nouns
            token_not_noun.append(item.text)
        else:
            noun = item.text  # keeps the last noun seen

    if noun is not None:  # skip phrases in which no noun was tagged
        for notnoun in token_not_noun:
            notnoun_noun_list.append(notnoun + " " + noun)

    return notnoun_noun_list

Calling the function:

for phrase in s:
    print(noun_notnoun(phrase))

Result:

['thai tea', 'iced tea']
['spicy chicken', 'fried chicken']
['sweet pork', 'chili pork']
['thai chicken', 'curry chicken']
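
The pairing step does not depend on spaCy itself, so it can be sketched on pre-tagged tokens. This is a minimal illustration rather than the answer's code: the function name `pair_with_last_noun` and the tags in the sample input are assumptions. In particular, it assumes "curry" received some tag other than NOUN, which would explain the 'curry chicken' pair in the result above.

```python
def pair_with_last_noun(tagged):
    """Pair every non-noun token with the last noun in the phrase.

    `tagged` is a list of (text, coarse_pos) tuples, e.g. as read
    from token.text and token.pos_ of a spaCy Doc.
    """
    not_nouns = [text for text, pos in tagged if pos != "NOUN"]
    nouns = [text for text, pos in tagged if pos == "NOUN"]
    if not nouns:  # no noun found: nothing to pair with
        return []
    return [f"{word} {nouns[-1]}" for word in not_nouns]


# Hypothetical tags, chosen by hand for illustration (not real tagger output).
tagged = [("thai", "ADJ"), ("chicken", "NOUN"), ("curry", "VERB")]
print(pair_with_last_noun(tagged))
# ['thai chicken', 'curry chicken']
```

Because every non-noun is attached to the last noun regardless of position, a mis-tagged head noun ends up on the left of the pair, as with "curry" here.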

lenz · Answer

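With NLTK, you can achieve this in a few steps:

PoS-tag the sequences.
Generate the desired n-grams (your examples contain no trigrams, only skip-grams, which can be generated from trigrams by dropping the middle token).
Discard all n-grams that don't match the pattern JJ NN.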

Example:

import nltk

def jjnn_pairs(phrase):
    '''
    Iterate over pairs of JJ-NN.
    '''
    tagged = nltk.pos_tag(nltk.word_tokenize(phrase))
    for ngram in ngramise(tagged):
        tokens, tags = zip(*ngram)
        if tags == ('JJ', 'NN'):
            yield tokens

def ngramise(sequence):
    '''
    Iterate over bigrams and 1,2-skip-grams.
    '''
    # adjacent pairs
    for bigram in nltk.ngrams(sequence, 2):
        yield bigram
    # skip-grams: first and last token of each trigram
    for trigram in nltk.ngrams(sequence, 3):
        yield trigram[0], trigram[2]

Extend the pattern ('JJ', 'NN') and the set of n-grams to suit your needs.
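
One way to extend the pattern is to check against a set of allowed tag pairs instead of a single tuple. The sketch below is an assumption about how that might look, not part of the answer: `ALLOWED_PATTERNS` and `matching_pairs` are hypothetical names, the n-gram helper is rewritten in plain Python so the example runs without NLTK, and the input is tagged by hand rather than by a real tagger.

```python
def ngramise(sequence):
    """Yield adjacent bigrams, then skip-grams that drop one middle token."""
    seq = list(sequence)
    yield from zip(seq, seq[1:])  # adjacent pairs
    yield from zip(seq, seq[2:])  # first and last token of each trigram


# Hypothetical extension of the single ('JJ', 'NN') pattern.
ALLOWED_PATTERNS = {("JJ", "NN"), ("VBD", "NN"), ("NN", "NN")}


def matching_pairs(tagged, patterns=ALLOWED_PATTERNS):
    """Return the token pairs whose tag pair is in `patterns`."""
    result = []
    for ngram in ngramise(tagged):
        tokens, tags = zip(*ngram)
        if tags in patterns:
            result.append(" ".join(tokens))
    return result


# Hand-tagged input for illustration; "fried" is given VBD on purpose.
tagged = [("spicy", "JJ"), ("fried", "VBD"), ("chicken", "NN")]
print(matching_pairs(tagged))
# ['fried chicken', 'spicy chicken']
```

Widening the pattern set trades precision for recall: ("NN", "NN") keeps "chili pork" when "chili" is tagged as a noun, but it would also accept "sweet chili" through the ("JJ", "NN") pattern.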

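I don't think parsing is necessary. The major problem with this approach, however, is that most PoS taggers will probably not tag everything exactly the way you want. For example, the default PoS tagger of my NLTK installation tagged "chili" as NN, not JJ, and "fried" got VBD. Parsing won't help you with that, though!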

alvas · Answer

Something like this:

>>> from nltk import bigrams
>>> text = """thai iced tea
... spicy fried chicken
... sweet chili pork
... thai chicken curry"""
>>> lines = map(str.split, text.split('\n'))
>>> for line in lines:
...     ", ".join([" ".join(bi) for bi in bigrams(line)])
...
'thai iced, iced tea'
'spicy fried, fried chicken'
'sweet chili, chili pork'
'thai chicken, chicken curry'
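
Plain bigrams only cover adjacent words, so pairs from the question such as 'thai tea' or 'thai curry' are missed. One tagger-free alternative, sketched here as an assumption rather than as part of the answer, is to take every in-order pair of words with `itertools.combinations`. The helper name `ordered_pairs` is hypothetical, and the approach also yields pairs the question does not list, such as 'thai iced'.

```python
from itertools import combinations


def ordered_pairs(phrase):
    """Return every pair of words from `phrase`, keeping the original word order."""
    words = phrase.split()
    return [" ".join(pair) for pair in combinations(words, 2)]


print(", ".join(ordered_pairs("thai chicken curry")))
# thai chicken, thai curry, chicken curry
```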

Alternatively, use colibricore https://proycon.github.io/colibri-core/doc/#installation ;P