How to find the shortest dependency path between two words in Python?

Question

I am trying to find the dependency path between two words in Python, given a dependency tree.

For the sentence

Robots in popular culture are there to remind us of the awesomeness of unbound human agency.

I used practNLPTools ( https://github.com/biplab-iitb/practNLPTools ) to get the following dependency parse result:

nsubj(are-5, Robots-1)
xsubj(remind-8, Robots-1)
amod(culture-4, popular-3)
prep_in(Robots-1, culture-4)
root(ROOT-0, are-5)
advmod(are-5, there-6)
aux(remind-8, to-7)
xcomp(are-5, remind-8)
dobj(remind-8, us-9)
det(awesomeness-12, the-11)
prep_of(remind-8, awesomeness-12)
amod(agency-16, unbound-14)
amod(agency-16, human-15)
prep_of(awesomeness-12, agency-16)

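It can also be visualized as follows (picture taken from https://demos.explosion.ai/displacy/ ).

The path length between "robots" and "are" is 1, and the path length between "robots" and "awesomeness" is 4.

My question is: given the dependency parse result above, how can I get the dependency path, or the dependency path length, between two words?

From my search results so far, would nltk's ParentedTree help?

Thanks!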

HugoMailhot · Accepted Answer

Your problem can easily be framed as a graph problem in which you have to find the shortest path between two nodes.
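To make the graph framing concrete, here is a minimal sketch of a breadth-first search over an undirected, unweighted edge list, using only the standard library. The function name `bfs_path_length` and the five-edge subset of the parse are illustrative assumptions; this is not how networkx is implemented, just the same idea in small.

```python
from collections import deque


def bfs_path_length(edges, source, target):
    """Return the number of edges on a shortest path, or None if unreachable.

    Hypothetical helper for illustration: edges are treated as undirected
    and unweighted, so breadth-first search finds a shortest path.
    """
    neighbors = {}
    for head, dependent in edges:
        neighbors.setdefault(head, set()).add(dependent)
        neighbors.setdefault(dependent, set()).add(head)

    if source not in neighbors or target not in neighbors:
        return None

    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distances[node]
        for nxt in neighbors[node]:
            if nxt not in distances:
                distances[nxt] = distances[node] + 1
                queue.append(nxt)
    return None


# A hand-picked subset of the (head, dependent) pairs from the question.
edges = [
    ('are-5', 'Robots-1'),
    ('remind-8', 'Robots-1'),
    ('are-5', 'remind-8'),
    ('remind-8', 'awesomeness-12'),
    ('awesomeness-12', 'agency-16'),
]

print(bfs_path_length(edges, 'Robots-1', 'awesomeness-12'))
print(bfs_path_length(edges, 'Robots-1', 'agency-16'))
```

Ignoring edge direction is a deliberate choice here: a dependency path between two words usually climbs from one word to a common ancestor and then descends to the other.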

To convert your dependency parse into a graph, you first have to deal with the fact that it comes as a string. You want to get this:

'nsubj(are-5, Robots-1)\nxsubj(remind-8, Robots-1)\namod(culture-4, popular-3)\nprep_in(Robots-1, culture-4)\nroot(ROOT-0, are-5)\nadvmod(are-5, there-6)\naux(remind-8, to-7)\nxcomp(are-5, remind-8)\ndobj(remind-8, us-9)\ndet(awesomeness-12, the-11)\nprep_of(remind-8, awesomeness-12)\namod(agency-16, unbound-14)\namod(agency-16, human-15)\nprep_of(awesomeness-12, agency-16)'

to look like this:

[('are-5', 'Robots-1'), ('remind-8', 'Robots-1'), ('culture-4', 'popular-3'), ('Robots-1', 'culture-4'), ('ROOT-0', 'are-5'), ('are-5', 'there-6'), ('remind-8', 'to-7'), ('are-5', 'remind-8'), ('remind-8', 'us-9'), ('awesomeness-12', 'the-11'), ('remind-8', 'awesomeness-12'), ('agency-16', 'unbound-14'), ('agency-16', 'human-15'), ('awesomeness-12', 'agency-16')]

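This way you can feed the tuple list to a graph constructor from the networkx module, which analyzes the list and builds the graph for you. networkx also provides a method that returns the length of the shortest path between two given nodes.

Necessary imports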

import re
import networkx as nx
from practnlptools.tools import Annotator

How to get your string into the desired tuple list format

annotator = Annotator()
text = """Robots in popular culture are there to remind us of the awesomeness of unbound human agency."""
dep_parse = annotator.getAnnotations(text, dep_parse=True)['dep_parse']

dp_list = dep_parse.split('\n')
# Capture the two arguments of each "relation(head-i, dependent-j)" line
pattern = re.compile(r'.+?\((.+?), (.+?)\)')
edges = []
for dep in dp_list:
    m = pattern.search(dep)
    edges.append((m.group(1), m.group(2)))
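If practNLPTools is not installed, the parsing step can still be tried on the parse string from the question. The sketch below is a variant under that assumption: it hard-codes the string instead of calling the annotator, and it also captures the relation name (the names `triples` and `pattern` are illustrative), which can be handy if some relations need to be filtered later.

```python
import re

# The dependency parse from the question, hard-coded so no parser is needed.
dep_parse = (
    'nsubj(are-5, Robots-1)\n'
    'xsubj(remind-8, Robots-1)\n'
    'amod(culture-4, popular-3)\n'
    'prep_in(Robots-1, culture-4)\n'
    'root(ROOT-0, are-5)\n'
    'advmod(are-5, there-6)\n'
    'aux(remind-8, to-7)\n'
    'xcomp(are-5, remind-8)\n'
    'dobj(remind-8, us-9)\n'
    'det(awesomeness-12, the-11)\n'
    'prep_of(remind-8, awesomeness-12)\n'
    'amod(agency-16, unbound-14)\n'
    'amod(agency-16, human-15)\n'
    'prep_of(awesomeness-12, agency-16)'
)

# Groups: relation name, head, dependent.
pattern = re.compile(r'(\w+)\((.+?), (.+?)\)')

triples = [pattern.match(line).groups() for line in dep_parse.split('\n')]
edges = [(head, dependent) for _, head, dependent in triples]

print(triples[3])
print(edges[:3])
```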

How to build your graph

graph = nx.Graph(edges) # Well that was easy

How to calculate the shortest path length

print(nx.shortest_path_length(graph, source='Robots-1', target='awesomeness-12'))

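This script reveals that the shortest path given by this dependency parse is actually of length 2, since you can get from Robots-1 to awesomeness-12 by going through remind-8:

1. xsubj(remind-8, Robots-1)
2. prep_of(remind-8, awesomeness-12)

If you don't like this result, you might want to filter out some dependencies; in this case, that means not adding the xsubj dependencies to the graph.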
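One way the filtering might look, as a sketch rather than a definitive recipe: keep the relation name while parsing, skip any relation in an exclusion set, and store the relation as an edge attribute. The names `EXCLUDED` and `rel` are assumptions made for this example, and the parse string is hard-coded from the question.

```python
import re
import networkx as nx

dep_parse = """\
nsubj(are-5, Robots-1)
xsubj(remind-8, Robots-1)
amod(culture-4, popular-3)
prep_in(Robots-1, culture-4)
root(ROOT-0, are-5)
advmod(are-5, there-6)
aux(remind-8, to-7)
xcomp(are-5, remind-8)
dobj(remind-8, us-9)
det(awesomeness-12, the-11)
prep_of(remind-8, awesomeness-12)
amod(agency-16, unbound-14)
amod(agency-16, human-15)
prep_of(awesomeness-12, agency-16)
"""

EXCLUDED = {'xsubj'}  # relations that should not become graph edges
pattern = re.compile(r'(\w+)\((.+?), (.+?)\)')

graph = nx.Graph()
for line in dep_parse.strip().splitlines():
    rel, head, dependent = pattern.match(line).groups()
    if rel in EXCLUDED:
        continue
    # Keep the relation name on the edge so the path can be labeled later.
    graph.add_edge(head, dependent, rel=rel)

path = nx.shortest_path(graph, source='Robots-1', target='awesomeness-12')
labels = [graph[u][v]['rel'] for u, v in zip(path, path[1:])]

print(path)
print(len(path) - 1)
print(labels)
```

With xsubj removed, the path goes through are-5 and remind-8 and has length 3. That is still one less than the 4 mentioned in the question, which appears to be because this parse folds the preposition "of" into the prep_of relation instead of treating it as a separate node.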

Franck Dernoncourt · Answer

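HugoMailhot's answer is great. I'll write something similar for spacy users who want to find the shortest dependency path between two words (HugoMailhot's answer relies on practNLPTools).

The sentence:

Robots in popular culture are there to remind us of the awesomeness of unbound human agency.

has the following dependency tree:

Here is the code to find the shortest dependency path between two words: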

import networkx as nx
import spacy

nlp = spacy.load('en')

# https://spacy.io/docs/usage/processing-text
document = nlp(u'Robots in popular culture are there to remind us of the awesomeness of unbound human agency.', parse=True)

print('document: {0}'.format(document))

# Load spacy's dependency tree into a networkx graph
edges = []
for token in document:
    # FYI https://spacy.io/docs/api/token
    for child in token.children:
        edges.append(('{0}-{1}'.format(token.lower_, token.i),
                      '{0}-{1}'.format(child.lower_, child.i)))

graph = nx.Graph(edges)

# https://networkx.github.io/documentation/networkx-1.10/reference/algorithms.shortest_paths.html
print(nx.shortest_path_length(graph, source='robots-0', target='awesomeness-11'))
print(nx.shortest_path(graph, source='robots-0', target='awesomeness-11'))
print(nx.shortest_path(graph, source='robots-0', target='agency-15'))

Output:

4
['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11']
['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11', 'of-12', 'agency-15']

To install spacy and networkx:

sudo pip install networkx
sudo pip install spacy
sudo python -m spacy.en.download parser # will take 0.5 GB

Some benchmarks regarding spacy's dependency parsing: https://spacy.io/docs/api/

Franck Dernoncourt · Answer

This answer relies on Stanford CoreNLP to obtain the dependency tree of a sentence. It borrows quite a bit of code from HugoMailhot's answer for the networkx part.

Before running the code, one needs to:

sudo pip install pycorenlp (Python interface for Stanford CoreNLP)
Download Stanford CoreNLP

Start a Stanford CoreNLP server as follows:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 50000

Then one can run the following code to find the shortest dependency path between two words:

import networkx as nx
from pycorenlp import StanfordCoreNLP
from pprint import pprint

nlp = StanfordCoreNLP('http://localhost:{0}'.format(9000))

def get_stanford_annotations(text, port=9000,
                             annotators='tokenize,ssplit,pos,lemma,depparse,parse'):
    output = nlp.annotate(text, properties={
        "timeout": "10000",
        "ssplit.newlineIsSentenceBreak": "two",
        'annotators': annotators,
        'outputFormat': 'json'
    })
    return output

# The code expects the document to contain exactly one sentence.
# Note the trailing space before the line continuation, so that the two
# string pieces do not run together as "ofunbound".
document = 'Robots in popular culture are there to remind us of the awesomeness of '\
           'unbound human agency.'
print('document: {0}'.format(document))

# Parse the text
annotations = get_stanford_annotations(document, port=9000,
                                       annotators='tokenize,ssplit,pos,lemma,depparse')
tokens = annotations['sentences'][0]['tokens']

# Load Stanford CoreNLP's dependency tree into a networkx graph
edges = []
dependencies = {}
for edge in annotations['sentences'][0]['basic-dependencies']:
    edges.append((edge['governor'], edge['dependent']))
    # Key each dependency by its (smaller index, larger index) pair
    dependencies[(min(edge['governor'], edge['dependent']),
                  max(edge['governor'], edge['dependent']))] = edge

graph = nx.Graph(edges)
#pprint(dependencies)
#print('edges: {0}'.format(edges))

# Find the shortest path
token1 = 'Robots'
token2 = 'awesomeness'
for token in tokens:
    if token1 == token['originalText']:
        token1_index = token['index']
    if token2 == token['originalText']:
        token2_index = token['index']

path = nx.shortest_path(graph, source=token1_index, target=token2_index)
print('path: {0}'.format(path))

for token_id in path:
    # Token indices are 1-based; the tokens list is 0-based
    token = tokens[token_id-1]
    token_text = token['originalText']
    print('Node {0}\ttoken_text: {1}'.format(token_id, token_text))

The output is:

document: Robots in popular culture are there to remind us of the awesomeness of unbound human agency.
path: [1, 5, 8, 12]
Node 1 token_text: Robots
Node 5 token_text: are
Node 8 token_text: remind
Node 12 token_text: awesomeness
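The code above fills a `dependencies` dictionary but only uses it in a commented-out `pprint`. As a sketch of how it could be used, the snippet below labels each step of the path with its relation and direction. It runs without a server because `basic_dependencies` is a hand-written stand-in for `annotations['sentences'][0]['basic-dependencies']`. The relation names in it (`nsubj`, `xcomp`, `nmod`) and the `dep` key are assumptions for illustration, not captured parser output, and `describe_path` is a hypothetical helper.

```python
# Hand-written stand-in for the server response; values are illustrative only.
basic_dependencies = [
    {'dep': 'ROOT', 'governor': 0, 'dependent': 5},
    {'dep': 'nsubj', 'governor': 5, 'dependent': 1},
    {'dep': 'xcomp', 'governor': 5, 'dependent': 8},
    {'dep': 'nmod', 'governor': 8, 'dependent': 12},
]

# Same keying scheme as in the answer: (smaller index, larger index).
dependencies = {}
for edge in basic_dependencies:
    key = (min(edge['governor'], edge['dependent']),
           max(edge['governor'], edge['dependent']))
    dependencies[key] = edge


def describe_path(path, dependencies):
    """Return (from, to, relation, direction) for each step of the path.

    'up' means the step goes from a dependent to its governor,
    'down' means it goes from a governor to its dependent.
    """
    steps = []
    for a, b in zip(path, path[1:]):
        edge = dependencies[(min(a, b), max(a, b))]
        direction = 'up' if edge['dependent'] == a else 'down'
        steps.append((a, b, edge['dep'], direction))
    return steps


path = [1, 5, 8, 12]  # the path printed by the answer's code
for step in describe_path(path, dependencies):
    print(step)
```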

Note that Stanford CoreNLP can be tested online: http://nlp.stanford.edu:8080/parser/index.jsp

This answer was tested with Stanford CoreNLP 3.6.0, pycorenlp 0.3.0 and python 3.5 x64 on Windows 7 SP1 x64 Ultimate.