Pythonでウィキペディアのカテゴリをグループ化する方法は？

Question

データセットの各概念について、対応するウィキペディアのカテゴリを保存しました。たとえば、次の5つの概念とそれに対応するWikipediaのカテゴリを考えてみましょう。

高トリグリセリド血症：['Category:Lipid metabolism disorders', 'Category:Medical conditions related to obesity']
酵素阻害剤：['Category:Enzyme inhibitors', 'Category:Medicinal chemistry', 'Category:Metabolism']
バイパス手術：['Category:Surgery stubs', 'Category:Surgical procedures and techniques']
パース：['Category:1829 establishments in Australia', 'Category:Australian capital cities', 'Category:Metropolitan areas of Australia', 'Category:Perth, Western Australia', 'Category:Populated places established in 1829']
気候：['Category:Climate', 'Category:Climatology', 'Category:Meteorological concepts']

ご覧のとおり、最初の3つの概念は医療分野に属しています（残りの2つの用語は医療用語ではありません）。

より正確には、私の概念を医療用と非医療用に分けたいです。ただし、カテゴリーだけで概念を分けることは非常に困難です。たとえば、2つの概念enzyme inhibitorとbypass surgeryは医療分野にありますが、それらのカテゴリは互いに非常に異なります。

したがって、カテゴリのparent categoryを取得する方法があるかどうかを知りたいです（たとえば、enzyme inhibitorおよびbypass surgeryのカテゴリはmedical親に属しています]カテゴリー）

現在、pymediawikiとpywikibotを使用しています。ただし、私はこれら2つのライブラリーだけに制限されておらず、他のライブラリーを使用するソリューションも用意しています。

[〜＃〜]編集[〜＃〜]

@IlmariKaronenによって示唆されているように、私はcategories of categoriesも使用しています。得られた結果は次のとおりです（Tcategoryの近くの小さなフォントはcategories of the categoryです））。

ただし、これらのカテゴリの詳細を使用して、特定の用語が医学的であるか非医学的であるかを判断する方法はまだ見つかりませんでした。

さらに、@ IlmariKaronenで指摘されているように、Wikiprojectを使用すると詳細が表示される可能性があります。しかし、Medicine wikiprojectにはすべての医学用語が含まれていないようです。したがって、他のウィキプロジェクトもチェックする必要があります。

編集：ウィキペディアの概念からカテゴリを抽出する私の現在のコードは次のとおりです。これは、次のようにpywikibotまたはpymediawikiを使用して行うことができます。

ライブラリの使用pymediawiki

メディアウィキをpwとしてインポート
```
p = wikipedia.page('enzyme inhibitor') print(p.categories) 
```

ライブラリの使用pywikibot

import pywikibot as pw site = pw.Site('en', 'wikipedia') print([ cat.title() for cat in pw.Page(site, 'support-vector machine').categories() if 'hidden' not in cat.categoryinfo ])

カテゴリのカテゴリも、@ IlmariKaronenの回答に示されているのと同じ方法で実行できます。

テストの概念のより長いリストを探している場合は、以下でより多くの例を挙げました。

['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'newyork', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']

非常に長いリストについては、以下のリンクを確認してください。 https://docs.google.com/document/d/1BYllMyDlw-Rb4uMh89VjLml2Bl9Y7oUlopM-Z4F6pN0/edit?usp=sharing

注：ソリューションが100％機能することを期待していません（提案されたアルゴリズムが私にとって十分な医療概念の多くを検出できる場合）

必要に応じて詳細をお知らせいたします。

Szymon Maszke · Accepted Answer

ソリューションの概要

さて、私は複数の方向から問題に取り組みます。ここにいくつかの素晴らしい提案があります。もし私があなたなら、それらのアプローチのアンサンブルを使用します（大多数の投票、バイナリケースの分類子の50％以上で合意されているラベルの予測）。

以下のアプローチを考えています：

アクティブラーニング（以下に私が提供するアプローチ例）
MediaWiki backlinks 回答として提供 @ TavoGC
[〜＃〜] sparql [〜＃〜] @ Stanislav Kralin および/または-によって質問へのコメントとして提供される祖先カテゴリ親カテゴリ @ Meena Nagarajan によって提供されます（これらの2つは、違いに基づいて単独でアンサンブルになる可能性がありますが、そのためには、両方の作成者に連絡して結果を比較する必要があります）。

この方法では、3つのうち2つは、特定の概念が医療の概念であることに同意する必要があります。これにより、エラーの可能性がさらに最小限に抑えられます。

その間、私は @ ananand_v.singh によって提示されたagainstアプローチを this answer で主張します。：

距離メトリックユークリッドではない、コサイン類似性ははるかに優れたメトリックです（たとえば spaCy で使用されます）。ベクトル（Word2vecまたはGloVeがトレーニングされた方法です）
正しく理解すれば、多くの人工クラスターが作成されますが、必要なのは、薬と薬以外の2つだけです。さらに、薬の重心は、薬自体を中心とはしていません。これは、重心が薬から遠くに移動し、computerやhumanなどの言葉（または薬に対するあなたの意見に合わない他の言葉）が入る可能性があるなど、追加の問題を引き起こしますクラスター。
結果を評価することは困難です。さらに、問題は厳密に主観的です。さらに、単語ベクトルは視覚化して理解するのが難しい（PCA/TSNE /同様の多くの単語を使用して低次元[2D/3D]にキャストすると、まったく意味のない結果が得られます[そうですね、PCAで試してみましたより長いデータセットの約5％の説明付き分散が得られます。

上記で強調した問題に基づいて、私は active learning を使用した解決策を考え出しました。これは、そのような問題へのかなり忘れられたアプローチです。

アクティブラーニングアプローチ

この機械学習のサブセットでは、正確なアルゴリズムを見つけるのに苦労した場合（用語がmedicalカテゴリの一部であるとはどういう意味かなど）、人間の「専門家」に質問します（そうではありません）実際にエキスパートである必要はありません）いくつかの答えを提供します。

知識エンコーディング

anand_v.singh が指摘したように、Wordベクターは最も有望なアプローチの1つであり、ここでもそれを使用します（ただし、IMOははるかにクリーンで簡単な方法です）。

私の回答では彼のポイントを繰り返すつもりはないので、2セントを追加します。

しない現在利用可能な最先端の技術として、文脈に応じたWord埋め込みを使用（例 [~~~~ bert [〜＃〜] ）
非表示の概念がいくつあるかを確認します（たとえば、ゼロのベクトルとして表されます）。それはチェックされるべきであり（そして私のコードでチェックされます、時が来たらさらに議論があります）、それらのほとんどが存在する埋め込みを使用することができます。

spaCyを使用した類似性の測定

このクラスは、spaCyのGloVe Wordベクトルとしてエンコードされたmedicineと他のすべての概念との類似性を測定します。

class Similarity: def __init__(self, centroid, nlp, n_threads: int, batch_size: int): # In our case it will be medicine self.centroid = centroid # spaCy's Language model (english), which will be used to return similarity to # centroid of each concept self.nlp = nlp self.n_threads: int = n_threads self.batch_size: int = batch_size self.missing: typing.List[int] = [] def __call__(self, concepts): concepts_similarity = [] # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL) for i, concept in enumerate( self.nlp.pipe( concepts, n_threads=self.n_threads, batch_size=self.batch_size ) ): if concept.has_vector: concepts_similarity.append(self.centroid.similarity(concept)) else: # If document has no vector, it's assumed to be totally dissimilar to centroid concepts_similarity.append(-1) self.missing.append(i) return np.array(concepts_similarity)

このコードは、重心との類似性を測定する各概念の数値を返します。さらに、それはそれらの表現を欠いている概念のインデックスを記録します。次のように呼ばれるかもしれません：

import json import typing import numpy as np import spacy nlp = spacy.load("en_vectors_web_lg") centroid = nlp("medicine") concepts = json.load(open("concepts_new.txt")) concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)( concepts )

new_concepts.jsonの代わりにデータを使用できます。

spacy.load を見て、私が en_vectors_web_lg を使用したことに注意してください。これは685.000の一意のWordベクター（大量）で構成されており、箱から出して使用しても問題ない場合があります。 spaCyをインストールした後、個別にダウンロードする必要があります。詳細については、上記のリンクを参照してください。

その他複数の重心語を使用できます。 diseaseやhealthのような単語を追加し、それらのWordベクトルを平均化します。それがあなたのケースに良い影響を与えるかどうかはわかりません。

その他の可能性は、複数の重心を使用し、各概念と複数の重心間の類似性を計算することです。そのような場合、いくつかのしきい値がある場合があります。これにより、いくつかの false positives が削除される可能性がありますが、medicineに類似していると見なすことができるいくつかの用語が見落とされる場合があります。さらに、それはケースをさらに複雑にしますが、結果が満足のいくものではない場合は、上記の2つのオプションを検討する必要があります（これらのオプションの場合のみ、事前に考えずにこのアプローチに飛び込まないでください）。

これで、コンセプトの類似性の大まかな基準がわかりました。しかし、特定の概念が医学に対して0.1の正の類似性を持つことはどういう意味ですか？医療として分類すべきコンセプトですか？それとも、それはすでに遠すぎますか？

専門家に聞く

しきい値（それ以下では非医療と見なされます）を取得するには、人間にいくつかの概念を分類するように人間に依頼するのが最も簡単です（それがアクティブラーニングの目的です）。ええ、私はそれが非常に単純な形の能動学習であることを知っていますが、とにかくそれをそのように考えるでしょう。

最適なしきい値（または反復の最大数）に到達するまで、概念を分類するように人間に要求するsklearn-likeインターフェイスを持つクラスを作成しました。

class ActiveLearner: def __init__( self, concepts, concepts_similarity, max_steps: int, samples: int, step: float = 0.05, change_multiplier: float = 0.7, ): sorting_indices = np.argsort(-concepts_similarity) self.concepts = concepts[sorting_indices] self.concepts_similarity = concepts_similarity[sorting_indices] self.max_steps: int = max_steps self.samples: int = samples self.step: float = step self.change_multiplier: float = change_multiplier # We don't have to ask experts for the same concepts self._checked_concepts: typing.Set[int] = set() # Minimum similarity between vectors is -1 self._min_threshold: float = -1 # Maximum similarity between vectors is 1 self._max_threshold: float = 1 # Let's start from the highest similarity to ensure minimum amount of steps self.threshold_: float = 1

samples引数は、各反復中にエキスパートに表示される例の数を示します（これは最大であり、サンプルがすでに要求されている場合、または表示するのに十分な数がない場合、返される数は少なくなります）。
stepは、各反復におけるしきい値の低下（完全な類似性を意味する1から開始）を表します。
change_multiplier-専門家が概念に関連性がない（または概念の複数が返されるため、ほとんど関係ない）場合、ステップにこの浮動小数点数が乗算されます。各反復でのstepの変更間の正確なしきい値を正確に示すために使用されます。
概念はそれらの類似性に基づいてソートされます（概念が類似しているほど高くなります）

以下の機能は、専門家に意見を求め、彼の回答に基づいて最適なしきい値を見つけます。

def _ask_expert(self, available_concepts_indices): # Get random concepts (the ones above the threshold) concepts_to_show = set( np.random.choice( available_concepts_indices, len(available_concepts_indices) ).tolist() ) # Remove those already presented to an expert concepts_to_show = concepts_to_show - self._checked_concepts self._checked_concepts.update(concepts_to_show) # Print message for an expert and concepts to be classified if concepts_to_show: print("
Are those concepts related to medicine?
") print( "
".join( f"{i}. {concept}" for i, concept in enumerate( self.concepts[list(concepts_to_show)[: self.samples]] ) ), "
", ) return input("[y]es / [n]o / [any]quit ") return "y"

質問の例は次のようになります：

Are those concepts related to medicine? 0. anesthetic drug 1. child and adolescent psychiatry 2. tertiary care center 3. sex therapy 4. drug design 5. pain disorder 6. psychiatric rehabilitation 7. combined oral contraceptive 8. family practitioner committee 9. cancer family syndrome 10. social psychology 11. drug sale 12. blood system [y]es / [n]o / [any]quit y

...エキスパートからの回答を解析しています：

# True - keep asking, False - stop the algorithm def _parse_expert_decision(self, decision) -> bool: if decision.lower() == "y": # You can't go higher as current threshold is related to medicine self._max_threshold = self.threshold_ if self.threshold_ - self.step < self._min_threshold: return False # Lower the threshold self.threshold_ -= self.step return True if decision.lower() == "n": # You can't got lower than this, as current threshold is not related to medicine already self._min_threshold = self.threshold_ # Multiply threshold to pinpoint exact spot self.step *= self.change_multiplier if self.threshold_ + self.step < self._max_threshold: return False # Lower the threshold self.threshold_ += self.step return True return False

そして最後にActiveLearnerのコードコード全体が、エキスパートに応じて類似性の最適なしきい値を見つけます。

class ActiveLearner: def __init__( self, concepts, concepts_similarity, samples: int, max_steps: int, step: float = 0.05, change_multiplier: float = 0.7, ): sorting_indices = np.argsort(-concepts_similarity) self.concepts = concepts[sorting_indices] self.concepts_similarity = concepts_similarity[sorting_indices] self.samples: int = samples self.max_steps: int = max_steps self.step: float = step self.change_multiplier: float = change_multiplier # We don't have to ask experts for the same concepts self._checked_concepts: typing.Set[int] = set() # Minimum similarity between vectors is -1 self._min_threshold: float = -1 # Maximum similarity between vectors is 1 self._max_threshold: float = 1 # Let's start from the highest similarity to ensure minimum amount of steps self.threshold_: float = 1 def _ask_expert(self, available_concepts_indices): # Get random concepts (the ones above the threshold) concepts_to_show = set( np.random.choice( available_concepts_indices, len(available_concepts_indices) ).tolist() ) # Remove those already presented to an expert concepts_to_show = concepts_to_show - self._checked_concepts self._checked_concepts.update(concepts_to_show) # Print message for an expert and concepts to be classified if concepts_to_show: print("
Are those concepts related to medicine?
") print( "
".join( f"{i}. {concept}" for i, concept in enumerate( self.concepts[list(concepts_to_show)[: self.samples]] ) ), "
", ) return input("[y]es / [n]o / [any]quit ") return "y" # True - keep asking, False - stop the algorithm def _parse_expert_decision(self, decision) -> bool: if decision.lower() == "y": # You can't go higher as current threshold is related to medicine self._max_threshold = self.threshold_ if self.threshold_ - self.step < self._min_threshold: return False # Lower the threshold self.threshold_ -= self.step return True if decision.lower() == "n": # You can't got lower than this, as current threshold is not related to medicine already self._min_threshold = self.threshold_ # Multiply threshold to pinpoint exact spot self.step *= self.change_multiplier if self.threshold_ + self.step < self._max_threshold: return False # Lower the threshold self.threshold_ += self.step return True return False def fit(self): for _ in range(self.max_steps): available_concepts_indices = np.nonzero( self.concepts_similarity >= self.threshold_ )[0] if available_concepts_indices.size != 0: decision = self._ask_expert(available_concepts_indices) if not self._parse_expert_decision(decision): break else: self.threshold_ -= self.step return self

全体として、いくつかの質問に手動で回答する必要がありますが、このアプローチはway more私の意見では正確です。

さらに、すべてのサンプルを実行する必要はありません。サンプルのごく一部です。医療用語を構成するサンプルの数（40の医療サンプルと10の非医療サンプルを表示するかどうかを判断する必要があるかどうか）を決定できます。これにより、このアプローチを好みに合わせて調整できます。外れ値がある場合（たとえば、50のうちの1つのサンプルが非医学的である場合）、しきい値はまだ有効であると考えます。

もう一度：間違った分類の可能性を最小限に抑えるために、このアプローチは他のアプローチと組み合わせる必要があります。

分類子

エキスパートからしきい値を取得すると、分類は瞬時に行われます。以下に分類の簡単なクラスを示します。

class Classifier: def __init__(self, centroid, threshold: float): self.centroid = centroid self.threshold: float = threshold def predict(self, concepts_pipe): predictions = [] for concept in concepts_pipe: predictions.append(self.centroid.similarity(concept) > self.threshold) return predictions

簡潔にするため、ここに最終的なソースコードを示します。

import json import typing import numpy as np import spacy class Similarity: def __init__(self, centroid, nlp, n_threads: int, batch_size: int): # In our case it will be medicine self.centroid = centroid # spaCy's Language model (english), which will be used to return similarity to # centroid of each concept self.nlp = nlp self.n_threads: int = n_threads self.batch_size: int = batch_size self.missing: typing.List[int] = [] def __call__(self, concepts): concepts_similarity = [] # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL) for i, concept in enumerate( self.nlp.pipe( concepts, n_threads=self.n_threads, batch_size=self.batch_size ) ): if concept.has_vector: concepts_similarity.append(self.centroid.similarity(concept)) else: # If document has no vector, it's assumed to be totally dissimilar to centroid concepts_similarity.append(-1) self.missing.append(i) return np.array(concepts_similarity) class ActiveLearner: def __init__( self, concepts, concepts_similarity, samples: int, max_steps: int, step: float = 0.05, change_multiplier: float = 0.7, ): sorting_indices = np.argsort(-concepts_similarity) self.concepts = concepts[sorting_indices] self.concepts_similarity = concepts_similarity[sorting_indices] self.samples: int = samples self.max_steps: int = max_steps self.step: float = step self.change_multiplier: float = change_multiplier # We don't have to ask experts for the same concepts self._checked_concepts: typing.Set[int] = set() # Minimum similarity between vectors is -1 self._min_threshold: float = -1 # Maximum similarity between vectors is 1 self._max_threshold: float = 1 # Let's start from the highest similarity to ensure minimum amount of steps self.threshold_: float = 1 def _ask_expert(self, available_concepts_indices): # Get random concepts (the ones above the threshold) concepts_to_show = set( np.random.choice( available_concepts_indices, len(available_concepts_indices) ).tolist() ) # Remove those already presented to an expert concepts_to_show = concepts_to_show - self._checked_concepts self._checked_concepts.update(concepts_to_show) # Print message for an expert and concepts to be classified if concepts_to_show: print("
Are those concepts related to medicine?
") print( "
".join( f"{i}. {concept}" for i, concept in enumerate( self.concepts[list(concepts_to_show)[: self.samples]] ) ), "
", ) return input("[y]es / [n]o / [any]quit ") return "y" # True - keep asking, False - stop the algorithm def _parse_expert_decision(self, decision) -> bool: if decision.lower() == "y": # You can't go higher as current threshold is related to medicine self._max_threshold = self.threshold_ if self.threshold_ - self.step < self._min_threshold: return False # Lower the threshold self.threshold_ -= self.step return True if decision.lower() == "n": # You can't got lower than this, as current threshold is not related to medicine already self._min_threshold = self.threshold_ # Multiply threshold to pinpoint exact spot self.step *= self.change_multiplier if self.threshold_ + self.step < self._max_threshold: return False # Lower the threshold self.threshold_ += self.step return True return False def fit(self): for _ in range(self.max_steps): available_concepts_indices = np.nonzero( self.concepts_similarity >= self.threshold_ )[0] if available_concepts_indices.size != 0: decision = self._ask_expert(available_concepts_indices) if not self._parse_expert_decision(decision): break else: self.threshold_ -= self.step return self class Classifier: def __init__(self, centroid, threshold: float): self.centroid = centroid self.threshold: float = threshold def predict(self, concepts_pipe): predictions = [] for concept in concepts_pipe: predictions.append(self.centroid.similarity(concept) > self.threshold) return predictions if __name__ == "__main__": nlp = spacy.load("en_vectors_web_lg") centroid = nlp("medicine") concepts = json.load(open("concepts_new.txt")) concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)( concepts ) learner = ActiveLearner( np.array(concepts), concepts_similarity, samples=20, max_steps=50 ).fit() print(f"Found threshold {learner.threshold_}
") classifier = Classifier(centroid, learner.threshold_) pipe = nlp.pipe(concepts, n_threads=-1, batch_size=4096) predictions = classifier.predict(pipe) print( "
".join( f"{concept}: {label}" for concept, label in Zip(concepts[20:40], predictions[20:40]) ) )

いくつかの質問に回答した後、しきい値0.1（[-1, 0.1)は医療用と見なされ、[0.1, 1]は医療用と見なされます）で次の結果が得られました。

kartagener s syndrome: True summer season: True taq: False atypical neuroleptic: True anterior cingulate: False acute respiratory distress syndrome: True circularity: False mutase: False adrenergic blocking drug: True systematic desensitization: True the turning point: True 9l: False pyridazine: False bisoprolol: False trq: False propylhexedrine: False type 18: True darpp 32: False rickettsia conorii: False sport shoe: True

ご覧のとおり、このアプローチは完全とはほど遠いため、最後のセクションでは可能な改善について説明しました。

可能な改善

最初に述べたように、他の回答と混合した私のアプローチを使用すると、おそらくmedicineに属するsport shoeのようなアイデアが除外され、アクティブラーニングアプローチは、2つの間の引き分けの場合の決定的な投票になります上記のヒューリスティック。

アクティブラーニングアンサンブルを作成することもできます。 1つのしきい値（0.1など）の代わりに、それらの複数（増加または減少）を使用します。それらを0.1, 0.2, 0.3, 0.4, 0.5としましょう。

sport shoeが次のように各しきい値のTrue/Falseを取得するとします。

True True False False False、

過半数の投票を行う場合、2票のうち3票でnon-medicalとマークします。さらに、しきい値が厳しすぎると、しきい値を下回るしきい値が不満になる場合も軽減されます（True/FalseがTrue True True False Falseの場合）。

私が思いついた最後の可能な改善：上記のコードでは、概念を作成するWordベクトルの平均であるDocベクトルを使用しています。たとえば、1つの単語が欠落している（ゼロで構成されるベクトル）とすると、その単語はmedicineセントロイドからより遠くに押し出されます。あなたはそれを望まないかもしれません（ニッチな医学用語[gpvまたは他の略語]はそれらの表現を欠いているかもしれないので）、そのような場合、ゼロとは異なるベクトルのみを平均化できます。

この投稿はかなり長いので、質問がある場合は以下に投稿してください。

Ilmari Karonen · Answer

「したがって、カテゴリのparent categoryを取得する方法があるかどうか知りたい（たとえば、enzyme inhibitorとbypass surgeryのカテゴリはmedicalに属している親カテゴリ）"

MediaWikiカテゴリは、それ自体がWikiページです。「親カテゴリ」は、「子」カテゴリページが属するカテゴリにすぎません。したがって、他のWikiページのカテゴリを取得するのとまったく同じ方法で、カテゴリの親カテゴリを取得できます。

たとえば、 pymediawiki を使用します：

p = wikipedia.page('Category:Enzyme inhibitors') parents = p.categories

TavoGLC · Answer

Wikipediaのカテゴリを、各カテゴリに対して返されるmediawikiリンクとバックリンクで分類してみることができます

import re from mediawiki import MediaWiki #TermFind will search through a list a given term def TermFind(term,termList): responce=False for val in termList: if re.match('(.*)'+term+'(.*)',val): responce=True break return responce #Find if the links and backlinks lists contains a given term def BoundedTerm(wikiPage,term): aList=wikiPage.links bList=wikiPage.backlinks responce=False if TermFind(term,aList)==True and TermFind(term,bList)==True: responce=True return responce container=[] wikipedia = MediaWiki() for val in termlist: cpage=wikipedia.page(val) if BoundedTerm(cpage,'term')==True: container.append('medical') else: container.append('nonmedical')

アイデアは、ほとんどのカテゴリーで共有される用語を推測しようとすることです。私は、生物学、医学、および疾患を試し、良い結果を得ています。おそらく、BoundedTermsのmulpile呼び出しを使用して分類を行ったり、複数の用語を1回呼び出して結果を分類のために組み合わせたりすることができます。それが役に立てば幸い

anand_v.singh · Answer

NLPにはWord Vectorの概念があります。NLPは基本的に大量のテキストを調べて、単語を多次元ベクトルに変換し、それらのベクトル間の距離を短くし、それらの間の類似性を大きくします。ことは、多くの人々がすでにこのWordベクターを生成し、非常に寛容なライセンスの下でそれらを利用可能にしていることです。あなたの場合、あなたはWikipediaで作業していて、ここにWordベクターが存在します http://dumps.wikimedia.org /enwiki/latest/enwiki-latest-pages-articles.xml.bz2

ウィキペディアのコーパスからのほとんどの単語が含まれているため、これらはこのタスクに最も適していますが、それらがあなたに適していない場合、または将来削除される場合は、これらから使用できます。これを行うにはより良い方法があると言います。つまり、それらをテンソルフローのユニバーサル言語モデルembedモジュールに渡すことで、重い作業のほとんどを実行する必要がなく、その詳細を読むことができます- here。ウィキペディアのテキストダンプの後に置いたのは、医療サンプルを扱う場合、作業が少し難しいと言われているためです。この論文はそれに取り組むための解決策を提案していますが、私はそれを試したことがないので、その正確さを確信できません。

これで、テンソルフローからWord埋め込みを使用する方法が簡単になりました。

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2") embeddings = embed(["Input Text here as"," List of strings"]) session.run(embeddings)

Tensorflowに慣れておらず、このコードだけを実行しようとすると、いくつかの問題が発生する可能性がありますこのリンクをたどるこれを使用する方法を完全に説明し、そこからあなたはできるはずですこれを必要に応じて簡単に変更できます。

そうは言っても、最初に彼のtensorlfowの埋め込みモジュールとそれらの事前トレーニングされたWord埋め込みをチェックすることをお勧めします。それらが機能しない場合は、Wikimediaリンクをチェックしてください。それも機能しない場合は、ペーパーの概念に進んでください。リンクしました。この回答はNLPアプローチを説明しているので、100％正確ではないので、先に進む前にそのことを覚えておいてください。

グローブベクトル https://nlp.stanford.edu/projects/glove/

Facebookの高速テキスト： https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

またはこれ http://www.statmt.org/lm-benchmark/1-billion-Word-language-modeling-benchmark-r13output.tar.gz

これを実装する際に問題が発生した場合、colabチュートリアルに従って問題を以下の質問とコメントに追加してください。そこから先に進むことができます。

クラスタートピックに追加されたコードを編集

要約、単語ベクトルを使用するのではなく、要約文をエンコードしています

ファイルcontent.py

def AllTopics(): topics = []# list all your topics, not added here for space restricitons for i in range(len(topics)-1): yield topics[i]

ファイルの概要Generator.py

import wikipedia import pickle from content import Alltopics summary = [] failed = [] for topic in Alltopics(): try: summary.append(wikipedia.summary(Tuple((topic,str(topic))))) except Exception as e: failed.append(Tuple((topic,e))) with open("summary.txt", "wb") as fp: pickle.dump(summary , fp) with open('failed.txt', 'wb') as fp: pickle.dump('failed', fp)

ファイルSimilartiyCalculator.py

import tensorflow as tf import tensorflow_hub as hub import numpy as np import os import pandas as pd import re import pickle import sys from sklearn.cluster import AgglomerativeClustering from sklearn import metrics from scipy.cluster import hierarchy from scipy.spatial import distance_matrix try: with open("summary.txt", "rb") as fp: # Unpickling summary = pickle.load(fp) except Exception as e: print ('Cannot load the summary file, Please make sure that it exists, if not run Summary Generator first', e) sys.exit('Read the error message') module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3" embed = hub.Module(module_url) tf.logging.set_verbosity(tf.logging.ERROR) messages = [x[1] for x in summary] labels = [x[0] for x in summary] with tf.Session() as session: session.run([tf.global_variables_initializer(), tf.tables_initializer()]) message_embeddings = session.run(embed(messages)) # In message embeddings each vector is a second (1,512 vector) and is numpy.ndarray (noOfElemnts, 512) X = message_embeddings agl = AgglomerativeClustering(n_clusters=5, affinity='euclidean', memory=None, connectivity=None, compute_full_tree='auto', linkage='ward', pooling_func='deprecated') agl.fit(X) dist_matrix = distance_matrix(X,X) Z = hierarchy.linkage(dist_matrix, 'complete') dendro = hierarchy.dendrogram(Z) cluster_labels = agl.labels_

これは https://github.com/anandvsingh/WikipediaSimilarity のGitHubでもホストされています。ここでsimilarity.txtファイルとその他のファイルを見つけることができますが、私の場合は実行できませんでしたすべてのトピックについてですが、トピックの完全なリスト（リポジトリを直接複製してSummaryGenerator.pyを実行する）で実行することをお勧めします。uploadSimilarity.txt期待どおりの結果が得られない場合に備えて、プルリクエストを介して。また、可能であれば、message_embeddingsをトピックとしてcsvファイルにアップロードし、そこに埋め込みます。

編集後の変更2imilarityGeneratorを階層ベースのクラスタリングに切り替えました（凝集）樹形図の下部にタイトル名を保持し、そのためにここでのデンドログラムの定義、いくつかのサンプルの表示を確認しましたが、結果は非常によく見えます。n_clustersの値を変更して、モデルを微調整できます。注：これには、サマリージェネレータを再度実行する必要があります。ここからそれを取得できるはずです。あなたがしなければならないことは、n_clusterのいくつかの値を試して、すべての医学用語がグループ化されているのを確認してから、そのクラスターのcluster_labelを見つけることですこれで完了です。ここでは要約でグループ化しているため、クラスターはより正確になります。問題が発生した場合、または何かが理解できない場合は、以下にコメントしてください。

Londala · Answer

wikipediaライブラリは、wikipedia.WikipediaPage(page).categoriesが単純なリストを返すため、特定のページからカテゴリを抽出するのにも適しています。ライブラリでは、すべて同じタイトルのページを検索することもできます。

医学では、多くの重要な語根と接尾辞があるように思われるので、キーワードを見つけるというアプローチは、医学用語を見つけるための良いアプローチかもしれません。

import wikipedia def categorySorter(targetCats, pagesToCheck, mainCategory): targetList = [] nonTargetList = [] targetCats = [i.lower() for i in targetCats] print('Sorting pages...') print('Sorted:', end=' ', flush=True) for page in pagesToCheck: e = openPage(page) def deepList(l): for item in l: if item[1] == 'SUBPAGE_ID': deepList(item[2]) else: catComparator(item[0], item[1], targetCats, targetList, nonTargetList, pagesToCheck[-1]) if e[1] == 'SUBPAGE_ID': deepList(e[2]) else: catComparator(e[0], e[1], targetCats, targetList, nonTargetList, pagesToCheck[-1]) print() print() print('Results:') print(mainCategory, ': ', targetList, sep='') print() print('Non-', mainCategory, ': ', nonTargetList, sep='') def openPage(page): try: pageList = [page, wikipedia.WikipediaPage(page).categories] except wikipedia.exceptions.PageError as p: pageList = [page, 'NONEXIST_ID'] return except wikipedia.exceptions.DisambiguationError as e: pageCategories = [] for i in e.options: if '(disambiguation)' not in i: pageCategories.append(openPage(i)) pageList = [page, 'SUBPAGE_ID', pageCategories] return pageList finally: return pageList def catComparator(pageTitle, pageCategories, targetCats, targetList, nonTargetList, lastPage): # unhash to view the categories of each page #print(pageCategories) pageCategories = [i.lower() for i in pageCategories] any_in = False for i in targetCats: if i in pageTitle: any_in = True if any_in: print('', end = '', flush=True) Elif compareLists(targetCats, pageCategories): any_in = True if any_in: targetList.append(pageTitle) else: nonTargetList.append(pageTitle) # Just prints a pretty list, you can comment out until next hash if desired if any_in: print(pageTitle, '(T)', end='', flush=True) else: print(pageTitle, '(F)',end='', flush=True) if pageTitle != lastPage: print(',', end=' ') # No more commenting return any_in def compareLists (a, b): for i in a: for j in b: if i in j: return True return False

コードは、キーワードとサフィックスのリストを各ページのタイトルとそれらのカテゴリと比較して、ページが医学的に関連しているかどうかを判断するだけです。また、より大きなトピックの関連ページ/サブページを調べ、それらが同様に関連しているかどうかを判断します。私は私の薬に精通していないので、カテゴリを許しますが、これは底にタグを付ける例です：

medicalCategories = ['surgery', 'medic', 'disease', 'drugs', 'virus', 'bact', 'fung', 'pharma', 'cardio', 'pulmo', 'sensory', 'nerv', 'derma', 'protein', 'amino', 'unii', 'chlor', 'carcino', 'oxi', 'oxy', 'sis', 'disorder', 'enzyme', 'eine', 'sulf'] listOfPages = ['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous'] categorySorter(medicalCategories, listOfPages, 'Medical')

この例のリストは、少なくとも私の知る限りでは、リストにあるものの約70％を取得します。

Meena Nagarajan · Answer

質問は私には少し不明瞭に見え、解決するのは簡単な問題のようではなく、いくつかのNLPモデルが必要になる場合があります。また、「概念」と「カテゴリ」という用語は同じ意味で使用されます。私が理解しているのは、酵素阻害剤、バイパス手術、高トリグリセリジマなどの概念は、医療として、残りを非医療として組み合わせる必要があるということです。この問題は、カテゴリ名だけではなく、より多くのデータを必要とします。コーパスは、LDAモデルをトレーニングするために必要です（たとえば）。テキスト情報全体がアルゴリズムに供給され、各概念の最も可能性の高いトピックが返されます。

https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/