Python

English has a number of contractions. For example:

you've -> you have
he's -> he is

These can be a headache when doing natural language processing. Is there a Python library that can expand these contractions?

33
Maarten

I turned the Wikipedia contraction-to-expansion page into a Python dictionary (see below).

Note that, as you might expect, you need to use double quotes when querying the dictionary, because the keys themselves contain apostrophes.

I also left multiple options in place, as on the Wikipedia page. Feel free to modify it as you wish. Note that disambiguating to the right expansion is a tricky problem!

contractions = { 
"ain't": "am not / are not / is not / has not / have not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is / how does",
"I'd": "I had / I would",
"I'd've": "I would have",
"I'll": "I shall / I will",
"I'll've": "I shall have / I will have",
"I'm": "I am",
"I've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}
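
A minimal sketch of how such a table might be queried. Ambiguous entries hold several options separated by " / ", so the helper below splits them into a list of candidates. The name candidate_expansions and the three-entry excerpt are assumptions made for illustration, not part of the original answer.

```python
# Excerpt of the table above, repeated so the sketch runs on its own.
contractions_excerpt = {
    "can't": "cannot",
    "he's": "he has / he is",
    "you've": "you have",
}


def candidate_expansions(word, table=contractions_excerpt):
    """Return every listed expansion of `word`, or [word] if it is not in the table."""
    options = table.get(word)
    if options is None:
        return [word]
    # Ambiguous entries list their options separated by "/".
    return [option.strip() for option in options.split("/")]


print(candidate_expansions("he's"))    # ['he has', 'he is']
print(candidate_expansions("you've"))  # ['you have']
print(candidate_expansions("cat"))     # ['cat']
```

Choosing among the candidates is the disambiguation problem mentioned above; this sketch only exposes the options.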
47
arturomp

You don't need a library; for example, you can do it with regular expressions.

>>> import re
>>> contractions_dict = {
...     'didn\'t': 'did not',
...     'don\'t': 'do not',
... }
>>> contractions_re = re.compile('(%s)' % '|'.join(contractions_dict.keys()))
>>> def expand_contractions(s, contractions_dict=contractions_dict):
...     def replace(match):
...         return contractions_dict[match.group(0)]
...     return contractions_re.sub(replace, s)
...
>>> expand_contractions('You don\'t need a library')
'You do not need a library'
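
The pattern above puts the keys into the regular expression unescaped and matches them anywhere, including inside longer words. A possible variant, offered as a sketch rather than as part of the original answer, escapes the keys with re.escape, tries longer keys first, anchors on word boundaries and ignores case. The name expand_contractions_safe and the handling of a leading capital are assumptions made for illustration.

```python
import re

contractions_dict = {
    "didn't": "did not",
    "don't": "do not",
}

# Longest keys first, so a longer key wins over any shorter key it starts with.
_keys = sorted(contractions_dict, key=len, reverse=True)
contractions_re = re.compile(
    r"\b(%s)\b" % "|".join(re.escape(key) for key in _keys),
    flags=re.IGNORECASE,
)


def expand_contractions_safe(s, table=contractions_dict):
    def replace(match):
        found = match.group(0)
        expanded = table[found.lower()]
        # Keep a leading capital, e.g. "Don't" -> "Do not".
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return contractions_re.sub(replace, s)


print(expand_contractions_safe("Don't worry, you don't need a library"))
# Do not worry, you do not need a library
```

Keys that begin with an apostrophe, such as 'cause, would need a different left anchor than \b, and the lookup assumes that all keys are stored in lower case.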
17
alko

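The answers above work perfectly well and may be better for ambiguous contractions (although I would argue that there are not that many ambiguous cases). I would use something more readable and easier to maintain: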

import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase


test = "Hey I'm Yann, how're you and how's it going ? That's interesting: I'd love to hear more about it."
print(decontracted(test))
# Hey I am Yann, how are you and how is it going ? That is interesting: I would love to hear more about it.

It might have some flaws I didn't think of.
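
One flaw of this kind can be shown with a short sketch: the general suffix rules cannot tell a possessive 's from "is", or a 'd that stands for "had" from one that stands for "would". The function name expand_general_suffixes is hypothetical; it repeats two of the rules from the function above so that the example runs on its own.

```python
import re


def expand_general_suffixes(phrase):
    # Two of the general rules from the function above, applied unconditionally.
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    return phrase


# The possessive 's is expanded as if it were "is".
print(expand_general_suffixes("That is John's book"))
# That is John is book

# Here 'd stands for "had", but the rule always picks "would".
print(expand_general_suffixes("She said she'd gone."))
# She said she would gone.
```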

Reposted from another answer.

7
Yann Dubois

Here is a very cool and easy-to-use library for this: https://pypi.python.org/pypi/pycontractions/1.0.1

Example of use (details in the link):

from pycontractions import Contractions

# Load your favorite Word2vec model
cont = Contractions('GoogleNews-vectors-negative300.bin')

# optional, prevents loading on first expand_texts call
cont.load_models()

out = list(cont.expand_texts(["I'd like to know how I'd done that!",
                            "We're going to the Zoo and I don't think I'll be home for dinner.",
                            "Theyre going to the Zoo and she'll be home for dinner."], precise=True))
print(out)

You also need GoogleNews-vectors-negative300.bin; a download link is given on the pycontractions page linked above. *The sample code is for Python 3.

5
Joe9008

I found a library for this, contractions. It is very simple to use.

import contractions
print(contractions.fix("you've"))
print(contractions.fix("he's"))

Output:

you have
he is
5
Hammad Hassan

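I would like to add a little to alko's answer here. If you check Wikipedia, the number of English contractions listed there is less than 100. In real scenarios the number may well be higher, but I am still fairly sure that 200 to 300 entries are enough to cover English contractions. Do you really want a separate library for those? (I don't think what you are looking for actually exists, though.) You can solve this problem easily with a dictionary and regular expressions. I would recommend using a good tokenizer such as the one in the Natural Language Toolkit; the rest you should have no problem implementing yourself.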
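
A minimal sketch of the dictionary-plus-tokenizer idea. So that it runs with the standard library only, a simple regular expression stands in for the tokenizer here; an NLTK tokenizer could be substituted for it. The names CONTRACTIONS, TOKEN_RE and expand_tokens, and the three-entry table, are assumptions made for illustration.

```python
import re

# Small illustrative table; a real one would hold a few hundred entries.
CONTRACTIONS = {
    "won't": "will not",
    "i'm": "I am",
    "they're": "they are",
}

# Stand-in tokenizer: a word (with inner apostrophes), a run of whitespace,
# or any other single character.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|\s+|.")


def expand_tokens(text, table=CONTRACTIONS):
    tokens = TOKEN_RE.findall(text)
    # Look each token up in lower case; tokens not in the table pass through unchanged.
    return "".join(table.get(token.lower(), token) for token in tokens)


print(expand_tokens("I'm sure they're right, but I won't ask."))
# I am sure they are right, but I will not ask.
```

Working token by token avoids rewriting text inside longer words, but this sketch does not restore capitalisation and does nothing about ambiguous contractions.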

3

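This is an old question, but I figured I might as well answer, since as far as I can see there is still no real solution to it.

I had to work on this in a related NLP project, and since nothing seemed to exist for it, I decided to tackle the problem myself. If you are interested, you can check out my expander GitHub repository.

It is a fairly badly optimized (I think) program based on NLTK, the Stanford Core NLP models (which you have to download separately) and the dictionary from the previous answer. All the necessary information should be in the README and the lavishly commented code. I know that commented code is dead code.

The example input to expander.py is the following sentences: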

    ["I won't let you get away with that",  # won't ->  will not
    "I'm a bad person",  # 'm -> am
    "It's his cat anyway",  # 's -> is
    "It's not what you think",  # 's -> is
    "It's a man's world",  # 's -> is and 's possessive
    "Catherine's been thinking about it",  # 's -> has
    "It'll be done",  # 'll -> will
    "Who'd've thought!",  # 'd -> would, 've -> have
    "She said she'd go.",  # she'd -> she would
    "She said she'd gone.",  # she'd -> had
    "Y'all'd've a great time, wouldn't it be so cold!", # Y'all'd've -> You all would have, wouldn't -> would not
    " My name is Jack.",   # No replacements.
    "'Tis questionable whether Ma'am should be going.", # 'Tis -> it is, Ma'am -> madam
    "As history tells, 'twas the night before Christmas.", # 'Twas -> It was
    "Martha, Peter and Christine've been indulging in a menage-à-trois."] # 've -> have

The output is:

    ["I will not let you get away with that",
    "I am a bad person",
    "It is his cat anyway",
    "It is not what you think",
    "It is a man's world",
    "Catherine has been thinking about it",
    "It will be done",
    "Who would have thought!",
    "She said she would go.",
    "She said she had gone.",
    "You all would have a great time, would not it be so cold!",
    "My name is Jack.",
    "It is questionable whether Madam should be going.",
    "As history tells, it was the night before Christmas.",
    "Martha, Peter and Christine have been indulging in a menage-à-trois."]

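So this is a small set of test sentences that I put together to test some edge cases.

Since this project has lost its importance for now, I am no longer actively developing it. Any help with the project would be appreciated. What remains to be done is written in the TODO list. And if you have any tips on how to improve my Python, I would be very thankful as well.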

0
Yannick