Counting word frequency and making a dictionary from it

Question

I want to take every word from a text file and count the word frequency in a dictionary.

Example: 'this is the textfile, and it is used to take words and count'

d = {'this': 1, 'is': 2, 'the': 1, ...}

I am not that far along, and I can't see how to complete it. My code so far:

    import sys

    argv = sys.argv[1]
    data = open(argv)
    words = data.read()
    data.close()
    wordfreq = {}
    for i in words:
        # there should be a counter and somehow it must fill the dict.

Don · Accepted Answer

If you don't want to use collections.Counter, you can write your own function:

    import sys

    filename = sys.argv[1]
    fp = open(filename)
    data = fp.read()
    words = data.split()
    fp.close()

    unwanted_chars = ".,-_ (and so on)"
    wordfreq = {}
    for raw_word in words:
        # strip unwanted characters from both ends of the token
        word = raw_word.strip(unwanted_chars)
        if word not in wordfreq:
            wordfreq[word] = 0
        wordfreq[word] += 1

For finer-grained handling, look at regular expressions.
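The loop above can be wrapped in a reusable function. This is only a minimal sketch, assuming Python 3: the name count_words and the default set of stripped characters are illustrative choices, and skipping tokens that become empty after stripping is an added assumption, not something the answer does.

```python
def count_words(text, unwanted_chars=".,-_"):
    """Return a dict mapping each whitespace-separated word to its count."""
    wordfreq = {}
    for raw_word in text.split():
        # Strip the unwanted characters from both ends of the token.
        word = raw_word.strip(unwanted_chars)
        if not word:
            # Assumption: tokens made only of unwanted characters are skipped.
            continue
        wordfreq[word] = wordfreq.get(word, 0) + 1
    return wordfreq


freq = count_words("this is the textfile, and it is used to take words and count")
print(freq["is"], freq["and"], freq["textfile"])  # 2 2 1
```

Passing the text in as a string, rather than reading sys.argv inside the function, keeps the counting logic easy to try out on small inputs.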

Grijesh Chauhan · Answer

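Although using Counter from the collections library, as suggested by @Michael, is the better approach, I am adding this answer just to improve your code (I believe it will be helpful for a new Python learner):

From the comment in your code, it seems you want to improve your own code. I also think you are able to read the file content into words (although I usually avoid the read() function and use code of the form for line in file_descriptor:).

Because words is a string, in the loop for i in words: the loop variable i is not a word but a character. You are iterating over the characters of the string, not over the words in the string words. To see this, look at the following code snippet:

    >>> for i in "Hi, h r u?":
    ...     print(i)
    ...
    H
    i
    ,
     
    h
     
    r
     
    u
    ?
    >>>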

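Iterating over the string character by character instead of word by word is not what you wanted, so to iterate word by word you should use the split method of Python's string class.
The str.split(sep=None, maxsplit=-1) method returns a list of all the words in the string, using sep as the separator (it splits on all whitespace if sep is left unspecified), optionally limiting the number of splits to maxsplit.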
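As a quick illustration of the two parameters, here is a small sketch, assuming Python 3. The variable names are made up for the example.

```python
text = "Hi, how are you?"

# No separator given: split on runs of whitespace.
by_whitespace = text.split()
# Explicit separator: split on every comma.
by_comma = text.split(",")
# maxsplit=1: perform at most one split and leave the rest intact.
first_and_rest = text.split(None, 1)

print(by_whitespace)   # ['Hi,', 'how', 'are', 'you?']
print(by_comma)        # ['Hi', ' how are you?']
print(first_and_rest)  # ['Hi,', 'how are you?']
```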

Note the code examples below.

Split:

    >>> "Hi, how are you?".split()
    ['Hi,', 'how', 'are', 'you?']

Loop with split:

    >>> for i in "Hi, how are you?".split():
    ...     print(i)
    ...
    Hi,
    how
    are
    you?

This looks like what you need, except for the word Hi,. Because split() splits on whitespace by default, Hi, is kept as a single string, comma included, and that is obviously not what you want when counting the frequency of words in a file.

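One good solution is to use a regular expression, but to keep the answer simple I will first answer with the replace() method. The method str.replace(old, new[, count]) returns a copy of the string in which occurrences of old have been replaced with new, optionally limiting the number of replacements to count.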
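A short sketch of replace() with and without the optional count, assuming Python 3. The sample string and variable names are illustrative only.

```python
text = "a,b,c,d"

# Replace every comma with a space.
all_replaced = text.replace(",", " ")
# Replace only the first two commas.
first_two = text.replace(",", " ", 2)

print(all_replaced)  # a b c d
print(first_two)     # a b c,d
print(text)          # a,b,c,d  (strings are immutable; the original is unchanged)
```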

Now check the code examples below for what I want to suggest:

    >>> "Hi, how are you?".split()
    ['Hi,', 'how', 'are', 'you?']    # it has , with Hi
    >>> "Hi, how are you?".replace(',', ' ').split()
    ['Hi', 'how', 'are', 'you?']     # , replaced by space, then split

Loop:

    >>> for word in "Hi, how are you?".replace(',', ' ').split():
    ...     print(word)
    ...
    Hi
    how
    are
    you?

Now, how to count the frequency:

One way is to use Counter, as @Michael suggested. To keep your own approach, in which you want to start from an empty dict, do something like this:

    words = f.read()
    wordfreq = {}
    for word in words.replace(',', ' ').split():
        wordfreq[word] = wordfreq.setdefault(word, 0) + 1
        #                ^^ add 1 to 0 or old value from dict

What am I doing? Because wordfreq is empty at first, you cannot read wordfreq[word] the first time a word appears (it raises a KeyError), so I used the dict method setdefault.

dict.setdefault(key, default=None) is similar to get(), but it also sets dict[key] = default if the key is not already in the dict. So when a new word comes up for the first time, setdefault stores 0 for it in the dict, and then I add 1 and assign the result back to the same dict.
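The difference between get() and setdefault() can be seen in a few lines. A minimal sketch, assuming Python 3; the snapshot variables exist only to make the intermediate states visible.

```python
wordfreq = {}

# get() returns the default but does not store anything.
looked_up = wordfreq.get("hi", 0)
snapshot_after_get = dict(wordfreq)

# setdefault() returns the default and also stores it under the key.
stored = wordfreq.setdefault("hi", 0)
snapshot_after_setdefault = dict(wordfreq)

# The counting idiom from the answer, applied twice to the same word.
for _ in range(2):
    wordfreq["hi"] = wordfreq.setdefault("hi", 0) + 1

print(looked_up, snapshot_after_get)      # 0 {}
print(stored, snapshot_after_setdefault)  # 0 {'hi': 0}
print(wordfreq)                           # {'hi': 2}
```

Since the result is assigned back immediately, wordfreq.get(word, 0) + 1 would count just as well here; setdefault only matters when the stored default is needed later.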

I wrote equivalent code using with open instead of a plain open:

    import os

    # open() does not expand "~" by itself, so expand it explicitly
    with open(os.path.expanduser('~/Desktop/file')) as f:
        words = f.read()

    wordfreq = {}
    for word in words.replace(',', ' ').split():
        wordfreq[word] = wordfreq.setdefault(word, 0) + 1
    print(wordfreq)

It runs like this:

    $ cat file  # file is
    this is the textfile, and it is used to take words and count
    $ python work.py  # indented manually
    {'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2,
     'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1,
     'textfile': 1}

Using re.split(pattern, string, maxsplit=0, flags=0)

Just change the for loop to for i in re.split(r"[,\s]+", words): and it should produce the correct output.

Edit: it is better to find all alphanumeric characters, because the text may contain more than one kind of punctuation symbol.

    >>> re.findall(r'[\w]+', words)  # manually indented output
    ['this', 'is', 'the', 'textfile', 'and', 'it', 'is',
     'used', 'to', 'take', 'words', 'and', 'count']

Use the for loop as: for word in re.findall(r'[\w]+', words):
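To compare the two regex approaches side by side, here is a small sketch, assuming Python 3; the sample strings are illustrative. Note that \w also matches the underscore, and in Python 3 it matches Unicode letters and digits by default, so it is slightly broader than "alphanumeric".

```python
import re

words = "Hi, how are you?"

# re.split cuts on the listed separators only, so "?" stays attached.
split_tokens = re.split(r"[,\s]+", words)
# re.findall keeps runs of word characters and drops everything else.
found_tokens = re.findall(r"\w+", words)
# A trailing separator makes re.split return an empty string at the end.
trailing = re.split(r"[,\s]+", "a, b, ")

print(split_tokens)  # ['Hi', 'how', 'are', 'you?']
print(found_tokens)  # ['Hi', 'how', 'are', 'you']
print(trailing)      # ['a', 'b', '']
```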

How would I write the code without using read()?

The file is:

    $ cat file
    This is the text file, and it is used to take words and count. And multiple
    Lines can be present in this file.
    It is also possible that Same words repeated in with capital letters.

The code is:

    $ cat work.py
    import re

    wordfreq = {}
    with open('file') as f:
        for line in f:
            for word in re.findall(r'[\w]+', line.lower()):
                wordfreq[word] = wordfreq.setdefault(word, 0) + 1

    print(wordfreq)

I used lower() to convert uppercase letters to lowercase.

Output:

    $ python work.py  # manually stripped output
    {'and': 3, 'letters': 1, 'text': 1, 'is': 3, 'it': 2, 'file': 2,
     'in': 2, 'also': 1, 'same': 1, 'to': 1, 'take': 1, 'capital': 1,
     'be': 1, 'used': 1, 'multiple': 1, 'that': 1, 'possible': 1,
     'repeated': 1, 'words': 2, 'with': 1, 'present': 1, 'count': 1,
     'this': 2, 'lines': 1, 'can': 1, 'the': 1}

Michael · Answer

    from collections import Counter
    t = 'this is the textfile, and it is used to take words and count'

    dict(Counter(t.split()))
    >>> {'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}

Or, better, remove punctuation before counting:

    dict(Counter(t.replace(',', '').replace('.', '').split()))
    >>> {'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile': 1}
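Chaining one replace() per punctuation mark gets long quickly. One possible alternative, sketched here under the assumption of Python 3, deletes every ASCII punctuation character in a single pass with str.translate. This technique is swapped in for illustration and is not part of the answer above.

```python
import string
from collections import Counter

t = 'this is the textfile, and it is used to take words and count.'

# Build a translation table that deletes all ASCII punctuation characters.
delete_punctuation = str.maketrans('', '', string.punctuation)
counts = Counter(t.translate(delete_punctuation).split())

print(counts['is'], counts['and'], counts['textfile'], counts['count'])  # 2 2 1 1
```

Deleting punctuation outright also removes apostrophes and hyphens inside words (for example, "don't" becomes "dont"), which may or may not be what you want.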

user1749431 · Answer

The following takes the string, splits it into a list with split(), loops over the list, and counts the frequency of each item in the sentence with Python's count function, count(). Each word i and its frequency are placed as a tuple in an initially empty list, ls, which is then converted into key/value pairs with dict().

    sentence = 'this is the textfile, and it is used to take words and count'.split()
    ls = []
    for i in sentence:
        word_count = sentence.count(i)  # Python's count function, count()
        ls.append((i, word_count))

    dict_ = dict(ls)
    print(dict_)

Output: {'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}
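Note that list.count() rescans the whole list on every call, so repeated words are counted more than once. A possible variation, sketched here assuming Python 3, calls count() once per distinct word by iterating over a set. It is an illustrative tweak, not the answerer's code.

```python
sentence = 'this is the textfile, and it is used to take words and count'.split()

# Iterate over the distinct words only, so each word is counted a single time.
dict_ = {word: sentence.count(word) for word in set(sentence)}

print(dict_['is'], dict_['and'], dict_['textfile,'])  # 2 2 1
```

This still scans the list once per distinct word, so for large texts a single-pass dict or collections.Counter is likely to be faster.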

Rajeev Sharma · Answer

    # open your text book, counting word frequency
    File_obj = open("Counter.txt", 'r')
    w_list = File_obj.read()
    File_obj.close()
    print(w_list.split())

    di = dict()
    for word in w_list.split():
        if word in di:
            di[word] = di[word] + 1
        else:
            di[word] = 1

    max_count = max(di.values())
    largest = -1
    maxusedword = ''
    for k, v in di.items():
        print(k, v)
        # remember the word with the highest count seen so far
        if v > largest:
            largest = v
            maxusedword = k

    print(maxusedword, largest)

Rangita R · Answer

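You can also use a default dictionary with int type.

    from collections import defaultdict

    wordDict = defaultdict(int)
    text = 'this is the textfile, and it is used to take words and count'.split(" ")
    for word in text:
        wordDict[word] += 1

Explanation: we initialize a default dictionary whose values are of type int. This way the default value for any key is 0, and we don't need to check whether a key is already present in the dictionary. We then split the text on spaces into a list of words, iterate over the list, and increment the count for each word.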
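The default-value behaviour can be checked directly. A minimal sketch, assuming Python 3; the key names are illustrative. One side effect worth knowing: merely reading a missing key inserts it with the default value.

```python
from collections import defaultdict

word_dict = defaultdict(int)

# int() with no arguments returns 0, which is why missing keys start at 0.
default_value = int()

word_dict["this"] += 1                      # no KeyError on the first increment
present_before = "missing" in word_dict     # the key has not been touched yet
read_value = word_dict["missing"]           # reading creates the key with value 0
present_after = "missing" in word_dict

print(default_value)                  # 0
print(present_before, present_after)  # False True
print(dict(word_dict))                # {'this': 1, 'missing': 0}
```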

AnitaAgrawal · Answer

My approach is to do a few things from the ground up:

Remove punctuation from the text input.
Make a list of words.
Remove empty strings.
Iterate through the list.
Make each new word a key in the dictionary with value 1.
If a word already exists as a key, increment its value by one.

    text = '''this is the textfile, and it is used to take words and count'''
    word = ''      # this will hold each word
    wordList = []  # this will be the collection of words
    for ch in text:  # traverse the text character by character
        # a character in a-z, A-Z or 0-9 is valid, so add it to the word string
        if (ch >= 'a' and ch <= 'z') or (ch >= 'A' and ch <= 'Z') or (ch >= '0' and ch <= '9'):
            word += ch
        elif ch == ' ':  # a single space is a separator
            wordList.append(word)  # append the word to the list
            word = ''  # empty the word to collect the next word
    wordList.append(word)  # append the last word, since the loop ended before adding it
    print(wordList)

    wordCountDict = {}  # empty dictionary which will hold the word counts
    for word in wordList:  # traverse the word list
        if wordCountDict.get(word.lower(), 0) == 0:
            # the word doesn't exist yet, so make an entry with value 1
            wordCountDict[word.lower()] = 1
        else:
            # the word exists, so increment the value by one
            wordCountDict[word.lower()] = wordCountDict[word.lower()] + 1
    print(wordCountDict)

Another approach:

    text = '''this is the textfile, and it is used to take words and count'''
    # replace each punctuation character (and newline) with a space
    for ch in '.\'!")(,;:?-\n':
        text = text.replace(ch, ' ')
    wordsArray = text.split(' ')
    wordDict = {}
    for word in wordsArray:
        if len(word) == 0:
            continue  # skip the empty strings produced by consecutive spaces
        else:
            wordDict[word.lower()] = wordDict.get(word.lower(), 0) + 1
    print(wordDict)

Fuji Komalan · Answer

    sentence = "this is the textfile, and it is used to take words and count"

    # split the sentence into words.
    # iterate through every word
    counter_dict = {}
    for word in sentence.lower().split():
        # add the word into the counter_dict, initialized with 0
        if word not in counter_dict:
            counter_dict[word] = 0
        # increase its count by 1
        counter_dict[word] += 1