Read a text file and split it into single words in Python

Question

I have a text file made up of numbers and words, for example `09807754 18 n 03 aristocrat 0 blue_blood 0 patrician`, and I want to split it so that each word or number appears on its own line.

A whitespace separator would be ideal, because I want words containing dashes to stay connected.

This is what I have so far:

    f = open('words.txt', 'r')
    for word in f:
        print(word)

I really don't know how to go on from here. I would like this to be the output:

    09807754
    18
    n
    3
    aristocrat
    ...

dawg · Accepted Answer

If your data is not enclosed in quotes and you just want one word at a time (ignoring the difference between spaces and line breaks in the file):

    with open('words.txt', 'r') as f:
        for line in f:
            for word in line.split():
                print(word)

If you want a nested list of the words on each line of the file (for example, to build a matrix of rows and columns from the file):

    with open("words.txt") as f:
        # one inner list of words per line of the file
        rows = [line.split() for line in f]

Or, if you want to flatten the file into a single flat list of its words, you could do something like this:

    with open('words.txt') as f:
        # one flat list of every word in the file
        words = [word for line in f for word in line.split()]

If you want a regex solution:

    import re

    with open("words.txt") as f:
        for line in f:
            for word in re.findall(r'\w+', line):
                # word by word
                print(word)

Or, if you want a line-by-line generator that uses a regex:

    import re

    with open("words.txt") as f:
        # lazy: consume the generator while the file is still open
        words = (word for line in f for word in re.findall(r'\w+', line))
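One difference between the two approaches above matters for the question: `\w` does not match a hyphen, so the regex variants break dashed words apart, while `split()` leaves them intact. A minimal sketch, assuming a sample line in which `blue-blood` is written with a hyphen (the helper names are made up for illustration):

```python
import re

# Hypothetical sample line; "blue-blood" is hyphenated on purpose.
line = "09807754 18 n 03 aristocrat 0 blue-blood 0 patrician\n"

def split_on_whitespace(text):
    # str.split() with no argument splits on any run of whitespace
    return text.split()

def split_with_regex(text):
    # \w matches letters, digits and underscore, but not '-'
    return re.findall(r'\w+', text)

print(split_on_whitespace(line))  # keeps 'blue-blood' as one item
print(split_with_regex(line))     # yields 'blue' and 'blood' separately
```

If dashed words must stay connected, the whitespace-based versions are probably the safer choice.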

dugres · Answer

    f = open('words.txt')
    for word in f.read().split():
        print(word)

pambda · Answer

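As a supplement: if you are reading a very large file and don't want to load all of its content into memory at once, you can consider reading it through a buffer and returning each word with `yield`.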

    def read_words(inputfile):
        with open(inputfile, 'r') as f:
            while True:
                buf = f.read(10240)
                if not buf:
                    break

                # make sure we end on a space (word boundary)
                while not str.isspace(buf[-1]):
                    ch = f.read(1)
                    if not ch:
                        break
                    buf += ch

                words = buf.split()
                for word in words:
                    yield word
            yield ''  # handle the case where the file is empty

    if __name__ == "__main__":
        for word in read_words('./very_large_file.txt'):
            process(word)  # process() is a placeholder for your own handling
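A possible variation on the same buffering idea: instead of reading one character at a time until a word boundary, carry the unfinished last word over into the next chunk. This is only a sketch, not the answer's code; `iter_words` and `chunk_size` are names chosen here for illustration, and the demo uses a deliberately tiny chunk size on a temporary file.

```python
import os
import tempfile

def iter_words(path, chunk_size=10240):
    """Yield whitespace-separated words, reading chunk_size characters at a time."""
    leftover = ''
    with open(path, 'r') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            words = chunk.split()
            # If the chunk does not end in whitespace, its last word may be cut off,
            # so hold it back and prepend it to the next chunk.
            if words and not chunk[-1].isspace():
                leftover = words.pop()
            else:
                leftover = ''
            for word in words:
                yield word
    # Whatever is still held back at end of file is a complete word.
    if leftover:
        yield leftover

# Demo on a small temporary file, with a chunk size that cuts words in half.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'words.txt')
    with open(path, 'w') as out:
        out.write("09807754 18 n 03 aristocrat 0 blue_blood 0 patrician\n")
    result = list(iter_words(path, chunk_size=4))

print(result)
```

Unlike the version above, this one yields nothing at all for an empty file, so the caller does not have to filter out an empty string.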

Gaurav · Answer

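What you can do is use nltk to tokenize the words and then store all of them in a list. If you don't know nltk: it stands for Natural Language Toolkit and is used to process natural language. Here is a resource if you want to get started: http://www.nltk.org/book/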

    import nltk
    from nltk.tokenize import word_tokenize

    file = open("abc.txt", newline='')
    result = file.read()
    words = word_tokenize(result)
    for i in words:
        print(i)

The output will look like this:

    09807754
    18
    n
    03
    aristocrat
    0
    blue_blood
    0
    patrician

mujad · Answer

    with open(filename) as file:
        words = file.read().split()

This gives a list of all the words in the file. Alternatively, with a regex:

    import re

    with open(filename) as file:
        words = re.findall(r"([a-zA-Z\-]+)", file.read())
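Note that the pattern `[a-zA-Z\-]+` matches only letters and hyphens, so on data like the question's it skips the numbers and splits at underscores. A small sketch to show the effect; the sample string is the question's line with a hypothetical hyphenated word `well-born` added, and `letters_and_hyphens` is a helper name made up here:

```python
import re

# The question's sample line plus a hypothetical hyphenated word.
sample = "09807754 18 n 03 aristocrat 0 blue_blood 0 well-born patrician"

def letters_and_hyphens(text):
    # Same pattern as above: runs of ASCII letters and '-'
    return re.findall(r"([a-zA-Z\-]+)", text)

print(letters_and_hyphens(sample))  # no digits; 'blue_blood' becomes 'blue', 'blood'
print(sample.split())               # every whitespace-separated item, unchanged
```

So the regex version suits text where only the alphabetic words matter, while `split()` is closer to what the question asks for.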

smac89 · Answer

Here is a fully functional approach that avoids having to read and split lines. It makes use of the itertools module:

Note for Python 3: replace `itertools.imap` with `map`

    import itertools

    def readwords(mfile):
        byte_stream = itertools.groupby(
            itertools.takewhile(lambda c: bool(c),
                itertools.imap(mfile.read,
                    itertools.repeat(1))), str.isspace)

        return ("".join(group) for pred, group in byte_stream if not pred)

Sample usage:

    >>> import sys
    >>> for w in readwords(sys.stdin):
    ...     print (w)
    ...
    I really love this new method of reading words in python
    I
    really
    love
    this
    new
    method
    of
    reading
    words
    in
    python

    It's soo very Functional!
    It's
    soo
    very
    Functional!
    >>>

I guess in your case, this would be the way to use the function:

    with open('words.txt', 'r') as f:
        for word in readwords(f):
            print(word)
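For Python 3, the same idea can probably be written a little more compactly. The sketch below is not the code above: it swaps the `takewhile`/`imap`/`repeat` chain for the two-argument form of `iter()`, which keeps calling `mfile.read(1)` until it returns the empty string. It assumes a file opened in text mode, and uses `io.StringIO` as a stand-in for a real file.

```python
import io
import itertools

def readwords(mfile):
    # iter(callable, sentinel): call mfile.read(1) until it returns '' (end of file)
    chars = iter(lambda: mfile.read(1), '')
    # Group consecutive characters by whether they are whitespace
    groups = itertools.groupby(chars, str.isspace)
    # Keep only the non-whitespace runs and join each into a word
    return (''.join(group) for is_space, group in groups if not is_space)

# io.StringIO stands in for a file opened with open('words.txt', 'r')
sample = io.StringIO("09807754 18 n 03 aristocrat 0 blue_blood 0 patrician\n")
words = list(readwords(sample))
print(words)
```

Reading one character per call is simple but slow on large files, so this is best seen as an illustration of the functional style rather than a fast reader.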
