ファイルから用語ドキュメントマトリックスを作成する

Question

example001.txtからexample100.txtまでのファイルのセットがあります。各ファイルには、スーパーセットのキーワードのリストが含まれています（必要に応じてスーパーセットを使用できます）。

したがって、example001.txtには

Apple banana ... otherfruit

これらのファイルを処理してマトリックスに似たものを生成できるようにしたいので、一番上の行にexamples*のリストがあり、下に果物があり、列に「1」がある場合は果物はファイルにあります。

例は...

x example1 example2 example3 Apple 1 1 0 Babana 0 1 0 Coconut 0 1 1

これをまとめるために、ある種のコマンドラインマジックを構築する方法はありますか？私はOSXを使用していて、PerlまたはPythonに満足しています...

slhck · Accepted Answer

Pythonでは、textminingを介してインストールできます

_Sudo pip install textmining _

次に、新しいファイルを作成します–それを_matrix.py_と呼び、以下を追加します。

_#!/usr/bin/env python import textmining import glob tdm = textmining.TermDocumentMatrix() files = glob.glob("/Users/foo/files/*.txt") print(files) for f in files: content = open(f).read() content = content.replace('
', ' ') tdm.add_doc(content) tdm.write_csv('matrix.csv', cutoff=1) _

保存して_chmod +x matrix.py_を呼び出します。ここで、_./matrix.py_を使用して実行します。このプログラムは、glob()で指定されたディレクトリを検索し、出力マトリックスを現在のディレクトリの_matrix.csv_に書き込みます。次のようになります。

ご覧のとおり、唯一の欠点は、ドキュメント名が出力されないことです。ただし、いくつかのbashコマンドを使用して、このリストを前に付けることができます。必要なのはファイル名のリストだけです。

_echo "" > files.txt; find /Users/foo/files/ -type f -iname "*.txt" >> files.txt _

次に、これを_matrix.csv_と一緒に貼り付けます。

_paste -d , files.txt matrix.csv > matrix2.csv _

Voilà、私たちの完全な用語-ドキュメントマトリックス：

それほど複雑でない解決策があると想像できますが、これはPythonであり、正しい行列全体を出力するようにコードを変更するのに十分な知識がありません。

andilabs · Answer

それはほとんどslhckソリューションです。 Pythonスクリプトの中にos.sytemを介して実行されるbashコマンドを追加し、すべてを1つにまとめましたpythonスクリプトをpythonおよびbashコンソール。

#!/usr/bin/env python import textmining import glob import os tdm = textmining.TermDocumentMatrix() files = glob.glob("/Users/andi/Desktop/python_nltk/dane/*.txt") os.system("""echo "" > files.txt; find /Users/andi/Desktop/python_nltk/dane -type f -iname "*.txt" >> files.txt""") print(files) for f in files: content = open(f).read() content = content.replace('
', ' ') tdm.add_doc(content) tdm.write_csv('matrix.csv', cutoff=1) os.system("""paste -d , files.txt matrix.csv > matrix2.csv """)

terdon · Answer

Slhckのpythonソリューションほどきれいなものを提供することはできませんが、純粋なbashソリューションを次に示します。

printf "	" && for file in ex*; do \ printf "%-15s" "$file "; done && echo "" && while read fruit; do \ printf "$fruit	"; for file in ex*; do \ printf "%-15s" `grep -wc $fruit $file`; done; echo ""; done < superset.txt

その恐ろしいものをターミナルにコピーして貼り付けると、果物のリストがsuperset.txtというファイルにあり、1行に1つの果物があると仮定すると、次のようになります。

 example1 example2 example3 Apple 1 2 2 banana 1 1 2 mango 0 1 1 orange 1 1 2 pear 0 1 1 Plum 0 0 1

説明：

printf " "：TABを印刷して、ファイル名をフルーツ名の末尾に揃えます。
for file in ex*; [...] done：ファイル名を出力します（名前がexで始まる唯一のファイルであると想定します。
echo ""：改行を出力します
while read fruit; do [...]; done <list：listは、言及したスーパーセットを含むテキストファイルである必要があります。つまり、すべてのフルーツ、1行に1つのフルーツです。このファイルはこのループで読み取られ、各フルーツは$fruitとして保存されます。
printf "$fruit ";：果物の名前とTABを印刷します。
for file in ex*; do [...]; done：ここで、各ファイルをもう一度調べ、grep -wc $fruit $fileを使用して、現在処理している果物がそのファイルで見つかった回数を取得します。

column を使用することもできるかもしれませんが、私はこれまで試したことがありません。

 The column utility formats its input into multiple columns. Rows are filled before columns. Input is taken from file oper‐ ands, or, by default, from the standard input. Empty lines are ignored unless the -e option is used.

そしてこれがPerlのものです。技術的には、これは長いものではありますが、1つのライナーです。

Perl -e 'foreach $file (@ARGV){open(F,"$file"); while(<F>){chomp; $fruits{$_}{$file}++}} print "	";foreach(sort @ARGV){printf("%-15s",$_)}; print "
"; foreach $fruit (sort keys(%fruits)){print "$fruit	"; do {$fruits{$fruit}{$_}||=0; printf("%-15s",$fruits{$fruit}{$_})} for @ARGV; print "
";}' ex*

ここでは、実際に理解できるコメント付きのスクリプト形式になっています。

#!/usr/bin/env Perl foreach $file (@ARGV){ ## cycle through the files open(F,"$file"); while(<F>){ chomp;## remove newlines ## Count the fruit. This is a hash of hashes ## where the fruit is the first key and the file ## the second. For each fruit then, we will end up ## with something like this: $fruits{Apple}{example1}=1 $fruits{$_}{$file}++; } } print "	"; ## pretty formatting ## Print each of the file names foreach(sort @ARGV){ printf("%-15s",$_) } print "
"; ## pretty formatting ## Now, cycle through each of the "fruit" we ## found when reading the files and print its ## count in each file. foreach $fruit (sort keys(%fruits)){ print "$fruit	"; ## print the fruit names do { $fruits{$fruit}{$_}||=0; ## Count should be 0 if none were found printf("%-15s",$fruits{$fruit}{$_}) ## print the value for each fruit } for @ARGV; print "
"; ## pretty formatting }

これには、スーパーセットが必要ではなく、任意の「フルーツ」に対処できるという利点があります。また、これらのソリューションはどちらもネイティブの* nixツールを使用しており、追加のパッケージをインストールする必要はありません。そうは言っても、slhckの答えのpythonソリューションはより簡潔で、よりきれいな出力を提供します。

Franck Dernoncourt · Answer

Pythonでは、 _sklearn.feature_extraction.text.CountVectorizer.fit_transform_ を使用できます。語彙辞書を学習し、 term-document matrix を返します。

例：

_import sklearn import sklearn.feature_extraction vectorizer = sklearn.feature_extraction.text.CountVectorizer(min_df=1) corpus = ['This is the first document.', 'This is the second second document.', 'And the third one.', 'Is this the first document? This is right.',] X = vectorizer.fit_transform(corpus).toarray() print('X: {0}'.format(X)) print('vectorizer.vocabulary_: {0}'.format(vectorizer.vocabulary_)) _

出力：

_X: [[0 1 1 1 0 0 0 1 0 1] [0 1 0 1 0 0 2 1 0 1] [1 0 0 0 1 0 0 1 1 0] [0 1 1 2 0 1 0 1 0 2]] vectorizer.vocabulary_: {u'and': 0, u'right': 5, u'third': 8, u'this': 9, u'is': 3, u'one': 4, u'second': 6, u'the': 7, u'document': 1, u'first': 2} _

ファイルを操作しているので、メソッド sklearn.feature_extraction.text.CountVectorizer.transform() にも興味があるかもしれません。