Latent Dirichlet Allocation (LDA) in Spark

Question

I am trying to write a program that runs Latent Dirichlet Allocation (LDA) in Spark. This Spark documentation page gives a good example of running LDA on sample data. Here is the program:

    from pyspark.mllib.clustering import LDA, LDAModel
    from pyspark.mllib.linalg import Vectors

    # Load and parse the data
    data = sc.textFile("data/mllib/sample_lda_data.txt")
    parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))

    # Index documents with unique IDs
    corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

    # Cluster the documents into three topics using LDA
    ldaModel = LDA.train(corpus, k=3)

    # Output topics. Each is a distribution over words (matching word count vectors)
    print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):")
    topics = ldaModel.topicsMatrix()
    for topic in range(3):
        print("Topic " + str(topic) + ":")
        for word in range(0, ldaModel.vocabSize()):
            print(" " + str(topics[word][topic]))

    # Save and load model
    ldaModel.save(sc, "target/org/apache/spark/PythonLatentDirichletAllocationExample/LDAModel")
    sameModel = LDAModel\
        .load(sc, "target/org/apache/spark/PythonLatentDirichletAllocationExample/LDAModel")

The sample input used (sample_lda_data.txt) is as follows:

    1 2 6 0 2 3 1 1 0 0 3
    1 3 0 1 3 0 0 2 0 0 1
    1 4 1 0 0 4 9 0 1 2 0
    2 1 0 3 0 0 5 0 2 3 9
    3 1 1 9 3 0 2 0 0 1 3
    4 2 0 3 4 5 1 1 1 4 0
    2 1 0 3 0 0 5 0 2 2 9
    1 1 1 9 2 1 2 0 0 1 3
    4 4 0 3 4 2 1 3 0 0 0
    2 8 2 0 3 0 2 0 2 7 2
    1 1 1 9 0 2 2 0 0 3 3
    4 1 0 0 4 5 1 3 0 1 0

How do I change the program so that it runs on a text file containing actual text instead of numbers? Suppose the sample file contains the following text:

Latent Dirichlet allocation (LDA) is a topic model which infers topics from a collection of text documents. LDA can be thought of as a clustering algorithm as follows:

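Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset. Topics and documents both exist in a feature space, where feature vectors are vectors of word counts (bag of words). Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.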

prashanth · Accepted Answer

After doing some research, I am attempting to answer this question myself. Below is sample code that uses Spark to run LDA on a text document containing real text data.

    from pyspark.sql import SQLContext, Row
    from pyspark.ml.feature import CountVectorizer
    from pyspark.mllib.clustering import LDA, LDAModel
    from pyspark.mllib.linalg import Vector, Vectors

    path = "sample_text_LDA.txt"

    # One document per line; zipWithIndex yields (line, index) pairs.
    # Tuple-unpacking lambdas are Python 2 only, so index into the pair instead.
    data = sc.textFile(path).zipWithIndex().map(
        lambda pair: Row(idd=pair[1], words=pair[0].split(" ")))
    docDF = spark.createDataFrame(data)

    # Turn each list of words into a vector of word counts
    # (named vectorizer so it does not shadow the imported Vector class)
    vectorizer = CountVectorizer(inputCol="words", outputCol="vectors")
    model = vectorizer.fit(docDF)
    result = model.transform(docDF)

    # mllib's LDA expects an RDD of [document id, mllib vector]
    corpus = result.select("idd", "vectors").rdd.map(
        lambda row: [row[0], Vectors.fromML(row[1])]).cache()

    # Cluster the documents into three topics using LDA
    ldaModel = LDA.train(corpus, k=3, maxIterations=100, optimizer='online')
    topics = ldaModel.topicsMatrix()
    vocabArray = model.vocabulary

    wordNumbers = 10  # number of words per topic
    topicIndices = sc.parallelize(ldaModel.describeTopics(maxTermsPerTopic=wordNumbers))

    def topic_render(topic):
        # Map the term indices of a topic back to the actual words
        terms = topic[0]
        result = []
        for i in range(wordNumbers):
            term = vocabArray[terms[i]]
            result.append(term)
        return result

    topics_final = topicIndices.map(lambda topic: topic_render(topic)).collect()

    for topic in range(len(topics_final)):
        print("Topic" + str(topic) + ":")
        for term in topics_final[topic]:
            print(term)
        print('\n')
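If the two less obvious steps are unclear (CountVectorizer turning word lists into count vectors, and topic_render mapping term indices back to words), here is a minimal plain-Python sketch of the idea that runs without Spark. The helper names build_vocabulary, to_count_vector and render_topic, the two toy documents, and the term indices passed to render_topic are all made up for illustration. Ordering the vocabulary by total count with alphabetical tie-breaking is also an assumption of this sketch, not a description of how Spark does it internally.

```python
from collections import Counter


def build_vocabulary(docs):
    """Order terms by total count (highest first); ties are broken alphabetically.

    The tie-breaking rule is an assumption made for this sketch.
    """
    totals = Counter(word for doc in docs for word in doc)
    return sorted(totals, key=lambda term: (-totals[term], term))


def to_count_vector(doc, vocabulary):
    """Bag of words: one count per vocabulary term, in vocabulary order."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]


def render_topic(term_indices, vocabulary):
    """Translate a topic's term indices into the words they stand for."""
    return [vocabulary[i] for i in term_indices]


# Two toy documents (hypothetical), split on spaces like the Spark code does
docs = [
    "spark runs lda on text".split(" "),
    "lda infers topics from text".split(" "),
]

vocabulary = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
print(vectors)
# Hypothetical term indices, standing in for what describeTopics would return
print(render_topic([0, 7, 3], vocabulary))
```

In the Spark version, model.vocabulary plays the role of vocabulary here, and the first element of each entry returned by describeTopics plays the role of the term indices.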

Running the Spark code on the text data mentioned in the question extracts the following topics: