DocumentTermMatrix error on Corpus argument

Question

I have the following code:

    # returns string w/o leading or trailing whitespace
    trim <- function (x) gsub("^\\s+|\\s+$", "", x)

    news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings.

    corpus_clean <- tm_map(news_corpus, tolower)
    corpus_clean <- tm_map(corpus_clean, removeNumbers)
    corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
    corpus_clean <- tm_map(corpus_clean, removePunctuation)
    corpus_clean <- tm_map(corpus_clean, stripWhitespace)
    corpus_clean <- tm_map(corpus_clean, trim)

    news_dtm <- DocumentTermMatrix(corpus_clean) # errors here

When I run the DocumentTermMatrix() method, I get the following error:

    Error: inherits(doc, "TextDocument") is not TRUE

Why do I get this error? Are my rows not text documents?

Here is the output when inspecting corpus_clean:

    [[153]]
    [1] obama holds technical school model us
    [[154]]
    [1] oil boom produces jobs bonanza archaeologists
    [[155]]
    [1] islamic terrorist group expands territory captures tikrit
    [[156]]
    [1] republicans democrats feel eric cantors loss
    [[157]]
    [1] tea party candidates try build cantor loss
    [[158]]
    [1] vehicles materials stored delaware bridges
    [[159]]
    [1] hill testimony hagel defends bergdahl trade
    [[160]]
    [1] tweet selfpropagates tweetdeck
    [[161]]
    [1] blackwater guards face trial iraq shootings
    [[162]]
    [1] calif man among soldiers killed afghanistan
    [[163]]
    [1] stocks fall back world bank cuts growth outlook
    [[164]]
    [1] jabhat alnusra longer useful turkey
    [[165]]
    [1] catholic bishops keep focus abortion marriage
    [[166]]
    [1] barbra streisand visits hill heart disease
    [[167]]
    [1] rand paul cantors loss reason stop talking immigration
    [[168]]
    [1] israeli airstrike kills northern gaza

Edit: Here is my data:

    type,text
    neutral,The week in 32 photos
    neutral,Look at me! 22 selfies of the week
    neutral,Inside rebel tunnels in Homs
    neutral,Voices from Ukraine
    neutral,Water dries up ahead of World Cup
    positive,Who's your hero? Nominate them
    neutral,Anderson Cooper: Here's how
    positive,"At fire scene, she rescues the pet"
    neutral,Hunger in the land of plenty
    positive,Helping women escape 'the life'
    neutral,A tour of the sex underworld
    neutral,Miss Universe Thailand steps down
    neutral,China's 'naked officials' crackdown
    negative,More held over Pakistan stoning
    neutral,Watch landmark Cold War series
    neutral,In photos: History of the Cold War
    neutral,Turtle predicts World Cup winner
    neutral,What devoured great white?
    positive,Nun wins Italy's 'The Voice'
    neutral,Bride Price app sparks debate
    neutral,China to deport 'pork' artist
    negative,Lightning hits moving car
    neutral,Singer won't be silenced
    neutral,Poland's mini desert
    neutral,When monarchs retire
    negative,Murder on Street View?
    positive,Meet armless table tennis champ
    neutral,Incredible 400 year-old globes
    positive,Man saves falling baby
    neutral,World's most controversial foods

Which I read in like this:

    news_raw <- read.csv('news_csv.csv', stringsAsFactors = F)

Edit: Here is the traceback():

    > news_dtm <- DocumentTermMatrix(corpus_clean)
    Error: inherits(doc, "TextDocument") is not TRUE
    > traceback()
    9: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"),
           ch), call. = FALSE, domain = NA)
    8: stopifnot(inherits(doc, "TextDocument"), is.list(control))
    7: FUN(X[[1L]], ...)
    6: lapply(X, FUN, ...)
    5: mclapply(unname(content(x)), termFreq, control)
    4: TermDocumentMatrix.VCorpus(x, control)
    3: TermDocumentMatrix(x, control)
    2: t(TermDocumentMatrix(x, control))
    1: DocumentTermMatrix(corpus_clean)

When I evaluate inherits(corpus_clean, "TextDocument"), it is FALSE.

MrFlick · Accepted Answer

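This seems to have worked fine in tm 0.5.10, but changes in tm 0.6.0 appear to have broken it. The problem is that the functions tolower and trim do not necessarily return TextDocuments (the older version seems to have done the conversion automatically). They return characters instead, and DocumentTermMatrix does not know how to handle a corpus of characters.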
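The class change described above can be checked per document. The following is a minimal sketch, assuming tm 0.6 or later; `headlines` and `is_text_doc` are made-up names for illustration, with `headlines` standing in for `news_raw$text`.

```r
library(tm)

# Hypothetical stand-in for news_raw$text
headlines <- c("The week in 32 photos", "Voices from Ukraine")
corpus <- VCorpus(VectorSource(headlines))

# Hypothetical helper: one logical per document, TRUE if it is still a TextDocument
is_text_doc <- function(corp) {
  vapply(content(corp), inherits, logical(1), what = "TextDocument")
}

before  <- is_text_doc(corpus)
bare    <- is_text_doc(tm_map(corpus, tolower))                      # base function passed as-is
wrapped <- is_text_doc(tm_map(corpus, content_transformer(tolower))) # base function wrapped

print(rbind(before, bare, wrapped))
```

Note that the check is applied to the individual documents, not to the corpus object itself, which is why inherits(corpus_clean, "TextDocument") on the whole corpus is not informative. On affected versions the `bare` row would be expected to show FALSE.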

So you could change to

    corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

or you can run

    corpus_clean <- tm_map(corpus_clean, PlainTextDocument)

after all of your non-standard transformations (those not in getTransformations()) are done and just before you create the DocumentTermMatrix. That should make sure all of your data is in PlainTextDocument and should let DocumentTermMatrix work correctly.
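Putting the first option together, the question's pipeline might look like the sketch below. It assumes tm 0.6 or later and R 3.2.0 or later (for `trimws`, which is substituted here for the custom `trim` function); `news_text` is a made-up sample standing in for `news_raw$text`.

```r
library(tm)

# Hypothetical sample standing in for news_raw$text
news_text <- c("The week in 32 photos",
               "Look at me! 22 selfies of the week",
               "Inside rebel tunnels in Homs")

news_corpus <- VCorpus(VectorSource(news_text))

# Base R functions are wrapped so that each result stays a text document
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

# Transformations listed by getTransformations() are passed directly
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("english"))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)

# trimws() is base R's whitespace trimmer, used in place of the custom trim()
corpus_clean <- tm_map(corpus_clean, content_transformer(trimws))

news_dtm <- DocumentTermMatrix(corpus_clean)
print(dim(news_dtm))
```

Wrapping only the base functions keeps the fix local to the two calls that caused the class change, so no final conversion to PlainTextDocument is needed in this variant.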

Rodrigo Araujo · Answer

I found a way to solve this problem in an article about TM.

An example that produces the error follows:

    getwd()
    require(tm)
    files <- DirSource(directory="texts/", encoding="latin1") # import files
    corpus <- VCorpus(x=files) # load files, create corpus

    summary(corpus) # get a summary
    corpus <- tm_map(corpus,removePunctuation)
    corpus <- tm_map(corpus,stripWhitespace)
    corpus <- tm_map(corpus,removePunctuation);
    matrix_terms <- DocumentTermMatrix(corpus)

Warning messages:

    In TermDocumentMatrix.VCorpus(x, control) : invalid document identifiers

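This error occurs because building a Term Document Matrix requires an object of class Vector Source, but the preceding transformations convert the corpus of texts to character, changing it to a class the function does not accept.

However, if you add the function content_transformer inside the tm_map command, you may not need any additional command before going on to TermDocumentMatrix.

The code below changes the class (see the second-to-last line) and avoids the error: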

    getwd()
    require(tm)
    files <- DirSource(directory="texts/", encoding="latin1")
    corpus <- VCorpus(x=files) # load files, create corpus

    summary(corpus) # get a summary
    corpus <- tm_map(corpus,content_transformer(removePunctuation))
    corpus <- tm_map(corpus,content_transformer(stripWhitespace))
    corpus <- tm_map(corpus,content_transformer(removePunctuation))
    corpus <- Corpus(VectorSource(corpus)) # change class
    matrix_term <- DocumentTermMatrix(corpus)

Renmelcon · Answer

Change this:

    corpus_clean <- tm_map(news_corpus, tolower)

to this:

    corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

gopal · Answer

This should work.

    remove.packages("tm")
    install.packages("http://cran.r-project.org/bin/windows/contrib/3.0/tm_0.5-10.zip",repos=NULL)
    library(tm)
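Before downgrading, it may help to confirm which tm version is installed. The sketch below assumes tm is installed and uses the 0.6.0 threshold mentioned in the accepted answer; `needs_wrapper` is a made-up variable name.

```r
# Compare the installed tm version against 0.6.0, where the behavior reportedly changed
tm_version <- packageVersion("tm")
needs_wrapper <- tm_version >= "0.6.0"

if (needs_wrapper) {
  message("tm ", format(tm_version), ": wrap base functions in content_transformer()")
} else {
  message("tm ", format(tm_version), ": bare functions were reported to work")
}
```

If `needs_wrapper` is TRUE, the content_transformer approach from the other answers avoids pinning an old package version.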