Question

I am trying to clean my text corpus with the tm package in R, but I keep getting this error:

no applicable method for 'removePunctuation' applied to an object of class "data.frame"

My data is a chat log read from a text file, and in R it looks like this:

     V1
1   In the process
2   Sorry I had to step away for a moment.
3   I am getting an error page that says QB is currently unavailable.
4   That link gives me the same error message.

I use:

tdm <- TermDocumentMatrix(text,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

But I get this error:

Error in UseMethod("TermDocumentMatrix", x) : 
  no applicable method for 'TermDocumentMatrix' applied to an object of class "data.frame"

It looks like I am not supposed to pass a data frame to the function, but what else can I do?

Thanks

Answer 1

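As @Martin Bel pointed out, this can also be done with qdap as of version 1.1.0. I have added some support to qdap to make it more compatible with the tm package, including a tdm function, which works here: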

First read in your data (I added an ID column header):

library(qdap)
dat <- read.transcript(text="ID    V1
1   In the process
2   Sorry I had to step away for a moment.
3   I am getting an error page that says QB is currently unavailable.
4   That link gives me the same error message.", header=TRUE, sep="   ")

# Make the term-document matrix:

tdm(dat$V1, id(dat), stopwords=tm::stopwords("en"))

# Do the same thing with the tm package:

TermDocumentMatrix(Corpus(VectorSource(dat$V1)),
    control = list(
        removePunctuation = TRUE,
        stopwords = TRUE
    )
)

Answer 2

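You are very close. The quickest way is to use DataframeSource to build a corpus object and then create a term-document matrix from it. Using your example:

Let's get the data in...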

Text <- readLines(n=4)
In the process
Sorry I had to step away for a moment.
I am getting an error page that says QB is currently unavailable.
That link gives me the same error message.

df <- data.frame(V1 = Text, stringsAsFactors = FALSE)

Now convert the data frame to a term-document matrix...

require(tm)
mycorpus <- Corpus(DataframeSource(df))
tdm <- TermDocumentMatrix(mycorpus, control = list(removePunctuation = TRUE, stopwords = TRUE))

Now inspect the output...

inspect(tdm)
   A term-document matrix (14 terms, 4 documents)

Non-/sparse entries: 15/41
Sparsity           : 73%
Maximal term length: 11 
Weighting          : term frequency (tf)

             Docs
Terms         1 2 3 4
  away        0 1 0 0
  currently   0 0 1 0
  error       0 0 1 1
  getting     0 0 1 0
  gives       0 0 0 1
  link        0 0 0 1
  message     0 0 0 1
  moment      0 1 0 0
  page        0 0 1 0
  process     1 0 0 0
  says        0 0 1 0
  sorry       0 1 0 0
  step        0 1 0 0
  unavailable 0 0 1 0
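
A note for readers on newer tm releases: as far as I know, DataframeSource now expects the first two columns to be named doc_id and text, so a one-column data frame like df may be rejected. Below is a minimal sketch under that assumption, using the same four chat lines. The names chat, docs, corpus2 and tdm2 are made up for illustration.

```r
library(tm)

# Same four chat lines as in the question, typed in directly
chat <- c("In the process",
          "Sorry I had to step away for a moment.",
          "I am getting an error page that says QB is currently unavailable.",
          "That link gives me the same error message.")

# Assumption: newer DataframeSource wants a "doc_id" column followed by "text"
docs <- data.frame(doc_id = as.character(seq_along(chat)),
                   text = chat,
                   stringsAsFactors = FALSE)

corpus2 <- VCorpus(DataframeSource(docs))
tdm2 <- TermDocumentMatrix(corpus2,
                           control = list(removePunctuation = TRUE,
                                          stopwords = TRUE))
dim(tdm2)  # number of terms, number of documents
```

If your installed tm still accepts the one-column version, the original code is fine as it stands.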

Answer 3

You just need to extract the text from the data frame with text[,1]:

tdm <- TermDocumentMatrix(text[,1],
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))
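
If TermDocumentMatrix still complains about the class of text[,1] (for example because the column was read in as a factor, or because your tm version has no method for a bare character vector), one possible workaround is to coerce the column to character and wrap it in a corpus first. This is only a sketch: the text data frame is rebuilt by hand here as a stand-in for the one in the question, and txt and corpus are illustrative names.

```r
library(tm)

# Hypothetical stand-in for the question's data frame, with a factor column
text <- data.frame(V1 = c("In the process",
                          "Sorry I had to step away for a moment.",
                          "I am getting an error page that says QB is currently unavailable.",
                          "That link gives me the same error message."),
                   stringsAsFactors = TRUE)

# Pull out the column as plain character strings
txt <- as.character(text[, 1])

# Each element of the vector becomes one document in the corpus
corpus <- VCorpus(VectorSource(txt))

tdm <- TermDocumentMatrix(corpus,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))
inspect(tdm)
```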

Cleaning text with the tm package in R
