Prevent the rm_stopwords function from creating a list

Asked: 2019-01-26 16:42:17

Tags: r qdap

I used the rm_stopwords function from the qdap package to remove stop words and punctuation from a text column in a data frame.

library(qdap)
library(dplyr)
library(tm)

glimpse(dat_full)
Observations: 500
Variables: 9
$ reviewerID     <chr> "ABF0ARHORHUUC", "AH4KMS2YC6TXA", "A2IXK5LB...
$ asin           <chr> "B00BE6C9S0", "B009X78DKU", "B0077PM3KG", "...
$ reviewerName   <chr> "stuartm \"stuartm\"", "HottMess", "G. Farn...
$ helpful        <list> [<1, 2>, <0, 0>, <0, 0>, <0, 0>, <0, 0>, <...
$ reviewText     <chr> "I've used the Mophie juice pack for my iPh...
$ overall        <dbl> 3, 5, 5, 5, 5, 3, 3, 5, 5, 5, 5, 4, 5, 5, 3...
$ summary        <chr> "Case issues limit utility of this device",...
$ unixReviewTime <int> 1375142400, 1355356800, 1383350400, 1367193...
$ reviewTime     <chr> "07 30, 2013", "12 13, 2012", "11 2, 2013",...

# Work on a copy so the original dat_full stays unchanged
full_dat <- dat_full
full_dat$reviewText <- rm_stopwords(full_dat$reviewText,
                                    tm::stopwords("english"), strip = TRUE)

The function turns the reviewText column into a list.

glimpse(full_dat)
Observations: 500
Variables: 9
$ reviewerID     <chr> "ABF0ARHORHUUC", "AH4KMS2YC6TXA", "A2IXK5LB...
$ asin           <chr> "B00BE6C9S0", "B009X78DKU", "B0077PM3KG", "...
$ reviewerName   <chr> "stuartm \"stuartm\"", "HottMess", "G. Farn...
$ helpful        <list> [<1, 2>, <0, 0>, <0, 0>, <0, 0>, <0, 0>, <...
$ reviewText     <list> [<"used", "mophie", "juice", "pack", "ipho...
$ overall        <dbl> 3, 5, 5, 5, 5, 3, 3, 5, 5, 5, 5, 4, 5, 5, 3...
$ summary        <chr> "Case issues limit utility of this device",...
$ unixReviewTime <int> 1375142400, 1355356800, 1383350400, 1367193...
$ reviewTime     <chr> "07 30, 2013", "12 13, 2012", "11 2, 2013",...

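Any ideas on how to prevent this (keep the original format), or how to unlist/unnest the column and get back to the original format?

The result should look like the original data frame, but without stop words and punctuation.

Here is a small sample (dput output):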

structure(list(reviewerID = "A3LWYDTO7928SH", asin = "B00B0FT2T4", 
    reviewerName = "D. Lang", helpful = list(c(0L, 0L)), reviewText = "When I first put your glass protector on my phone I was blown away!  (I knew how &#34;degrading&#34; the soft plastic covers were - ruining my experience, so I chose not to have a protector on my screen.)  Then I saw your website and I wondered if it was as good as spoken about.  The answer is YES.  The application was flawless even after I pulled the glass back off because I had not put it on absolutely perfectly.  It repositioned with ease and you could not find a bubble if you had a microscope!  Fascinating to see the viscous material on the back spread out on its own!  Application could not be easier and the quality of the product seems like it came from NASA.", 
    overall = 5, summary = "It is as perfect as a product can get - Really!", 
    unixReviewTime = 1396569600L, reviewTime = "04 4, 2014"), row.names = 145945L, class = "data.frame")

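1 answer:

Answer 0 (score: 1)

Something like this in a dplyr pipeline. Combine unlist and paste so that each review is flattened and joined back into a single string.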

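library(purrr)  # provides map_chr(), which the question's setup does not load

full_dat <- dat_full %>%
  mutate(reviewText = map_chr(reviewText, function(x) {
    # rm_stopwords() returns a list of word vectors; flatten it first
    words <- unlist(qdap::rm_stopwords(x, tm::stopwords("english"),
                                       strip = TRUE))
    # Join the remaining words back into one string per review
    paste(words, collapse = " ")
  }))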
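
If the column has already been turned into a list, it can also be collapsed afterwards without purrr. The following is a minimal base-R sketch and not part of the original answer. The `tokens` list and the two-row `df` are made-up stand-ins for the output of rm_stopwords and for full_dat.

```r
# Hypothetical stand-in for what rm_stopwords() returned:
# one character vector of remaining words per row
tokens <- list(
  c("used", "mophie", "juice", "pack"),
  c("glass", "protector", "phone")
)

# Hypothetical two-row data frame with a list column,
# mimicking full_dat$reviewText after the call in the question
df <- data.frame(id = 1:2)
df$reviewText <- tokens

# Collapse each word vector into a single space-separated string.
# vapply() enforces exactly one character value per row.
df$reviewText <- vapply(df$reviewText, paste, character(1), collapse = " ")

str(df$reviewText)
```

The qdap documentation also lists a `separate` argument for rm_stopwords. Setting `separate = FALSE` should keep each element as a single string instead of splitting it into words, which would avoid the list in the first place, but it is worth checking this against the installed version.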