Removing ngrams that contain stopwords with tidytext

Asked: 2019-03-20 15:11:41

Tags: r tidyverse tidytext

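Update: Thanks for your input so far. I have rewritten the question and added a better example that highlights implicit requirements my first example did not cover.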

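Question: I am looking for a general tidy solution for removing ngrams that contain stopwords. In short, an ngram is a string of words separated by spaces. A unigram contains 1 word, a bigram contains 2 words, and so on. My goal is to apply this to a data frame after using unnest_tokens(). The solution should work on a data frame containing a mix of ngrams of any length (uni, bi, tri, ...), or at least bigrams, trigrams and longer.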
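For context, a data frame with mixed ngram lengths can come out of a single unnest_tokens() call. The sketch below is only an illustration: the one-row input `docs` is made up, and it assumes that `n_min` is forwarded by unnest_tokens() to the underlying ngram tokenizer.

```r
library(tibble)
library(tidytext)

# Hypothetical one-row input, invented for this illustration
docs <- tibble(Document = 1, text = "is ground water treatment")

# n = 3 with n_min = 1 asks for unigrams, bigrams and trigrams in one column
mixed_ngrams <- unnest_tokens(docs, ngram, text,
                              token = "ngrams", n = 3, n_min = 1)
mixed_ngrams
```

The resulting `ngram` column has the same shape as the example data in the question: one ngram per row, with lengths mixed together.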

New example data

ngram_df <- tibble::tribble(
  ~Document,                   ~ngram,
          1,                    "the",
          1,              "the basis",
          1,                  "basis",
          1,       "basis of culture",
          1,                "culture",
          1,        "is ground water",
          1,           "ground water",
          1, "ground water treatment"
  )
stopword_df <- tibble::tribble(
  ~word, ~lexicon,
  "the", "custom",
   "of", "custom",
   "is", "custom"
  )
desired_output <- tibble::tribble(
  ~Document,                   ~ngram,
          1,                  "basis",
          1,                "culture",
          1,           "ground water",
          1, "ground water treatment"
  )

Created on 2019-03-21 by the reprex package (v0.2.1)

Desired behavior

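  • ngram_df should be transformed into desired_output, using the stopwords in the word column of stopword_df
  • every row that contains a stopword should be removed
  • word boundaries should be respected (i.e. looking for is should not remove basis)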
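One way to meet these three requirements is a single regular expression that wraps the stopwords in word boundaries. This is a minimal sketch, not a definitive answer: `stop_pattern` and `result` are names chosen here, and it assumes the stopwords contain no regex metacharacters.

```r
library(dplyr)
library(stringr)
library(tibble)

ngram_df <- tribble(
  ~Document, ~ngram,
  1, "the",
  1, "the basis",
  1, "basis",
  1, "basis of culture",
  1, "culture",
  1, "is ground water",
  1, "ground water",
  1, "ground water treatment"
)
stopword_df <- tribble(
  ~word, ~lexicon,
  "the", "custom",
  "of",  "custom",
  "is",  "custom"
)

# Build \b(?:the|of|is)\b so that "is" cannot match inside "basis".
# Assumption: no stopword contains regex metacharacters.
stop_pattern <- str_c("\\b(?:", str_c(stopword_df$word, collapse = "|"), ")\\b")

# Keep only rows whose ngram contains no whole-word stopword
result <- ngram_df %>%
  filter(!str_detect(ngram, stop_pattern))
result
```

Because the pattern is length-agnostic, the same filter applies to unigrams, bigrams and trigrams alike.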


My first attempt was the following reprex:

Example data

library(tidyverse)
library(tidytext)
df <- "Groundwater remediation is the process that is used to treat polluted groundwater by removing the pollutants or converting them into harmless products." %>% 
  enframe() %>% 
  unnest_tokens(ngrams, value, "ngrams", n = 2)
#apply magic here

df
#> # A tibble: 21 x 2
#>     name ngrams                 
#>    <int> <chr>                  
#>  1     1 groundwater remediation
#>  2     1 remediation is         
#>  3     1 is the                 
#>  4     1 the process            
#>  5     1 process that           
#>  6     1 that is                
#>  7     1 is used                
#>  8     1 used to                
#>  9     1 to treat               
#> 10     1 treat polluted         
#> # ... with 11 more rows

Example stopwords

stopwords <- c("is", "the", "that", "to")

Desired output

#> Source: local data frame [9 x 2]
#> Groups: <by row>
#> 
#> # A tibble: 9 x 2
#>    name ngrams                 
#>   <int> <chr>                  
#> 1     1 groundwater remediation
#> 2     1 treat polluted         
#> 3     1 polluted groundwater   
#> 4     1 groundwater by         
#> 5     1 by removing            
#> 6     1 pollutants or          
#> 7     1 or converting          
#> 8     1 them into              
#> 9     1 harmless products

Created on 2019-03-20 by the reprex package (v0.2.1)

(Example sentence from https://en.wikipedia.org/wiki/Groundwater_remediation)

1 answer:

Answer 0 (score: 0)

Here is another approach, using the "stopwords_collapsed" pattern from the previous answer:

swc <- paste(stopwords, collapse = "|") # "is|the|that|to"
# keep rows without a match; note that a plain alternation also matches
# inside longer words (e.g. "to" in "into")
df <- df[!str_detect(df$ngrams, swc), ]

df
# A tibble: 8 x 2
   name ngrams                 
  <int> <chr>                  
1     1 groundwater remediation
2     1 treat polluted         
3     1 polluted groundwater   
4     1 groundwater by         
5     1 by removing            
6     1 pollutants or          
7     1 or converting          
8     1 harmless products 
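The 8-row result above drops "them into", which the question lists as desired output, because a plain alternation also matches inside words. The sketch below contrasts substring matching with exact word matching on a few bigrams from the example sentence. The helper `has_stopword()` is a hypothetical name introduced for illustration, and splitting on whitespace assumes the ngrams are space-separated, as described in the question.

```r
library(stringr)

stopwords <- c("is", "the", "that", "to")
# A few bigrams taken from the example sentence
ngrams <- c("used to", "converting them", "them into", "into harmless")

# Substring matching: "the" hits "them" and "to" hits "into"
substring_hit <- str_detect(ngrams, str_c(stopwords, collapse = "|"))

# Whole-word matching: split each ngram on whitespace, compare words exactly
has_stopword <- function(x, stops) {
  vapply(str_split(x, "\\s+"),
         function(words) any(words %in% stops),
         logical(1))
}
word_hit <- has_stopword(ngrams, stopwords)

data.frame(ngrams, substring_hit, word_hit)
```

Comparing whole words also sidesteps regex escaping, at the cost of splitting every ngram first.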

Here is a simple benchmark comparing the two approaches:

# benchmark; benchmark() comes from the rbenchmark package
library(rbenchmark)

# Assumptions: txt holds the example sentence as a character string, and
# stopwords_collapsed is the pattern from the previous answer (same form as swc)
txtexp <- rep(txt, 1000000)
dfexp <- txtexp %>%
  enframe() %>%
  unnest_tokens(ngrams, value, "ngrams", n = 2)

benchmark(
  "mutate+filter (small text)" = {
    df1 <- df %>%
      mutate(has_stop_word = str_detect(ngrams, stopwords_collapsed)) %>%
      filter(!has_stop_word)
  },
  "[] row selection (small text)" = {
    df2 <- df[!str_detect(df$ngrams, stopwords_collapsed), ]
  },
  "mutate+filter (large text)" = {
    df3 <- dfexp %>%
      mutate(has_stop_word = str_detect(ngrams, stopwords_collapsed)) %>%
      filter(!has_stop_word)
  },
  "[] row selection (large text)" = {
    df4 <- dfexp[!str_detect(dfexp$ngrams, stopwords_collapsed), ]
  },
  replications = 5,
  columns = c("test", "replications", "elapsed")
)

                           test replications elapsed
4 [] row selection (large text)            5   30.03
2 [] row selection (small text)            5    0.00
3    mutate+filter (large text)            5   30.64
1    mutate+filter (small text)            5    0.00