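Replacing words in R

Date: 2018-06-15 11:53:44

Tags: r stringr

I have a data frame of words, each paired with its synonyms. In a different data frame, I have sentences. I want to search each sentence for synonyms from the first data frame and, where one is found, replace it with the word that the synonym belongs to.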

dt = read.table(header = TRUE, 
text ="Word Synonyms
Use 'employ, utilize, exhaust, spend, expend, consume, exercise'
Come    'advance, approach, arrive, near, reach'
Go  'depart, disappear, fade, move, proceed, recede, travel'
Run 'dash, escape, elope, flee, hasten, hurry, race, rush, speed, sprint'
Hurry   'rush, run, speed, race, hasten, urge, accelerate, bustle'
Hide    'conceal, cover, mask, cloak, camouflage, screen, shroud, veil'
", stringsAsFactors= F)


mydf = read.table(header = TRUE, stringsAsFactors = FALSE,
                    text ="sentence
    'I can utilize this file'
    'I can cover these things'
    ")

The desired output looks like:

I can Use this file
I can Hide these things

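The above is only a sample. My real dataset has more than 10,000 sentences.

2 answers:

Answer 0: (score: 2)

Replace the ", " separators in dt$Synonyms with | so that each entry can be used as the pattern argument of gsub. Then, using dt$Synonyms as the pattern, replace every occurrence of any of those words (now separated by |) with the corresponding dt$Word. This can be done with sapply and gsub.

Edited: added a word-boundary check (as part of the pattern passed to gsub), as suggested by the OP.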

gsub
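A minimal sketch of the approach this answer describes, not the answerer's original code. It assumes the dt and mydf data from the question, uses a plain for loop rather than sapply, and the names patterns and result are introduced here only for illustration. perl = TRUE is set so that \\b is treated as a word boundary.

```r
# Sample data from the question
dt <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Word Synonyms
Use 'employ, utilize, exhaust, spend, expend, consume, exercise'
Come 'advance, approach, arrive, near, reach'
Go 'depart, disappear, fade, move, proceed, recede, travel'
Run 'dash, escape, elope, flee, hasten, hurry, race, rush, speed, sprint'
Hurry 'rush, run, speed, race, hasten, urge, accelerate, bustle'
Hide 'conceal, cover, mask, cloak, camouflage, screen, shroud, veil'
")

mydf <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "sentence
'I can utilize this file'
'I can cover these things'
")

# Turn "a, b, c" into the regex "\\b(a|b|c)\\b" (one pattern per row of dt)
patterns <- paste0("\\b(", gsub(",\\s*", "|", dt$Synonyms), ")\\b")

# Apply each pattern in turn; gsub is vectorised over the sentences
result <- mydf$sentence
for (i in seq_along(patterns)) {
  result <- gsub(patterns[i], dt$Word[i], result, perl = TRUE)
}

result
# [1] "I can Use this file"     "I can Hide these things"
```

The patterns are applied one after another and matching is case-sensitive, so a synonym listed under two words (for example "rush" under both Run and Hurry) is replaced by whichever row comes first in dt.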

Answer 1: (score: 1)

Here is a tidyverse solution...

library(stringr)
library(dplyr)
library(tidyr) #provides unnest()

dt2 <- dt %>% 
  mutate(Synonyms=str_split(Synonyms, ",\\s*")) %>% #split into words
  unnest(Synonyms) #this results in a long dataframe of words and synonyms

mydf2 <- mydf %>% 
  mutate(Synonyms=str_split(sentence, "\\s+")) %>% #split into words
  unnest(Synonyms) %>% #expand to long form, one word per row
  left_join(dt2, by = "Synonyms") %>% #match synonyms
  mutate(Word=ifelse(is.na(Word), Synonyms, Word)) %>% #keep unmatched words the same
  group_by(sentence) %>% 
  summarise(sentence2=paste(Word, collapse=" ")) #reconstruct sentences

mydf2

  sentence                 sentence2              
  <chr>                    <chr>                  
1 I can cover these things I can Hide these things
2 I can utilize this file  I can Use this file