How to check whether a string contains a specific word in R

Time: 2016-11-13 20:48:54

Tags: r string

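I have Simpsons data from kaggle.com that includes the title of each episode. I want to count how many times each character's name is used in the titles. I can find exact words in the titles, but when I look for Homer, my code misses words such as Homer's. Is there a way to do this?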

Sample data and my code:

text <- 'title
Homer\'s Night Out
Krusty Gets Busted
Bart Gets an "F"
Two Cars in Every Garage and Three Eyes on Every Fish
Dead Putting Society
Bart the Daredevil
Bart Gets Hit by a Car
Homer vs. Lisa and the 8th Commandment
Oh Brother, Where Art Thou?
Old Money
Lisa\'s Substitute
Blood Feud
Mr. Lisa Goes to Washington
Bart the Murderer
Like Father, Like Clown
Saturdays of Thunder
Burns Verkaufen der Kraftwerk
Radio Bart
Bart the Lover
Separate Vocations
Colonel Homer'

simpsons <- read.csv(text = text, stringsAsFactors = FALSE)

library(stringr)

titlewords <- paste(simpsons$title, collapse = " " )
words <- c('Homer')
titlewords <- gsub("[[:punct:]]", "", titlewords)
HomerCount <- str_count(titlewords, paste(words, collapse=" "))
HomerCount

1 Answer:

Answer 0: (score: 0)

In addition to the excellent suggestions in the comments, you could also use tidytext:

library(tidytext)
library(dplyr)

text <- 'title
Homer\'s Night Out
Krusty Gets Busted
Bart Gets an "F"
Two Cars in Every Garage and Three Eyes on Every Fish
Dead Putting Society
Bart the Daredevil
Bart Gets Hit by a Car
Homer vs. Lisa and the 8th Commandment
Oh Brother, Where Art Thou?
Old Money
Lisa\'s Substitute
Blood Feud
Mr. Lisa Goes to Washington
Bart the Murderer
Like Father, Like Clown
Saturdays of Thunder
Burns Verkaufen der Kraftwerk
Radio Bart
Bart the Lover
Separate Vocations
Colonel Homer'

simpsons <- read.csv(text = text, stringsAsFactors = FALSE)

# Number of homers
simpsons %>%
  unnest_tokens(word, title) %>% 
  summarize(count = sum(grepl("homer", word)))

# Row names of the tokens that contain "homer"
simpsons %>% 
  unnest_tokens(word, title) %>% 
  mutate(lines = rownames(.)) %>% 
  filter(grepl("homer", word))
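
If you would rather avoid extra packages, a word-boundary regex in base R is another option. The following is only a minimal sketch: the `titles` vector is a handful of the titles from the question, and the `characters` list and the `count_name` helper are hypothetical names introduced here for illustration. It counts whole-word matches per title, so "Homer's" is counted for "Homer" while a longer word that merely starts with the name would not be.

```r
# A few of the titles from the question (subset chosen for illustration)
titles <- c("Homer's Night Out",
            "Krusty Gets Busted",
            "Homer vs. Lisa and the 8th Commandment",
            "Lisa's Substitute",
            "Colonel Homer")

# Character names to look for (assumed list; extend as needed)
characters <- c("Homer", "Lisa", "Bart")

# Hypothetical helper: number of whole-word matches of `name` in each element of `x`.
# \\b is a word boundary, so the apostrophe in "Homer's" ends the word "Homer".
count_name <- function(name, x) {
  pattern <- paste0("\\b", name, "\\b")
  hits <- gregexpr(pattern, x, ignore.case = TRUE, perl = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive positions only
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# Matrix with one row per title and one column per character
per_title <- sapply(characters, count_name, x = titles)

# Total mentions of each character across all titles
totals <- colSums(per_title)
totals
```

Here `per_title` keeps the counts separated by title, and `totals` gives the overall count for each name. Whether matching should be case-insensitive is a design choice; drop `ignore.case = TRUE` if you only want exact capitalization.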