Question

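I am trying to write a function that gets the frequency of a specific word in some text, and then use that function to compute the frequency of the chosen word for each row of a data frame.

So far, I have created a function that takes a string and a pattern as input (i.e. str, pattern). Since grep captures all the patterns in the string, I thought length would take care of counting the frequency of the selected pattern.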

word_count = function(str,pattern) {
   string = gsub("[[:punct:]]","",strsplit(str," "))
   x = grep("pattern",string,value=TRUE)
   return(length(x))
 }

The data frame (my_df) looks like this:

id                      description
123  "It is cozy and pretty comfy. I think you will have good time 
     here."
232  "NOT RECOMMENDED whatsover. You will suffer here."
3333 "BEACHES are awesome overhere!! Highly recommended!!"

... and so forth (more than 15,000 observations)

I have actually already converted all of the descriptions to lowercase, so it really looks more like this:

id                      description
123  "it is cozy and pretty comfy. i think you will have good time 
     here."
232  "not recommended whatsover. you will suffer here."
3333 "beaches are awesome overhere!! highly recommended!!"

... and so forth (more than 15,000 observations)

What I actually want the function to do:

word_count(my_df$description[1],recommended)
[1] 0 

word_count(my_df$description[3],highly)
[1] 1

But what it is doing:

word_count(my_df$description[1],recommended)
[1] 2 

word_count(my_df$description[3],highly)
[1] 2

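This is returning the wrong answer. Ultimately I want to apply this function to all rows of the data frame, and I plan to use if to do that. However, when I test it on individual rows, it does not seem to do what I want.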

Answer 1

You can change your function to

word_count = function(str,pattern) {
   sum(grepl(pattern, strsplit(str, " ")[[1]]))
}

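We first split the string on spaces (" "), then use grepl to search for pattern in each word. Since grepl returns TRUE/FALSE values, we can use sum directly to count the number of words in which pattern occurs.

Then, when you try the function, it returns your expected output.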
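To see why this works, the intermediate values can be inspected one step at a time. This is only a minimal sketch using the third example description from the question; the variable names desc, words and hits are illustrative and not part of the answer.

```r
# Third description from the question, already lowercased
desc <- "beaches are awesome overhere!! highly recommended!!"

# strsplit() returns a list with one element per input string,
# so [[1]] extracts the character vector of words
words <- strsplit(desc, " ")[[1]]

# grepl() returns one TRUE/FALSE value per word
hits <- grepl("highly", words)

# TRUE counts as 1 and FALSE as 0 when summed
sum(hits)
```

By contrast, the function in the question passes the list returned by strsplit straight to gsub and searches for the literal string "pattern" (in quotes) rather than the value of the pattern argument. Note also that grepl matches substrings, so a pattern such as "recommend" would also match the word "recommended!!".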

word_count(df$description[1],"recommended")
#[1] 0
word_count(df$description[3],"highly")
#[1] 1
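Since the goal in the question is to cover every row, one possible way to apply the function to the whole column is vapply. The sketch below rebuilds the three example rows; the column name n_recommended is an assumption made for illustration.

```r
# Function from the answer above
word_count <- function(str, pattern) {
  sum(grepl(pattern, strsplit(str, " ")[[1]]))
}

# The three lowercased example rows from the question
my_df <- data.frame(
  id = c(123, 232, 3333),
  description = c(
    "it is cozy and pretty comfy. i think you will have good time here.",
    "not recommended whatsover. you will suffer here.",
    "beaches are awesome overhere!! highly recommended!!"
  ),
  stringsAsFactors = FALSE
)

# Call word_count once per description; integer(1) declares that
# each call is expected to return a single integer
my_df$n_recommended <- vapply(
  my_df$description, word_count, integer(1),
  pattern = "recommended", USE.NAMES = FALSE
)

my_df$n_recommended
```

vapply is used here instead of an explicit loop with if because it checks that every call returns exactly one integer, which can help catch unexpected input early.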

Note, however, that stringr has a str_count function, which directly gives you the number of occurrences for each row

stringr::str_count(df$description, "recommended")
#[1] 0 1 1
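One caveat is that counting pattern matches is not quite the same as counting whole words: "here" also occurs inside "overhere". The sketch below illustrates the difference in base R, in case stringr is not installed. count_matches is a hypothetical helper written for this example, not a function from base R or stringr, and the \\b word-boundary anchors are one possible way to restrict matches to whole words.

```r
# Hypothetical helper: number of regex matches in each element of x
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

# The three lowercased example descriptions from the question
descriptions <- c(
  "it is cozy and pretty comfy. i think you will have good time here.",
  "not recommended whatsover. you will suffer here.",
  "beaches are awesome overhere!! highly recommended!!"
)

# Plain substring matching also counts the "here" inside "overhere"
count_matches(descriptions, "here")

# Word boundaries (\\b) keep only occurrences of "here" as a separate word
count_matches(descriptions, "\\bhere\\b")
```

The same word-boundary pattern should also be usable with stringr::str_count, although that is not tested here.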