Question

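I need to write a function that finds the most common word in a text string, where I define a "word" as any sequence of words.

It should return the most common word.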

Answer 1

For general purposes, it is better to use boundary("word") from stringr:

library(stringr)

most_common_word <- function(s) {
    # Split on word boundaries, count each word, and return the most frequent one
    words <- str_split(s, boundary("word"))[[1]]
    names(which.max(table(words)))
}

sentence <- "This is a very short sentence. It has only a few words: a, a. a"
most_common_word(sentence)

Answer 2

Hope this helps:

most_common_word <- function(x) {

  # Split the string into single words for counting
  splitTest <- strsplit(x, " ")[[1]]

  # Count the words
  count <- table(splitTest)

  # Sort in decreasing order and keep only the highest count, which is the first one
  count <- count[order(count, decreasing = TRUE)][1]

  # Return the word itself.
  # Change this to return(count) to also show how many times the word repeats
  return(names(count))
}

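You can use return(count) to show the word together with the number of times it occurs. Be careful: this function has a problem when two words occur the same number of times.

With decreasing=TRUE, the order function puts the highest count first. Ties are then decided by the names, which table has already sorted alphabetically. If 'a' and 'b' occur the same number of times, most_common_word only shows 'a'.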
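
One way around the tie problem described above is to return every word that shares the highest count. The following is only a minimal base-R sketch; the name most_common_words is hypothetical, and it assumes the same split-on-space definition of a word as this answer.

```r
# Hypothetical helper (not part of the original answer):
# return every word tied for the highest count.
most_common_words <- function(x) {
  # Split on single spaces, as in the answer above
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  # Drop empty strings produced by repeated spaces
  words <- words[nzchar(words)]
  if (length(words) == 0) return(character(0))

  counts <- table(words)
  # Keep all names whose count equals the maximum
  names(counts)[as.vector(counts) == max(counts)]
}

most_common_words("b a b a c")
```

Here 'a' and 'b' each occur twice, so both are returned instead of only the alphabetically first one.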

Answer 3

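Here is a function I designed. Note that I split the string on spaces, removed any leading or trailing whitespace, removed the "." character, and converted all uppercase letters to lowercase. Finally, if there is a tie, I always report the first word. These are assumptions you should consider for your own analysis.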

# Create example string
string <- "This is a very short sentence. It has only a few words."

library(stringr)

most_common_word <- function(string){
  string1 <- str_split(string, pattern = " ")[[1]] # Split the string
  string2 <- str_trim(string1) # Remove white space
  string3 <- str_replace_all(string2, fixed("."), "") # Remove dot
  string4 <- tolower(string3) # Convert to lower case
  word_count <- table(string4) # Count the word number
  return(names(word_count[which.max(word_count)][1])) # Report the most common word
}

most_common_word(string)
[1] "a"
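
If your text contains punctuation other than ".", the assumptions above may need widening, because a token such as "a," would be counted separately from "a". As a rough base-R sketch (the name most_common_word_clean is hypothetical, and stripping every punctuation character is itself an assumption that may not suit your data), one could write:

```r
# Hypothetical variant: strip all punctuation and split on any run of whitespace.
most_common_word_clean <- function(x) {
  x <- tolower(x)                              # convert to lower case
  x <- gsub("[[:punct:]]+", "", x)             # remove every punctuation character
  words <- strsplit(trimws(x), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]                # drop empty tokens
  if (length(words) == 0) return(NA_character_)

  counts <- table(words)
  # On a tie, which.max keeps the first (alphabetically earliest) word
  names(counts)[which.max(counts)]
}

most_common_word_clean("The cat, the dog; THE bird.")
```

Note that this also removes apostrophes, so "don't" becomes "dont".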

Answer 4

Using the tidytext package, which takes advantage of established parsing functions:

library(tidytext)
library(dplyr)

word_count <- function(test_sentence) {
  # Tokenize the sentence into one word per row, then count each word
  data.frame(sentence = test_sentence, stringsAsFactors = FALSE) %>%
    unnest_tokens(word, sentence) %>%
    count(word, sort = TRUE)
}

word_count("This is a very short sentence. It has only a few words.")

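This gives you a table of all the word counts. You can adapt the function to return only the top word, but be aware that there will sometimes be ties for first place, so perhaps it should be flexible enough to extract multiple winners.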
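
A possible adaptation that extracts multiple winners is sketched below. The name top_words is hypothetical; the sketch assumes the default column name n produced by count() and simply keeps every row whose count equals the maximum.

```r
library(tidytext)
library(dplyr)

# Hypothetical helper: keep every word tied for the highest count.
top_words <- function(test_sentence) {
  data.frame(sentence = test_sentence, stringsAsFactors = FALSE) %>%
    unnest_tokens(word, sentence) %>%   # one lower-cased word per row
    count(word) %>%                     # adds a count column named n
    filter(n == max(n))                 # keep all rows tied for first place
}

top_words("one two two three three")
```

With this input both "two" and "three" are kept, each with a count of 2.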

Write a function to find the most common word in a text string using R