Write a function to find the most common word in a text string using R

Time: 2017-10-12 13:47:48

Tags: r string

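I need to write a function that finds the most common word in a text string, where I define a "word" as any sequence of characters.

It should return the most frequently used word.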

4 answers:

Answer 0: (score: 4)

For general purposes, it is better to use boundary("word") from stringr:

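library(stringr)
most_common_word <- function(s){
    # Split on word boundaries, then count how often each word occurs
    words <- str_split(s, boundary("word"))[[1]]
    # Return the word itself rather than its position in the table
    names(which.max(table(words)))
}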
sentence <- "This is a very short sentence. It has only a few words: a, a. a"
most_common_word(sentence)

Answer 1: (score: 2)

Hope this helps:

   most_common_word=function(x){

      #Split the string into single words for counting
      splitTest=strsplit(x," ")[[1]]

      #Counting words
      count=table(splitTest)

      #Sorting to select only the highest value, which is the first one
      count=count[order(count, decreasing=TRUE)][1]

      #Return the desired word.
      #Return count instead to also show how many times the word repeats
      return(names(count))
      }

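You can use return(count) to show the word together with the number of times it repeats. This function has a problem when two words are repeated the same number of times, so be careful.

The order function puts the highest value first (when used with decreasing=TRUE); ties then depend on the names, which table sorts alphabetically. If 'a' and 'b' are repeated the same number of times, the most_common_word function will only show 'a'.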
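
The tie behavior described above can be checked with a small base R sketch. The name first_of_ties is hypothetical and only used for this illustration; the body follows the same split, count, and sort steps as the answer.

```r
# Hypothetical helper for illustration: same steps as the answer's function.
first_of_ties <- function(x) {
  words <- strsplit(x, " ")[[1]]                    # split on single spaces
  count <- table(words)                             # table() sorts names alphabetically
  count <- count[order(count, decreasing = TRUE)]   # ties keep their original order
  names(count)[1]                                   # first entry wins
}

first_of_ties("b a b a")   # "a" and "b" both appear twice
first_of_ties("b b a")     # "b" appears more often than "a"
```

In the first call the two words are tied, so the alphabetically first name is the one reported even though "b" comes first in the input.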

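Answer 2: (score: 2)

Here is a function I designed. Note that I split the string on spaces, removed any leading or trailing whitespace, removed ".", and converted all uppercase letters to lowercase. Finally, if there is a tie, I always report the first word. These are assumptions you should consider for your own analysis.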

# Create example string
string <- "This is a very short sentence. It has only a few words."

library(stringr)

most_common_word <- function(string){
  string1 <- str_split(string, pattern = " ")[[1]] # Split the string
  string2 <- str_trim(string1) # Remove white space
  string3 <- str_replace_all(string2, fixed("."), "") # Remove dot
  string4 <- tolower(string3) # Convert to lower case
  word_count <- table(string4) # Count the word number
  return(names(word_count[which.max(word_count)][1])) # Report the most common word
}

most_common_word(string)
[1] "a"
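
If the text contains punctuation other than ".", one possible variation is to strip every punctuation character with a POSIX character class. This is only a sketch in base R, not part of the answer above; the name most_common_word_punct is hypothetical, and removing all punctuation is an assumption that also turns a word like "don't" into "dont".

```r
# Hypothetical variant for illustration: strips all punctuation, not just ".".
most_common_word_punct <- function(string) {
  words <- strsplit(string, "[[:space:]]+")[[1]]  # split on runs of whitespace
  words <- gsub("[[:punct:]]", "", words)         # remove every punctuation character
  words <- tolower(words)                         # ignore case when counting
  words <- words[words != ""]                     # drop tokens that became empty
  word_count <- table(words)                      # count each word
  names(word_count)[which.max(word_count)]        # first maximum wins on ties
}

most_common_word_punct("Stop! Stop, please. Please stop: now")
```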

Answer 3: (score: 1)

Using the tidytext package, which takes advantage of established parsing functions:

library(tidytext)
library(dplyr)
word_count <- function(test_sentence) {
  # Tokenize the sentence into one word per row, then count each word
  data.frame(sentence = test_sentence, stringsAsFactors = FALSE) %>%
    unnest_tokens(word, sentence) %>%
    count(word, sort = TRUE)
}

word_count("This is a very short sentence. It has only a few words.")

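This gives you a table of all the word counts. You can adapt the function to return only the top word, but note that there will sometimes be ties for first place, so it should perhaps be flexible enough to extract multiple winners.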
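
One way to extract multiple winners is to keep every row whose count equals the maximum. The sketch below uses base R and assumes the table has a word column and an n column, as count() produces; the helper name top_words and the hand-built counts data frame are hypothetical stand-ins so the example runs without tidytext.

```r
# Hypothetical stand-in for the output of word_count(): columns `word` and `n`.
counts <- data.frame(
  word = c("a", "is", "this", "short"),
  n = c(2L, 2L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep all rows that share the highest count.
top_words <- function(counts) {
  counts[counts$n == max(counts$n), , drop = FALSE]
}

top_words(counts)$word   # both tied words are returned
```

Returning a data frame rather than a single string keeps the counts alongside the words, so the caller can decide how to handle a tie.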