Ranking tf-idf scores in an array in R

Asked: 2017-12-21 10:12:26

Tags: r tf-idf

I wrote the following functions to determine the tf-idf score of a document:

Determining tf

tf <- function(specific_word, text){
  count <- 0
  words <- unlist(strsplit(text, " "))

  for(word in words){
    if(word == specific_word){
      count <- count + 1
    }
  }
  hit_rate <- count/length(words)
  return(hit_rate)
}
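As a quick sanity check of tf, using a made-up four-token string rather than the documents below:

```r
# "a" occurs twice among the four whitespace-separated tokens, so tf = 2/4
tf("a", "a b a c")  # → 0.5
```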

Determining the idf value

idf <- function(specific_word, texts){

  times_a_word_appears <- 0
  total_number_of_documents <- length(texts)

  for(document in texts){
    words <- unlist(strsplit(document, " "))

    # count each document at most once (document frequency)
    for(word in words){
      if(word == specific_word){
        times_a_word_appears <- times_a_word_appears + 1
        break
      }
    }
  }
  # add 1 so the denominator is never zero for unseen words
  times_a_word_appears <- times_a_word_appears + 1

  idf <- log(total_number_of_documents / times_a_word_appears)
  return(idf)
}
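For example, with two toy documents where the word occurs in exactly one of them, the document frequency is 1, plus the smoothing 1, giving log(2/2):

```r
# "a" appears in one of two documents: idf = log(2 / (1 + 1)) = 0
idf("a", c("a b", "c d"))  # → 0
```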

Finally, determining tf-idf

tfidf <- function(specific_word, text, texts){

  x <- tf(specific_word, text)
  y <- idf(specific_word, texts)
  z <- x * y

  print(paste0("The tf-idf value is: ", z))
  return(z)  # return the score so it can be stored, not just printed
}

I can now use this to determine the tf-idf values for these documents:

document1 = c("films is a 2000 made-for-TV horror movie directed by Richard Clabaugh. The film features several cult favorite actors, including William Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien, Jenny McCarthy, Keith Coogan, Robert Englund (best known for his role as Freddy Krueger in the
A Nightmare on Elm Street series of films), Dana Barron, David Bowe, and Sean Whalen. The film concerns a genetically engineered snake, a python, that escapes and unleashes itself on a small town. It includes the classic final girl scenario evident in films like Friday the 13th. It was filmed in Los Angeles,
California and Malibu, California. Python was followed by two sequels: Python II (2002) and Boa vs. Python (2004), both also made-for-TV films")

document2 = c("Python, from the Greek word, is a genus of nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are recognised.[2] A member of this genus, P. reticulatus, is among the longest snakes known.")

document3 = c("The Colt Python is a .357 Magnum caliber revolver formerly manufactured by Colt's Manufacturing Company of Hartford, Connecticut. It is sometimes referred to as a Combat Magnum. It was first introduced in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued
Colt Python targeted the premium revolver market segment. Some firearm collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy Thompson, Renee Smeets and Martin Dougherty have described the Python as the finest production revolver ever made")

texts = c(document1, document2, document3)

Find the tf-idf value for "films" in document1:
word = "films"
relevant_text = document1
tfidf(word, relevant_text, texts) 

However, what I now want is to loop over all the words in all the documents, in order to determine the highest-scoring words in each document.

So, for document1, something like:

words = unique(unlist(strsplit(document1, " ")))

for(word in words){
  tfidf(word, document1, texts)
  }

But the values should be stored in an array and ranked. In Python it would be something like this:

scores = {word: tfidf(word, document1, texts) for word in set(document1.split())}
sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
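For reference, a base-R equivalent of that Python snippet could look like the sketch below, assuming `tfidf` is adjusted to return its numeric score rather than only printing it:

```r
words <- unique(unlist(strsplit(document1, " ")))

# named numeric vector of scores, one entry per distinct word
scores <- sapply(words, function(w) tfidf(w, document1, texts))

# rank from highest to lowest tf-idf
sorted_words <- sort(scores, decreasing = TRUE)
head(sorted_words)
```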

Any ideas on how to do this most efficiently in R?

1 Answer:

Answer 0 (score: 0)

I suggest you have a look at the packages listed on the CRAN Task View: Natural Language Processing. Several of them cover the entire process of creating a document-term matrix, including normalization and tf-idf weighting. Their vignettes also demonstrate many downstream tasks such as regression models, clustering, classification, topic modeling, and so on.

Below I used one of those packages, text2vec, to solve the task of creating a tf-idf-weighted document-term matrix.

I hope that helps.
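A minimal sketch of such a text2vec workflow, following the package's quick-start vignette and assuming the `texts` vector defined in the question (this is illustrative, not the answerer's original code):

```r
library(text2vec)

# tokenize the three documents defined above
it <- itoken(texts, preprocessor = tolower, tokenizer = word_tokenizer)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)

# raw document-term matrix, then tf-idf weighting
dtm <- create_dtm(it, vectorizer)
tfidf_model <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf_model)

# top-scoring words for document1 (row 1)
head(sort(dtm_tfidf[1, ], decreasing = TRUE), 10)
```

With the matrix in hand, ranking per document is just a `sort` over the corresponding row, which replaces the hand-rolled word loop entirely.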