I have many keywords that need to be compared against a much larger set of documents, counting the occurrences.
Since the computation takes hours, I decided to try parallel processing. On this forum I found the mclapply function from the parallel package, which seems helpful.
As an R novice I cannot get the code to work (see the short version below). More specifically, I get this error:
"Error in get(as.character(FUN), mode = "function", envir = envir) : object 'FUN' of mode 'function' was not found"
rm(list = ls())
library(stringr)  # str_count() is used below, so load stringr first

df <- c("honda civic 1988 with new lights", "toyota auris 4x4 140000 km", "nissan skyline 2.0 159000 km")
keywords <- c("honda", "civic", "toyota", "auris", "nissan", "skyline", "1988", "1400", "159")
countstrings <- function(x){str_count(x, paste(sprintf("\\b%s\\b", keywords), collapse = '|'))}
# Normal way with one processor
number_of_keywords <- countstrings(df)
# Result: [1] 3 2 2
# Attempt at parallel processing
library(stringr)
library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
number_of_keywords <- mclapply(cl, countstrings(df))
stopCluster(cl)
#Error in get(as.character(FUN), mode = "function", envir = envir) :
#object 'FUN' of mode 'function' was not found
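The error arises because mclapply's first argument must be the data and its second a function; a cluster object is never passed to mclapply (it forks the current R session itself, which is why mc.cores > 1 is unsupported on Windows). A minimal sketch of two working variants, assuming the df, keywords, and countstrings objects defined above:

```r
library(stringr)
library(parallel)

# Variant 1: mclapply forks worker processes itself; no cluster object.
# (Use mc.cores = 1 on Windows, where forking is unavailable.)
result_list <- mclapply(df, countstrings, mc.cores = detectCores() - 1)

# Variant 2: a PSOCK cluster works on all platforms, but its workers
# start empty, so export the globals and packages the function needs.
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, "keywords")
clusterEvalQ(cl, library(stringr))
result_vec <- parSapply(cl, df, countstrings, USE.NAMES = FALSE)
stopCluster(cl)
```

Note that the function is passed as an object (countstrings), not called (countstrings(df)); passing the call's result is what produced the "object 'FUN' of mode 'function' was not found" error.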
Any help is appreciated!
Answer (score: 1)
Here is another approach to parallel processing, using parSapply (it returns a vector instead of a list). This function should also be faster:
# function to count whole-word keyword matches
count_strings <- function(x, words)
{
  sum(unlist(strsplit(x, ' ')) %in% words)
}

library(parallel)

mcluster <- makeCluster(detectCores())  # using all cores
number_of_keywords <- parSapply(mcluster, df, count_strings, keywords, USE.NAMES = FALSE)
stopCluster(mcluster)

number_of_keywords
# [1] 3 2 2
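For reference, the two counting strategies agree on this data: count_strings does exact token matching after splitting on spaces, while the question's regex uses \b word boundaries, so a keyword like "159" matches neither way inside "159000". A quick sanity check, assuming the df, keywords, countstrings, and count_strings objects above:

```r
library(stringr)

# Both approaches should yield the same counts for this input.
identical(
  countstrings(df),
  sapply(df, count_strings, keywords, USE.NAMES = FALSE)
)
```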