I am working with the following data.frame and dictionaries of pos/negWords:
sent <- data.frame(words = c("just right size and i love this notebook", "benefits great laptop",
"wouldnt bad notebook", "very good quality", "orgtop",
"great improvement", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7),
stringsAsFactors=F)
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
"extra","benefit","top","extraordinarily","extraordinary","super","benefits super")
negWords <- c("hate","bad","not good","horrible")
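The function below matches the words of each sentence against the pos/negWords in the dictionary and computes a sentiment value from the matches. Note that it only matches whole words exactly.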
# descending order for words length (prepare data for function below)
wordsDF <- data.frame(words = posWords, value = 1,stringsAsFactors=F)
wordsDF <- rbind(wordsDF,data.frame(words = negWords, value = -1))
wordsDF$lengths <- unlist(lapply(wordsDF$words, nchar))
wordsDF <- wordsDF[order(-wordsDF[,3]),]
rownames(wordsDF) <- NULL
scoreSentence <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    match <- paste("\\<",wordsDF[x,1],'\\>', sep="") # match the entry as whole words only
    count <- length(grep(match,sentence)) # 1 if the sentence contains the entry, otherwise 0
    if(count){
      score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
      sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), '', sentence) # remove words which were matched
    }
  }
  score
}
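To make "exact match" concrete, here is a small sketch of how the word-boundary anchors used in the function behave. The sample strings are made up for illustration only; the last line also shows why the count for a single sentence is at most 1.

# Illustration only: "\\<" and "\\>" match the empty string at the start and end of a word
pattern <- paste0("\\<", "top", "\\>")
grepl(pattern, "orgtop")               # FALSE: "top" is only part of a longer word
grepl(pattern, "a top notebook")       # TRUE: "top" stands alone
# grep() returns the indices of matching elements, not the number of occurrences,
# so for a single sentence its length is 0 or 1
length(grep(pattern, "top top top"))   # 1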
The desired output is produced by calling:
SentimentScore <- unlist(lapply(sent$words, scoreSentence))
bbb <- cbind(sent, SentimentScore)
This gives the desired output:
                                       words user SentimentScore
1   just right size and i love this notebook    1              2
2                      benefits great laptop    2              2
3                       wouldnt bad notebook    3             -1
4                          very good quality    4              1
5                                     orgtop    5              0
6                          great improvement    6              1
7 notebook is not good but i love batterytop    7              0
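This is done with a loop, but I have 7,000 pos/negWords and 200,000 sentences, so it takes forever...
Do you have a better solution for this task? The main thing is to get the same results in SentimentScore :-)
I would be very grateful for any suggestion or solution. Thanks a lot in advance.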
Answer 0 (score: 0)
First, you should run over sub-elements of the data.frame, because resizing during lapply can produce a huge overhead:
ptm = proc.time(); f=lapply(1:100000, function(X){X}); print(proc.time()-ptm)
user system elapsed
0.056 0.004 0.061
ptm = proc.time(); f=lapply(1:1000000, function(X){X}); print(proc.time()-ptm)
user system elapsed
1.112 0.004 1.119
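Here a factor of 10 in sequence length produces a factor of roughly 18 to 20 in computing time (0.061 s versus 1.119 s elapsed). So work on small lists and then concatenate them into one large list.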
Declaring a large data frame takes little time compared with growing it, so declare it first and then fill it with your sub-lists:
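# Preallocate the result data frame, then fill the score column chunk by chunk
bbb = data.frame( words=sent[1], user=sent[2], scoreSentence=rep(0, nrow(sent)) )
MAX_SIZE = 10000   # number of sentences per chunk
for ( ii in 0:(ceiling(nrow(sent)/MAX_SIZE)-1) ) {
  # row indices of chunk ii; the last chunk may be shorter
  selected_rows = (1 + ii * MAX_SIZE):min( (ii+1)*MAX_SIZE, nrow(sent) )
  bbb[selected_rows, "scoreSentence"] = unlist(lapply(sent$words[selected_rows], scoreSentence))
}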
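MAX_SIZE must be large enough, because a for loop is slower than lapply (you want as few iterations of the for loop as possible), but not so large that the list-expansion overhead slows the program down.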
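As a side note, the chunk row indices can also be built without an explicit loop. The following is only a sketch with small made-up sizes (n and MAX_SIZE here are illustrative values, not the real ones); it uses base R's split() to group consecutive indices into chunks of at most MAX_SIZE.

# Illustrative sizes only; the real data would use nrow(sent) and a larger MAX_SIZE
n <- 25
MAX_SIZE <- 10

# Assign each index a chunk number 1, 2, 3, ... and let split() group them
idx <- seq_len(n)
chunks <- split(idx, ceiling(idx / MAX_SIZE))

# Two full chunks of 10 and a final chunk of 5
chunk_sizes <- unname(lengths(chunks))
print(chunk_sizes)

Each element of chunks is then a vector of row indices that can be passed to lapply in the same way as selected_rows.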
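Alternative: parallelization
Parallelization is a good way to finish a set of heavy computations faster by running them on different cores. In your case, each computation is made heavy enough by sending a large chunk of sentences at once.
Using mclapply from the parallel package, you send each chunk to a separate forked worker process, and each one finishes quickly because the chunk is not too large. This requires a wrapper around scoreSentence that handles a vector of rows: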
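library(parallel)   # provides mclapply

# Wrapper that scores a whole chunk of rows; it must be defined before mclapply is called
scoreSentenceWrapper <- function(selected_rows) {
  return(unlist(lapply(sent$words[selected_rows], scoreSentence)))
}

bbb = data.frame( words=sent[1], user=sent[2], scoreSentence=rep(0, nrow(sent)) )
MAX_SIZE = 10000
# Preallocate the list of chunks, then store the row indices of each chunk
mc_list = list()
mc_list[[ceiling(nrow(sent)/MAX_SIZE)]] = 0
for ( ii in 0:(ceiling(nrow(sent)/MAX_SIZE)-1) ) {
  mc_list[[ii+1]] = (1 + ii * MAX_SIZE):min( (ii+1)*MAX_SIZE, nrow(sent) )
}
# Score the chunks in parallel and flatten the results in chunk order
bbb[,"scoreSentence"] = unlist(mclapply(mc_list, scoreSentenceWrapper))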
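Before running this on the full data, it may be worth checking on a small sample that the parallel result equals the sequential one. Below is a minimal sketch of such a check. It uses a hypothetical stand-in scorer, toyScore, and made-up sentences so that it runs on its own; in practice you would plug in scoreSentence and sent$words. Since mclapply relies on forking, which is not available on Windows, the sketch falls back to a single core there.

library(parallel)

# Hypothetical stand-in for scoreSentence: simply counts characters
toyScore <- function(sentence) nchar(sentence)

sentences <- c("very good quality", "orgtop", "great improvement", "benefits great laptop")

# Chunks of two sentences each (tiny on purpose)
chunks <- split(seq_along(sentences), ceiling(seq_along(sentences) / 2))

# Forking is unavailable on Windows, so use one core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Parallel version: one call per chunk, results flattened in chunk order
par_scores <- unlist(
  mclapply(chunks,
           function(rows) vapply(sentences[rows], toyScore, integer(1)),
           mc.cores = n_cores),
  use.names = FALSE
)

# Sequential reference
seq_scores <- vapply(sentences, toyScore, integer(1), USE.NAMES = FALSE)

# Stops with an error if the two results differ
stopifnot(identical(par_scores, seq_scores))

The mc.cores argument sets how many worker processes are used; when it is not given, mclapply takes its default from the mc.cores option.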