Counting word frequency after a specific word

Time: 2016-09-12 12:18:43

Tags: r text

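I have a large number of tweets stored as text.

I want to know how often each word occurs directly after a specific word. For example, given these tweets, I want the frequency of the words that follow "love":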

My love is... 
My love is...
the love was...
the love were...

The result should look like this:

word    next word  frequency

Love    is         2
Love    was        1
Love    were       1  

Or, for all words:

word    next word  frequency

My      Love       2
the     love       2
Love    is         2
Love    was        1
Love    were       1

2 answers:

Answer 0: (score: 2)

The following procedure may help.

Step 1 (optional): Create some example data

example <- c("my love is","my love is","banana","apple","the love was","the love were")

This vector looks like

"my love is"    "my love is"    "banana"        "apple"         "the love was"  "the love were"

Step 2: Get all vector entries that contain the word "love"

ex2 <- example[grep("love",example)]

which gives you

"my love is"    "my love is"    "the love was"  "the love were"

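Step 3: Build a table of the words that appear after "love". The pattern also consumes the whitespace following "love", so the table names carry no leading space.

ex3 <- table(gsub(".*love\\s*","",ex2))

which gives you

  is  was were 
   2    1    1 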
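
The steps above keep everything that follows "love", which equals the next word only because each sample entry ends one word later. A minimal sketch of a variant that picks out just the next word and returns the word / next word / frequency layout from the question is shown here. It assumes whitespace-separated words and that punctuation can be discarded; the names words, next_words, counts and result are illustrative only.

```r
# Sample data, same as in Step 1
example <- c("my love is", "my love is", "banana", "apple",
             "the love was", "the love were")

# Drop punctuation, lower-case, and split each entry into words
words <- strsplit(tolower(gsub("[[:punct:]]", "", example)), "\\s+")

# For each entry, collect the word directly following each "love"
next_words <- unlist(lapply(words, function(w) {
  idx <- which(w == "love")
  idx <- idx[idx < length(w)]  # ignore a "love" that ends the entry
  w[idx + 1]
}))

# Tabulate and shape the counts as word / next word / frequency
counts <- table(next_words)
result <- data.frame(word = "love",
                     next_word = names(counts),
                     frequency = as.integer(counts),
                     stringsAsFactors = FALSE)
result
```

Comparing whole words with == also avoids matching entries such as "lovely" or "glove", which grep("love", ...) would pick up.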

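Answer 1: (score: 2)

When you are dealing with combinations of several words (first word × second word), I don't see any way to avoid using a loop. The code below should do what you want: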

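phrase <- c("My love is... ", "My love is...", "A love was...", "the dogs were...")
# Split each phrase into words; this assumes every phrase has the same number of words
SPLIT <- matrix(unlist(strsplit(phrase, " ")), nrow = length(phrase), byrow = TRUE)
# All unique (first word, second word) combinations, with a placeholder frequency column
vect <- cbind(unique(expand.grid(SPLIT[, 1], SPLIT[, 2])), freq = NA)
to.find <- paste(vect[, 1], vect[, 2], sep = " ")
# Count the phrases that contain each two-word combination as a literal string
for (i in seq_along(to.find)) {
  vect[i, 3] <- length(grep(to.find[i], phrase, fixed = TRUE))
}
# Keep only the combinations that actually occur
vect <- subset(vect, freq > 0)
vect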

vect
    Var1 Var2 freq
 1    My love    2
 3     A love    1
 16  the dogs    1
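
For the "all words" case in the question, counting every pair of consecutive words may be enough, and that can be done without looping over candidate combinations. The following is a rough sketch, not part of the answer above: it uses the four tweets from the question, assumes punctuation can be dropped and case ignored, and the names tokens, pairs and bigrams are illustrative.

```r
# The four tweets from the question
phrase <- c("My love is... ", "My love is...", "the love was...", "the love were...")

# Strip punctuation, trim, lower-case, and split into words
tokens <- strsplit(trimws(tolower(gsub("[[:punct:]]", "", phrase))), "\\s+")

# Pair every word with its successor within the same phrase
pairs <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)  # no pair in a one-word phrase
  data.frame(word = head(w, -1), next_word = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Count identical pairs by summing a column of ones
pairs$frequency <- 1L
bigrams <- aggregate(frequency ~ word + next_word, data = pairs, FUN = sum)
bigrams[order(-bigrams$frequency), ]
```

Unlike the matrix-based approach, this does not require every phrase to have the same number of words, and it counts pairs at any position rather than only the first two words.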