Question

Can someone help me with this? Thanks a lot!

I have some data like this:

    A             B
fruit     red apple
fruit   green apple
fruit  yellow apple
fruit          kiwi
fruit   golden kiwi
juice   apple juice
juice  orange juice

I would like to get the following:

    A             B         freq
fruit         apple            3
fruit          kiwi            2
juice         apple            1
juice        orange            1

I can provide a vector of strings to search for in B (i.e. I know that I want to look for "apple", "kiwi" and "orange"). If, for example, there is a "banana" in "fruit" and I don't have "banana" in my list of items to search for, just show "banana" in the result with freq 1.

Answer 1

Use table to count the number of observations with a particular value:

library(stringr)
table(paste(df$A, str_extract(df$B, paste(lookingfor, collapse="|")), sep="."))
#  fruit.apple   fruit.kiwi  juice.apple juice.orange 
#            3            2            1            1

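Here, paste(lookingfor, collapse="|") builds a regular expression that matches any of your words, str_extract pulls out the word that matched, the outer paste joins the A variable to the extracted value (separated by .), and table counts how often each pairing occurs.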
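
The question also asks that items not in the search list still appear under their own name. A minimal base R sketch of that fallback, assuming the data frame is called df and the search vector lookingfor as in the call above (hit, key and res are names made up for illustration), could look like this:

```r
# Example data from the question (assumed to be stored in df)
df <- data.frame(
  A = rep(c("fruit", "juice"), c(5, 2)),
  B = c("red apple", "green apple", "yellow apple", "kiwi",
        "golden kiwi", "apple juice", "orange juice"),
  stringsAsFactors = FALSE
)
lookingfor <- c("apple", "kiwi", "orange")

# Build one alternation pattern from the search vector
pattern <- paste(lookingfor, collapse = "|")

# regexpr gives the position of the first match, or -1 when nothing matches
m <- regexpr(pattern, df$B)
hit <- rep(NA_character_, nrow(df))
hit[m > 0] <- regmatches(df$B, m)  # regmatches returns only the matched pieces

# Rows without a match keep their original B value (the "banana" case)
key <- ifelse(is.na(hit), df$B, hit)

# Cross-tabulate, reshape to long form, and drop empty combinations
res <- as.data.frame(table(A = df$A, B = key),
                     responseName = "freq", stringsAsFactors = FALSE)
res <- res[res$freq > 0, ]
res
```

Because unmatched rows are counted under their full B string, a row such as fruit / banana would show up as its own group with freq 1.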

Answer 2

Assuming your data frame is named df:

library(dplyr)

df %>%
  mutate(categ = sapply(regmatches(B, regexec("apple|kiwi|orange",B)),'[',1)) %>%
  group_by(A,categ) %>%
  mutate(freq = n()) %>%
  select(A,B=categ,freq) %>%
  summarize(freq = first(freq))

This returns:

      A      B freq
1 fruit  apple    3
2 fruit   kiwi    2
3 juice  apple    1
4 juice orange    1

Answer 3

Something like this might work for you. It relies on the strings in the vector you provide exactly matching whole words in the original data.

# your data
df <- data.frame(A = rep(c("fruit", "juice"), c(5, 2)),
    B = c("red apple", "green apple", "yellow apple", "kiwi", "golden kiwi", "apple juice", "orange juice"))

# vector of strings to search for
lookingfor <- c("apple", "kiwi", "orange", "banana")

# function to split up words in df$B and find those that match to those in looking for
found <- function(longname, shortnames) {
    splitlong <- strsplit(longname, " ")[[1]]
    index <- match(splitlong, shortnames)
    res <- if(all(is.na(index))) NA else shortnames[index[!is.na(index)][1]]
    res
    }

# apply the function to your data
df$C <- sapply(df$B, found, shortnames=lookingfor)

# summarize
aggregate(data.frame(freq=!is.na(df$C)), list(A=df$A, B=df$C), sum)
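
To make the exact-match caveat concrete, here is a small sketch that repeats the found function from above and compares it with a plain substring search. The inputs "red apple" and "red apples" are made-up examples, not rows from the question's data:

```r
lookingfor <- c("apple", "kiwi", "orange", "banana")

# Same logic as the function above: split on spaces, match whole words
found <- function(longname, shortnames) {
  words <- strsplit(longname, " ")[[1]]
  index <- match(words, shortnames)
  if (all(is.na(index))) NA else shortnames[index[!is.na(index)][1]]
}

exact   <- found("red apple", lookingfor)   # "apple" is a whole word here
plural  <- found("red apples", lookingfor)  # "apples" is not in lookingfor
by_grep <- grepl("apple", "red apples")     # a regex matches the substring
```

With whole-word matching, the plural form is not recognised, whereas a regular expression would still find it. Which behaviour is preferable depends on how clean the values in B are.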

Answer 4

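Here is one way. First find out how many "categories" there are in the first column.

categs <- unique(data[,1])
fruits <- c('apple','orange', 'kiwi') # or whatever
# one row per (category, fruit) pair; columns: category, fruit, count
results <- matrix(nrow=length(categs)*length(fruits), ncol=3)

Then, for each value in categs, search for each of the known fruits.

for (j in seq_along(categs)) {
    for (k in seq_along(fruits)) {
        i <- (j-1)*length(fruits) + k   # row for this category/fruit pair
        results[i,1] <- as.character(categs[j])
        results[i,2] <- fruits[k]
        # count rows of this category whose second column contains the fruit
        results[i,3] <- sum(grepl(fruits[k], data[data[,1]==categs[j],2]))
        }
    }

I haven't tested this, so double-check the indexing. Note that pairs with no matches are kept with a count of 0.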
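
The same nested search can also be sketched without manual row indexing, using expand.grid to enumerate the category/fruit pairs. This is only an illustration under the assumption that the data sits in a two-column data frame called data; grid is a made-up name:

```r
# Example data from the question, assumed to be stored in `data`
data <- data.frame(
  A = rep(c("fruit", "juice"), c(5, 2)),
  B = c("red apple", "green apple", "yellow apple", "kiwi",
        "golden kiwi", "apple juice", "orange juice"),
  stringsAsFactors = FALSE
)
fruits <- c("apple", "orange", "kiwi")

# Every combination of category and fruit, one per row
grid <- expand.grid(A = unique(data[, 1]), B = fruits,
                    stringsAsFactors = FALSE)

# For each pair, count rows of that category whose text contains the fruit
grid$freq <- mapply(
  function(a, b) sum(grepl(b, data[data[, 1] == a, 2], fixed = TRUE)),
  grid$A, grid$B,
  USE.NAMES = FALSE
)

# Keep only the pairs that actually occur
grid <- grid[grid$freq > 0, ]
grid
```

Letting expand.grid build the pairs avoids computing row positions by hand, which is where the loop version is easiest to get wrong.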

Answer 5

Taking the dataset you have and calling it dat, you can do the following:

library(dplyr)

dat %>%

  mutate(B = sub(' juice', '', B),
         B = ifelse(grepl(' apple', B), 'apple', B),
         B = ifelse(grepl('golden ', B), sub('golden ', '', B), B)) %>%

group_by(A, B) %>%
summarise(count = n())

Additional rules would have to be added to the mutate statement for other values.

Count frequency in R by matching strings

5 answers: