Counting the number of occurrences (frequency) of strings

Time: 2016-04-27 00:03:08

Tags: r count grep word-frequency

My data frame has a column like the following:

   Col1
   ----------------------------------------------------------------------------
   Center for Animal Control, Division of Hypertension, Department of Medicine
   Department of Surgery, Division of Primary Care, Center for Animal Control
   Department of Internal Medicine, Division of Hypertension, Center for Animal Control

How can I count how often each comma-separated string occurs? In other words, this is the result I am trying to get:

    Affiliation                         Freq
    ------------------------------------------
    Center for Animal Control           3
    Division of Hypertension            2
    Department of Medicine              1
    Department of Surgery               1
    Division of Primary Care            1
    Department of Internal Medicine     1  

Could someone help me solve this?

5 answers:

Answer 0: (score: 1)

Here is one way. It also replaces '\n' with a comma, because the text contains some newlines.

df <- data.frame(col1 = rep("Center for Animal Control, Division of Hypertension, Department of Medicine, Department of Surgery, Division of Primary Care, Center for Animal Control, Department of Internal Medicine, Division of Hypertension, Center for Animal Control", 1), stringsAsFactors = FALSE)
df$col1 <- gsub('\\n', ', ', df$col1)
as.data.frame(table(unlist(strsplit(df$col1, ', '))))

The output (for the original data) is:

                             Var1 Freq
1       Center for Animal Control    3
2 Department of Internal Medicine    1
3          Department of Medicine    1
4           Department of Surgery    1
5        Division of Hypertension    2
6        Division of Primary Care    1
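The newline replacement only has an effect when a single cell holds several lines, which the one-line sample above does not show. As a minimal sketch, assuming a hypothetical one-cell data frame df_nl whose text contains embedded newlines (df_nl, flat and freq_nl are illustrative names, not part of the original post), the same steps might look like this:

```r
# Hypothetical input: one cell holding three lines separated by "\n"
df_nl <- data.frame(
  col1 = paste(
    "Center for Animal Control, Division of Hypertension, Department of Medicine",
    "Department of Surgery, Division of Primary Care, Center for Animal Control",
    "Department of Internal Medicine, Division of Hypertension, Center for Animal Control",
    sep = "\n"
  ),
  stringsAsFactors = FALSE
)

# Turn each newline into the same ", " separator that is used within a line
flat <- gsub("\n", ", ", df_nl$col1, fixed = TRUE)

# Split on the separator, flatten the resulting list, and tabulate the pieces
freq_nl <- as.data.frame(
  table(unlist(strsplit(flat, ", ", fixed = TRUE))),
  stringsAsFactors = FALSE
)
```

Splitting on the exact string ", " assumes every separator is a comma followed by one space; if the spacing varies, splitting on "," and trimming whitespace afterwards is the more forgiving choice.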

Answer 1: (score: 1)

Assumption: "Center for Animal Control, Division of Hypertension, Department of Medicine" is the value in row 1, "Department of Surgery, Division of Primary Care, Center for Animal Control" is the value in row 2, and so on.

df is the data frame.

aff_val <- trimws(unlist(strsplit(df$col1,",")))

ans <- data.frame(table(aff_val))

colnames(ans)[1] <- 'Affiliation'
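To get the layout asked for in the question (an Affiliation column, sorted by descending count), the steps above can be sketched end to end. The data frame below is rebuilt from the question's three rows; the lowercase column name col1 follows this answer's assumption, and stringsAsFactors = FALSE is an added precaution so that strsplit receives character input on older R versions.

```r
# Sample data: one row per record, as assumed in this answer
df <- data.frame(
  col1 = c(
    "Center for Animal Control, Division of Hypertension, Department of Medicine",
    "Department of Surgery, Division of Primary Care, Center for Animal Control",
    "Department of Internal Medicine, Division of Hypertension, Center for Animal Control"
  ),
  stringsAsFactors = FALSE
)

# Split every row on commas, flatten, and strip the leftover spaces
aff_val <- trimws(unlist(strsplit(df$col1, ",", fixed = TRUE)))

# Naming the table dimension sets the first column name directly
ans <- as.data.frame(table(Affiliation = aff_val), stringsAsFactors = FALSE)

# Sort by descending frequency and renumber the rows
ans <- ans[order(-ans$Freq), ]
rownames(ans) <- NULL
```

Passing the vector to table() as a named argument is an alternative to renaming the column afterwards with colnames(); both give the same Affiliation/Freq columns.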

Answer 2: (score: 1)

I use scan and trimws for this kind of text-processing task.

inp <- "    Center for Animal Control, Division of Hypertension, Department of Medicine
    Department of Surgery, Division of Primary Care, Center for Animal Control
    Department of Internal Medicine, Division of Hypertension, Center for Animal Control"

> table( trimws(scan(text=inp, what="", sep=",")))
Read 9 items

      Center for Animal Control Department of Internal Medicine 
                              3                               1 
         Department of Medicine           Department of Surgery 
                              1                               1 
       Division of Hypertension        Division of Primary Care 
                              2                               1 

You can also wrap as.data.frame around that result:

> as.data.frame(table(  trimws(scan(text=inp, what="", sep=","))))
Read 9 items
                             Var1 Freq
1       Center for Animal Control    3
2 Department of Internal Medicine    1
3          Department of Medicine    1
4           Department of Surgery    1
5        Division of Hypertension    2
6        Division of Primary Care    1

Answer 3: (score: 0)

The answer is:

library(stringr)
a<-"Center for Animal Control, Division of Hypertension, Department of Medicine
Department of Surgery, Division of Primary Care, Center for Animal Control
Department of Internal Medicine, Division of Hypertension, Center for Animal Control"
con<-textConnection(a)
tbl<-read.table(con,sep=",")
vec<-str_trim(unlist(tbl))
as.data.frame(table(vec))

Answer 4: (score: 0)

text = "Center for Animal Control, Division of Hypertension, Department of Medicine
Department of Surgery, Division of Primary Care, Center for Animal Control
Department of Internal Medicine, Division of Hypertension, Center for Animal Control"

library(stringi)
library(dplyr)
library(tidyr)

# tibble() (re-exported by dplyr) replaces the deprecated data_frame()
tibble(text = text) %>%
  mutate(line = text %>% stri_split_fixed("\n") ) %>%
  unnest(line) %>%
  mutate(phrase = line %>% stri_split_fixed(", ") ) %>%
  unnest(phrase) %>%
  count(phrase)