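Character frequency of a vector in R

Asked: 2018-03-18 16:09:29

Tags: r

I have an e-book text file named Frankenstein.txt, and I want to know how many times each letter is used in the novel.

My setup:

I imported the text file and obtained a character vector, character_array, like this: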

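# filesize is assumed to hold the file size in bytes, so readChar reads the whole file
filesize <- file.info("Frankenstein.txt")$size
string <- readChar("Frankenstein.txt", filesize)
character_array <- unlist(strsplit(string, ""))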

character_array then looks something like this:

 "F" "r" "a" "n" "k" "e" "n" "s" "t" "e" "i" "n" "\r", ...

My goal:

I want the number of times each character appears in the text file. In other words, I want a count for each element of unique(character_array):

 [1] "F"  "r"  "a"  "n"  "k"  "e"  "s"  "t"  "i"  "\r" "\n" "b"  "y"  "M" 
 [15] " "  "W"  "o"  "l"  "c"  "f"  "("  "G"  "d"  "w"  ")"  "S"  "h"  "C" 
 [29] "O"  "N"  "T"  "E"  "L"  "1"  "2"  "3"  "4"  "p"  "5"  "6"  "7"  "8" 
 [43] "9"  "0"  "_"  "."  "v"  ","  "g"  "P"  "u"  "D"  "—"  "Y"  "j"  "m" 
 [57] "I"  "z"  "?"  ";"  "x"  "q"  "B"  "U"  "’"  "H"  "-"  "A"  "!"  ":" 
 [71] "R"  "J"  "“"  "”"  "æ"  "V"  "K"  "["  "]"  "‘"  "ê"  "ô"  "é"  "è" 

My attempt:

When I call plot(as.factor(character_array)), I get a nice chart that gives me the visual I want. However, I need the exact value for each character. I am imagining something like a 2D array:

    [,1]   [,2] [,3] [,4] ... 
[1,] "a"    "A"  "b"  "B" ...
[2,] "1202" "50" "12" "9" ...

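2 Answers:

Answer 0 (score: 4)

A nice way to build this kind of text-processing pipeline is to use the magrittr::%>% pipe. Here is one approach, assuming your text is in "frank.txt" (see the bottom for an explanation of each step):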

library(magrittr)

# read the text in 
frank_txt <- readLines("frank.txt")

# then send the text down this pipeline:
frank_txt %>% 
  paste(collapse="") %>% 
  strsplit(split="") %>% unlist %>% 
  `[`(!. %in% c("", " ", ".", ",")) %>% 
  table %>% 
  barplot

Note that you can stop at table() and assign the result to a variable, which you can then manipulate however you like, for example by plotting it:

char_counts <- frank_txt %>% paste(collapse="") %>% 
  strsplit(split="") %>% unlist %>% `[`(!. %in% c("", " ", ".", ",")) %>%
  table

barplot(char_counts)

You can also convert the table to a data frame, which makes later manipulation and plotting easier:

counts_df <- data.frame(
  char = names(char_counts), 
  count = as.numeric(char_counts), 
  stringsAsFactors=FALSE)

head(counts_df)
## char count
##   a    13
##   b     2
##   c     7
##   d     5
##   e    24
##   f     6
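
For the two-row layout the question asks for, a minimal base R sketch might look like the following. It is not part of the original answer: the sample string and the names `sample_text`, `chars`, `counts` and `result` are made up for illustration, and you would substitute the text read from your own file.

```r
# Hypothetical sample input; replace with the text read from your file
sample_text <- "Frankenstein"

# Split into single characters and tabulate them
chars <- unlist(strsplit(sample_text, ""))
counts <- table(chars)

# Exact count for a single character, looked up by name
counts[["n"]]

# Two-row character matrix: characters in row 1, counts in row 2
# (rbind coerces the integer counts to character, as in the question's sketch)
result <- rbind(names(counts), as.vector(counts))
result
```

The order of the columns follows the sort order of table(), which depends on the locale, so look values up by name rather than by position.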



Each step explained: here is the full pipe chain again, with every step annotated:

# going to send this text down a pipeline:
frank_txt %>% 
  # combine lines into a single string (makes things easier downstream)
  paste(collapse="") %>% 
  # tokenize by character (strsplit returns a list, so unlist it)
  strsplit(split="") %>% unlist %>% 
  # remove instances of characters you don't care about
  `[`(!. %in% c("", " ", ".", ",")) %>% 
  # make a frequency table of the characters
  table %>% 
  # then plot them
  barplot

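Note that this is exactly equivalent to the following monstrous ("Frankensteinian"?!?!) code. The forward pipe %>% simply applies the function on its right to the value on its left (and . is a pronoun standing for the left-hand value; see the magrittr intro vignette):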

barplot(table(
  unlist(strsplit(paste(frank_txt, collapse=""), split=""))[
    !unlist(strsplit(paste(frank_txt, collapse=""), split="")) %in% 
      c(""," ",".",",")]))
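
To see the equivalence on a small scale, here is a sketch with a made-up vector `x` (an assumption for illustration, not data from the book). It assumes the magrittr package is installed and compares the piped form with the nested form.

```r
library(magrittr)

# Hypothetical toy input standing in for the split-up book text
x <- c("b", "a", " ", "a", "")

# Piped form: x becomes the first argument of `[`, and . refers to x
piped <- x %>% `[`(!. %in% c("", " ")) %>% table

# Nested form: the same calls written inside out
nested <- table(x[!x %in% c("", " ")])

# Same characters and same counts in both results
identical(names(piped), names(nested))
identical(as.vector(piped), as.vector(nested))
```

The two tables differ only in the label attached to the dimension, because table() records how its argument was written; the characters and counts are the same.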

Answer 1 (score: 1)

Using gutenbergr, tidytext and dplyr, you can do the following:

library(gutenbergr)
library(tidytext)
library(dplyr)

frank <- gutenberg_download(c(84), meta_fields = "title")

Removing the unwanted characters such as . [ ] etc.:

frank %>% 
  unnest_tokens(chars, text, "characters") %>% 
  group_by(chars) %>% 
  summarise(n = n()) %>% 
  t() #transpose to get in order of OP
      [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]    [,9]    [,10]   [,11]   [,12]   [,13]   [,14]   [,15]   [,16]  
chars "0"     "1"     "2"     "3"     "4"     "5"     "6"     "7"     "8"     "9"     "a"     "b"     "c"     "d"     "e"     "f"    
n     "    2" "   35" "   15" "    6" "    4" "    4" "    3" "   16" "    5" "    4" "25733" " 4749" " 8644" "16327" "44210" " 8341"
      [,17]   [,18]   [,19]   [,20]   [,21]   [,22]   [,23]   [,24]   [,25]   [,26]   [,27]   [,28]   [,29]   [,30]   [,31]   [,32]  
chars "g"     "h"     "i"     "j"     "k"     "l"     "m"     "n"     "o"     "p"     "q"     "r"     "s"     "t"     "u"     "v"    
n     " 5564" "19194" "23483" "  413" " 1617" "12239" "10237" "23306" "23886" " 5672" "  313" "19647" "20380" "28835" " 9897" " 3717"
      [,33]   [,34]   [,35]   [,36]  
chars "w"     "x"     "y"     "z"    
n     " 7364" "  649" " 7578" "  239"
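
The output above contains only lowercase letters and digits because the "characters" tokenizer lowercases the text and drops non-alphanumeric characters by default. As a rough base R approximation of that behaviour (a sketch only, not the tidytext implementation; `sample_text` and the other names are invented for illustration):

```r
# Hypothetical sample line; the real input would be the downloaded book text
sample_text <- "Frankenstein; or, the Modern Prometheus [1818]"

# Lowercase, then split into single characters
chars <- unlist(strsplit(tolower(sample_text), ""))

# Keep letters and digits only, dropping spaces and punctuation
kept <- chars[grepl("^[[:alnum:]]$", chars)]

# Tabulate and turn into a data frame sorted by descending count
counts <- table(kept)
counts_df <- data.frame(
  chars = names(counts),
  n = as.vector(counts),
  stringsAsFactors = FALSE)
counts_df <- counts_df[order(-counts_df$n), ]
head(counts_df)
```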

If you do want those characters, the code looks like this:

frank %>% 
  unnest_tokens(chars, text, stringr::str_split, pattern = "") %>% 
  group_by(chars) %>% 
  summarise(n = n()) %>% 
  t() #transpose to get in order of OP

      [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]    [,9]    [,10]   [,11]   [,12]   [,13]   [,14]   [,15]   [,16]  
chars "'"     "-"     " "     "!"     "\""    "("     ")"     ","     "."     ":"     ";"     "?"     "["     "]"     "_"     "0"    
n     "  221" "  370" "71202" "  238" "  774" "   16" "   16" " 4945" " 2904" "   48" "  970" "  220" "    3" "    3" "    2" "    2"
      [,17]   [,18]   [,19]   [,20]   [,21]   [,22]   [,23]   [,24]   [,25]   [,26]   [,27]   [,28]   [,29]   [,30]   [,31]   [,32]  
chars "1"     "2"     "3"     "4"     "5"     "6"     "7"     "8"     "9"     "a"     "b"     "c"     "d"     "e"     "f"     "g"    
n     "   35" "   15" "    6" "    4" "    4" "    3" "   16" "    5" "    4" "25733" " 4749" " 8644" "16327" "44210" " 8341" " 5564"
      [,33]   [,34]   [,35]   [,36]   [,37]   [,38]   [,39]   [,40]   [,41]   [,42]   [,43]   [,44]   [,45]   [,46]   [,47]   [,48]  
chars "h"     "i"     "j"     "k"     "l"     "m"     "n"     "o"     "p"     "q"     "r"     "s"     "t"     "u"     "v"     "w"    
n     "19194" "23483" "  413" " 1617" "12239" "10237" "23306" "23886" " 5672" "  313" "19647" "20380" "28835" " 9897" " 3717" " 7364"
      [,49]   [,50]   [,51]  
chars "x"     "y"     "z"    
n     "  649" " 7578" "  239"