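I have an e-book text file named Frankenstein.txt, and I want to know how many times each letter is used in the novel.
My setup:
I imported the text file and obtained a character vector, character_array, like this: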
string <- readChar("Frankenstein.txt", filesize)
character_array <- unlist(strsplit(string, ""))
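The snippet above never defines filesize. A minimal sketch of the same read step, assuming filesize was meant to hold the file's size in bytes; the temporary file and its one-word content are stand-ins for Frankenstein.txt:

```r
# Stand-in for Frankenstein.txt: a small temporary file (assumption for illustration)
path <- tempfile(fileext = ".txt")
writeBin(charToRaw("Frankenstein\r\n"), path)

# file.info()$size reports the size in bytes; readChar() reads up to that many characters
filesize <- file.info(path)$size
string <- readChar(path, filesize)

# Split the single string into a vector of one-character strings
character_array <- unlist(strsplit(string, ""))
character_array
```

For text with multi-byte characters the byte count is larger than the character count, so this reads to the end of the file rather than stopping short.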
Printing character_array gives me something like this:
"F" "r" "a" "n" "k" "e" "n" "s" "t" "e" "i" "n" "\r", ...
My goal:
I want a count of how many times each character occurs in the text file. In other words, I want a count for every element of unique(character_array):
[1] "F" "r" "a" "n" "k" "e" "s" "t" "i" "\r" "\n" "b" "y" "M"
[15] " " "W" "o" "l" "c" "f" "(" "G" "d" "w" ")" "S" "h" "C"
[29] "O" "N" "T" "E" "L" "1" "2" "3" "4" "p" "5" "6" "7" "8"
[43] "9" "0" "_" "." "v" "," "g" "P" "u" "D" "—" "Y" "j" "m"
[57] "I" "z" "?" ";" "x" "q" "B" "U" "’" "H" "-" "A" "!" ":"
[71] "R" "J" "“" "”" "æ" "V" "K" "[" "]" "‘" "ê" "ô" "é" "è"
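What I tried:
When I call plot(as.factor(character_array)), I get a nice plot that shows visually what I want.
However, I need the exact count for each character. I would like something like a 2D array: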
[,1] [,2] [,3] [,4] ...
[1,] "a" "A" "b" "B" ...
[2,] "1202" "50" "12" "9" ...
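Answer 0 (score: 4)
A good way to build this kind of text-processing pipeline is with the %>% pipe from magrittr. Here is one approach, assuming your text is in "frank.txt" (see the bottom for an explanation of each step):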
library(magrittr)
# read the text in
frank_txt <- readLines("frank.txt")
# then send the text down this pipeline:
frank_txt %>%
  paste(collapse="") %>%
  strsplit(split="") %>% unlist %>%
  `[`(!. %in% c("", " ", ".", ",")) %>%
  table %>%
  barplot
Note that you can stop at table() and assign the result to a variable, which you can then manipulate however you like, e.g. by plotting it:
char_counts <- frank_txt %>% paste(collapse="") %>%
  strsplit(split="") %>% unlist %>% `[`(!. %in% c("", " ", ".", ",")) %>%
  table
barplot(char_counts)
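To get the two-row layout the question asks for, one option is to stack the table's names and counts with rbind(). This is a minimal base R sketch, not part of the original answer; the one-word input stands in for the novel's characters:

```r
# Small stand-in for the character vector built from the novel (assumption)
chars <- unlist(strsplit("Frankenstein", ""))

# table() counts how often each distinct value occurs
char_counts <- table(chars)

# Stack the characters and their counts into a 2-row character matrix,
# matching the layout sketched in the question
count_matrix <- rbind(names(char_counts), as.vector(char_counts))
count_matrix
```

Because a matrix holds a single type, the counts are coerced to strings here, just as in the question's sketch. Keep char_counts itself if you need the numbers.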
You can also convert the table to a data frame, which makes later manipulation and plotting easier:
counts_df <- data.frame(
  char = names(char_counts),
  count = as.numeric(char_counts),
  stringsAsFactors=FALSE)
head(counts_df)
## char count
## a 13
## b 2
## c 7
## d 5
## e 24
## f 6
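Once the counts are in a data frame, ordinary data-frame operations apply. As an illustrative sketch (the sample sentence is a stand-in for frank_txt, not data from the answer), this orders the rows from most to least frequent character:

```r
# Stand-in text; in the answer this would come from frank_txt (assumption)
chars <- unlist(strsplit("It was on a dreary night of November", ""))
chars <- chars[!chars %in% c("", " ", ".", ",")]
char_counts <- table(chars)

counts_df <- data.frame(
  char = names(char_counts),
  count = as.numeric(char_counts),
  stringsAsFactors = FALSE)

# Order rows from most to least frequent character
counts_df <- counts_df[order(counts_df$count, decreasing = TRUE), ]
head(counts_df)
```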
Each step explained: here is the full pipeline chain, with every step annotated:
# going to send this text down a pipeline:
frank_txt %>%
  # combine lines into a single string (makes things easier downstream)
  paste(collapse="") %>%
  # tokenize by character (strsplit returns a list, so unlist it)
  strsplit(split="") %>% unlist %>%
  # remove instances of characters you don't care about
  `[`(!. %in% c("", " ", ".", ",")) %>%
  # make a frequency table of the characters
  table %>%
  # then plot them
  barplot
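Note that this is exactly equivalent to the following horrible ("monstrous"?!?!) code. The forward pipe %>% simply applies the function on its right to the value on its left (and . is a pronoun standing for the left-hand value; see the magrittr intro vignette):
barplot(table(
  unlist(strsplit(paste(frank_txt, collapse=""), split=""))[
    !unlist(strsplit(paste(frank_txt, collapse=""), split="")) %in%
      c(""," ",".",",")]))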
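If you would rather avoid both the pipe and the nesting, the same steps can be written with named intermediate values. This is a sketch in base R only; the two sample lines stand in for readLines("frank.txt"), and the variable names are made up for illustration:

```r
# Stand-in for readLines("frank.txt") (assumption for illustration)
frank_txt <- c("You will rejoice to hear", "that no disaster has accompanied")

one_string <- paste(frank_txt, collapse = "")       # combine lines
chars <- unlist(strsplit(one_string, split = ""))   # tokenize by character
kept <- chars[!chars %in% c("", " ", ".", ",")]     # drop unwanted characters
char_counts <- table(kept)                          # frequency table

# The nested one-liner, for comparison
nested <- table(
  unlist(strsplit(paste(frank_txt, collapse = ""), split = ""))[
    !unlist(strsplit(paste(frank_txt, collapse = ""), split = "")) %in%
      c("", " ", ".", ",")])

# Both approaches yield the same characters and the same counts
identical(as.vector(char_counts), as.vector(nested))
```

The intermediate-variable form also avoids computing the strsplit() result twice, which the nested version does.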
Answer 1 (score: 1)
Using gutenbergr, tidytext and dplyr, you can do the following:
library(gutenbergr)
library(tidytext)
library(dplyr)
frank <- gutenberg_download(c(84), meta_fields = "title")
This drops unwanted characters such as . [ ] etc.:
frank %>%
  unnest_tokens(chars, text, "characters") %>%
  group_by(chars) %>%
  summarise(n = n()) %>%
  t() # transpose to get the layout the OP asked for
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
chars "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "a" "b" "c" "d" "e" "f"
n " 2" " 35" " 15" " 6" " 4" " 4" " 3" " 16" " 5" " 4" "25733" " 4749" " 8644" "16327" "44210" " 8341"
[,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32]
chars "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v"
n " 5564" "19194" "23483" " 413" " 1617" "12239" "10237" "23306" "23886" " 5672" " 313" "19647" "20380" "28835" " 9897" " 3717"
[,33] [,34] [,35] [,36]
chars "w" "x" "y" "z"
n " 7364" " 649" " 7578" " 239"
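The output above contains only digits and lower-case letters. For plain ASCII text, a rough base R sketch with a similar effect is to lower-case everything and keep only letters and digits. This is an approximation for illustration, not how tidytext is implemented, and the two sample lines are stand-ins for the downloaded text:

```r
# Stand-in for the downloaded text (assumption for illustration)
text <- c("Letter 1", "To Mrs. Saville, England.")

# Lower-case, join the lines, and split into single characters
chars <- unlist(strsplit(tolower(paste(text, collapse = "")), ""))

# Keep only ASCII letters and digits
chars <- chars[grepl("^[a-z0-9]$", chars)]

char_counts <- table(chars)
char_counts
```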
If you do want those characters, the code looks like this:
frank %>%
  unnest_tokens(chars, text, stringr::str_split, pattern = "") %>%
  group_by(chars) %>%
  summarise(n = n()) %>%
  t() # transpose to get the layout the OP asked for
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
chars "'" "-" " " "!" "\"" "(" ")" "," "." ":" ";" "?" "[" "]" "_" "0"
n " 221" " 370" "71202" " 238" " 774" " 16" " 16" " 4945" " 2904" " 48" " 970" " 220" " 3" " 3" " 2" " 2"
[,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32]
chars "1" "2" "3" "4" "5" "6" "7" "8" "9" "a" "b" "c" "d" "e" "f" "g"
n " 35" " 15" " 6" " 4" " 4" " 3" " 16" " 5" " 4" "25733" " 4749" " 8644" "16327" "44210" " 8341" " 5564"
[,33] [,34] [,35] [,36] [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48]
chars "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w"
n "19194" "23483" " 413" " 1617" "12239" "10237" "23306" "23886" " 5672" " 313" "19647" "20380" "28835" " 9897" " 3717" " 7364"
[,49] [,50] [,51]
chars "x" "y" "z"
n " 649" " 7578" " 239"