Finding the top ten words in a text with R

Time: 2014-10-02 11:24:15

Tags: regex, r

I am new to R and also new to regular expressions. I have looked for this in other discussions but could not find a good match.

I have a large text data set (a book). I use the following code to identify the words in the text:

> a <- gregexpr("[a-zA-Z0-9'\\-]+", book[1])

> regmatches(book[1], a)
[[1]]
[1] "she" "runs"

I would now like to split all of the text in the whole data set (the book) into individual words, so that I can determine the ten most frequent words in the whole text (and label them). I then need to count the words with the table function and sort them somehow to get the top ten, roughly as sketched below.
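Something like this is what I have in mind (a rough sketch extending the code above to the whole book; I am not sure it is correct):

a <- gregexpr("[a-zA-Z0-9'\\-]+", book)          # match words in every element of book
words <- unlist(regmatches(book, a))             # flatten into one vector of words
head(sort(table(words), decreasing = TRUE), 10)  # count and keep the ten most frequent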

Also, any ideas on how to compute the cumulative distribution, i.e. how many words are needed to cover half (50%) of all the words?

Many thanks for your replies and for your patience with my basic questions.

3 Answers:

Answer 0 (score: 5):

Not regular expressions, but perhaps more what you are after without the fuss... here is a qdap approach using Thomas' data (P.S. nice data approach):

u <- "http://www.gutenberg.org/cache/epub/1404/pg1404.txt"
library("httr")
book <- httr::content(GET(u))

library(qdap)
freq_terms(book, 10)

##    WORD  FREQ
## 1  the  18195
## 2  of   12015
## 3  to    7177
## 4  and   5191
## 5  in    4518
## 6  a     4051
## 7  be    3846
## 8  that  2800
## 9  it    2565
## 10 is    2218

The nice thing about this is that you can control:

  1. stopwords
  2. minimum word length via at.least
  3. ties via extend = TRUE (the default)
  4. the plot method for the output

Here it is again using the stopwords and minimum-length settings (often these two arguments overlap, since stop words tend to be the shortest words), plus a plot:

    (ft <- freq_terms(book, 10, at.least=3, stopwords=qdapDictionaries::Top25Words))
    plot(ft)
    
    ##    WORD       FREQ
    ## 1  which      2075
    ## 2  would      1273
    ## 3  will       1257
    ## 4  not        1238
    ## 5  their      1098
    ## 6  states      864
    ## 7  may         839
    ## 8  government  830
    ## 9  been        798
    ## 10 state       792
    

[bar plot of the ten most frequent terms, produced by plot(ft)]

Answer 1 (score: 4):

To get the word frequencies:

> mytext = c("This","is","a","test","for","count","of","the","words","The","words","have","been","written","very","randomly","so","that","the","test","can","be","for","checking","the","count")

> sort(table(mytext), decreasing=T)
mytext
     the    count      for     test    words        a       be     been      can checking     have       is       of randomly       so     that      The     This     very 
       3        2        2        2        2        1        1        1        1        1        1        1        1        1        1        1        1        1        1 
 written 
       1 

To ignore case:

> mytext = tolower(mytext)
> 
> sort(table(mytext), decreasing=T)
mytext
     the    count      for     test    words        a       be     been      can checking     have       is       of randomly       so     that     this     very  written 
       4        2        2        2        2        1        1        1        1        1        1        1        1        1        1        1        1        1        1 
> 

Only the top ten words:

> sort(table(mytext), decreasing=T)[1:10]
mytext
     the    count      for     test    words        a       be     been      can checking 
       4        2        2        2        2        1        1        1        1        1 
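For the cumulative-distribution part of the question, one possible sketch building on the same sorted table might be:

freqs <- sort(table(mytext), decreasing = TRUE)  # word counts, most frequent first
coverage <- cumsum(freqs) / sum(freqs)           # cumulative share of all word occurrences
which(coverage >= 0.5)[1]                        # number of distinct words covering 50% of the text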

Answer 2 (score: 4):

You can use regular expressions for this, but a text-mining package will give you more flexibility. For example, for basic word separation you only need to do the following:

u <- "http://www.gutenberg.org/cache/epub/1404/pg1404.txt"
library("httr")
book <- httr::content(GET(u))

w <- strsplit(book, "[[:space:]]+")[[1]]
tail(sort(table(w)), 10)
# w
# which    is  that    be     a    in   and    to    of   the 
#  1968  1995  2690  3766  3881  4184  4943  6905 11896 16726

However, if you want to, for example, remove common stop words or handle capitalization better (in the above, Hello and hello are not counted together), you should dig into tm:

library("tm")
s <- URISource(u)
corpus <- VCorpus(s)

m <- DocumentTermMatrix(corpus)
findFreqTerms(m, 600) # words appearing at least 600 times
# "all"   "and"   "are"   "been"  "but"   "for"   "from"  "have"  "its" "may"  
# "not"   "that"  "the"   "their" "they"  "this"  "which" "will"  "with" "would"

c2 <- tm_map(corpus, removeWords, stopwords("english"))
m2 <- DocumentTermMatrix(c2)
findFreqTerms(m2, 400) # words appearing at least 400 times
# [1] "can" "government" "may" "must" "one" "power" "state" "the" "will"
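findFreqTerms only returns the terms above a count threshold; to get an explicit top ten out of the document-term matrix, something along these lines should work (a sketch, assuming m2 is small enough to convert to a dense matrix):

term_counts <- sort(colSums(as.matrix(m2)), decreasing = TRUE)  # total count per term
head(term_counts, 10)                                           # the ten most frequent terms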