Finding the top ten words in a text with R

Time: 2014-10-02 11:24:15

Tags: regex, r

I am new to R and also new to regular expressions. I have looked for this in other discussions but could not find a good match.

I have a large text data set (a book). I use the following code to identify the words in the text:

> a <- gregexpr("[a-zA-Z0-9'\\-]+", book[1])

> regmatches(book[1], a)
[[1]]
[1] "she" "runs"

I would now like to split all of the text in the whole data set (the book) into individual words, so that I can determine the ten most frequent words in the whole text (and label them). I then need to count the words with the table function and sort them somehow to get the top ten, roughly as sketched below.
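Something like this is what I have in mind (a rough sketch extending the code above to the whole book; I am not sure it is correct):

a <- gregexpr("[a-zA-Z0-9'\\-]+", book)          # match words in every element of book
words <- unlist(regmatches(book, a))             # flatten into one vector of words
head(sort(table(words), decreasing = TRUE), 10)  # count and keep the ten most frequent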

Also, any ideas on how to compute the cumulative distribution, i.e. how many words are needed to cover half (50%) of all the words?

Many thanks for your replies and for your patience with my basic questions.

3 Answers:

Answer 0 (score: 5):

Not regular expressions, but perhaps more what you are after without the fuss... here is a qdap approach using Thomas' data (P.S. nice data approach):

u <- "http://www.gutenberg.org/cache/epub/1404/pg1404.txt"
library("httr")
book <- httr::content(GET(u))

library(qdap)
freq_terms(book, 10)

##    WORD  FREQ
## 1  the  18195
## 2  of   12015
## 3  to    7177
## 4  and   5191
## 5  in    4518
## 6  a     4051
## 7  be    3846
## 8  that  2800
## 9  it    2565
## 10 is    2218

The nice thing about this is that you can control:

  1. stopwords
  2. minimum word length via at.least
  3. ties via extend = TRUE (the default)
  4. the plot method for the output

Here it is again using the stopwords and minimum-length settings (often these two arguments overlap, since stop words tend to be the shortest words), plus a plot:

    (ft <- freq_terms(book, 10, at.least=3, stopwords=qdapDictionaries::Top25Words))
    plot(ft)
    
    ##    WORD       FREQ
    ## 1  which      2075
    ## 2  would      1273
    ## 3  will       1257
    ## 4  not        1238
    ## 5  their      1098
    ## 6  states      864
    ## 7  may         839
    ## 8  government  830
    ## 9  been        798
    ## 10 state       792
    

[bar plot of the ten most frequent terms, produced by plot(ft)]

Answer 1 (score: 4):

To get the word frequencies:

> mytext = c("This","is","a","test","for","count","of","the","words","The","words","have","been","written","very","randomly","so","that","the","test","can","be","for","checking","the","count")

> sort(table(mytext), decreasing=T)
mytext
     the    count      for     test    words        a       be     been      can checking     have       is       of randomly       so     that      The     This     very 
       3        2        2        2        2        1        1        1        1        1        1        1        1        1        1        1        1        1        1 
 written 
       1 

To ignore case:

> mytext = tolower(mytext)
> 
> sort(table(mytext), decreasing=T)
mytext
     the    count      for     test    words        a       be     been      can checking     have       is       of randomly       so     that     this     very  written 
       4        2        2        2        2        1        1        1        1        1        1        1        1        1        1        1        1        1        1 
> 

Only the top ten words:

> sort(table(mytext), decreasing=T)[1:10]
mytext
     the    count      for     test    words        a       be     been      can checking 
       4        2        2        2        2        1        1        1        1        1 
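For the cumulative-distribution part of the question, one possible sketch building on the same sorted table might be:

freqs <- sort(table(mytext), decreasing = TRUE)  # word counts, most frequent first
coverage <- cumsum(freqs) / sum(freqs)           # cumulative share of all word occurrences
which(coverage >= 0.5)[1]                        # number of distinct words covering 50% of the text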

Answer 2 (score: 4):

You can use regular expressions for this, but a text-mining package will give you more flexibility. For example, for basic word separation you only need to do the following:

u <- "http://www.gutenberg.org/cache/epub/1404/pg1404.txt"
library("httr")
book <- httr::content(GET(u))

w <- strsplit(book, "[[:space:]]+")[[1]]
tail(sort(table(w)), 10)
# w
# which    is  that    be     a    in   and    to    of   the 
#  1968  1995  2690  3766  3881  4184  4943  6905 11896 16726

However, if you want to, for example, remove common stop words or handle capitalization better (in the above, Hello and hello are not counted together), you should dig into tm:

library("tm")
s <- URISource(u)
corpus <- VCorpus(s)

m <- DocumentTermMatrix(corpus)
findFreqTerms(m, 600) # words appearing at least 600 times
# "all"   "and"   "are"   "been"  "but"   "for"   "from"  "have"  "its" "may"  
# "not"   "that"  "the"   "their" "they"  "this"  "which" "will"  "with" "would"

c2 <- tm_map(corpus, removeWords, stopwords("english"))
m2 <- DocumentTermMatrix(c2)
findFreqTerms(m2, 400) # words appearing at least 400 times
# [1] "can" "government" "may" "must" "one" "power" "state" "the" "will"
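findFreqTerms only returns the terms above a count threshold; to get an explicit top ten out of the document-term matrix, something along these lines should work (a sketch, assuming m2 is small enough to convert to a dense matrix):

term_counts <- sort(colSums(as.matrix(m2)), decreasing = TRUE)  # total count per term
head(term_counts, 10)                                           # the ten most frequent terms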