Extracting hashtags from Twitter - R error

Time: 2015-05-23 11:04:28

Tags: r twitter

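I have Twitter data. Using library(stringr) I have already extracted all the weblinks. However, when I try to do the same for hashtags, I get an error. The same code worked a few days ago. Here is the code: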

 library(stringr)
 hash <- "#[a-zA-Z0-9]{1, }"
 hashtag <- str_extract_all(travel$texts, hash)

Here is the error:

 Error in stri_extract_all_regex(string, pattern, simplify = simplify,  : 
   Error in {min,max} interval. (U_REGEX_BAD_INTERVAL)

I have reinstalled the stringr package, but that did not help.

The code I used for the weblinks is:

 pat1 <- "http://t.co/[a-zA-Z0-9]{1,}"
 twitlink <- str_extract_all(travel$texts, pat1)

A reproducible example follows:

 rtt <- structure(data.frame(texts = c("Review Anthem of the Seas Anthems      maiden voyage httptcoLPihj2sNEP #stevenewman", "#Job #Canada #Marlin Travel Agentagente de voyages Full Time in #St Catharines ON httptconMHNlDqv69", "Experience #Fiji amp #NewZealand like never before on a great 10night voyage 4033 pp departing Vancouver httptcolMvChSpaBT"), source = c("Twitter Web Client", "Catch a Job Canada", "Hootsuite"), tweet_time = c("2015-05-07 19:32:58", "2015-05-07 19:37:03", "2015-05-07 20:45:36"))) 

2 Answers:

Answer 0: (score: 1)

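Your problem comes from the whitespace in hash. The regex engine behind stringr rejects a space inside the {min,max} interval, which is what U_REGEX_BAD_INTERVAL reports.

 # Not working: note the whitespace after the comma in {1, }
 str_extract_all(rtt$texts, "#[a-zA-Z0-9]{1, }")
 # Working: no whitespace inside the interval
 str_extract_all(rtt$texts, "#[a-zA-Z0-9]{1,}")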
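If you want to avoid the interval syntax altogether, the "+" quantifier means the same thing as "{1,}" and leaves no room for a stray space. Below is a minimal sketch using only base R, so it does not depend on the installed stringr version. The texts vector is a shortened copy of the tweets from the question, and the names texts, hash_pattern and hashtags are made up for this illustration.

```r
# Shortened sample tweets based on the question's reproducible example
# (illustrative data, not the full original strings).
texts <- c(
  "maiden voyage httptcoLPihj2sNEP #stevenewman",
  "#Job #Canada #Marlin Travel Agent in #St Catharines ON",
  "Experience #Fiji amp #NewZealand like never before"
)

# "+" means "one or more", equivalent to "{1,}" but with no interval to mistype.
hash_pattern <- "#[a-zA-Z0-9]+"

# gregexpr() locates every match in each string;
# regmatches() pulls out the matched text, giving one character vector per tweet.
hashtags <- regmatches(texts, gregexpr(hash_pattern, texts))
print(hashtags)
```

The same pattern should also work as the second argument to str_extract_all(), since "+" is standard regex syntax.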

Answer 1: (score: 0)

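You may want to consider the qdapRegex package, which I maintain, for this task. It makes extracting URLs and hashtags easy. qdapRegex is a package containing a collection of canned regular expressions, and it uses the excellent stringi package as its backend for the regex work.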

rtt <- structure(data.frame(texts = c("Review Anthem of the Seas Anthems      maiden voyage httptcoLPihj2sNEP #stevenewman", "#Job #Canada #Marlin Travel Agentagente de voyages Full Time in #St Catharines ON httptconMHNlDqv69", "Experience #Fiji amp #NewZealand like never before on a great 10night voyage 4033 pp departing Vancouver httptcolMvChSpaBT"), source = c("Twitter Web Client", "Catch a Job Canada", "Hootsuite"), tweet_time = c("2015-05-07 19:32:58", "2015-05-07 19:37:03", "2015-05-07 20:45:36")))

library(qdapRegex)
## first combine the built in url + twitter regexes into a function
rm_twitter_n_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"), extract=TRUE)
rm_twitter_n_url(rtt$texts)

rm_hash(rtt$texts, extract=TRUE)

This gives the following output:

## > rm_twitter_n_url(rtt$texts)
## [[1]]
## [1] "httptcoLPihj2sNEP"
## 
## [[2]]
## [1] "httptconMHNlDqv69"
## 
## [[3]]
## [1] "httptcolMvChSpaBT"


## > rm_hash(rtt$texts, extract=TRUE)
## [[1]]
## [1] "#stevenewman"
## 
## [[2]]
## [1] "#Job"    "#Canada" "#Marlin" "#St"    
## 
## [[3]]
## [1] "#Fiji"       "#NewZealand"