Question

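I want to use a regular expression to extract all URLs from the text in a data frame into a new column. I have some older code that I used to extract keywords, so I would like to adapt it to work with a regular expression. I want to save the regular expression as a string variable and apply it here: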

data$ContentURL <- apply(sapply(regex, grepl, data$Content, fixed=FALSE), 1, function(x) paste(selection[x], collapse=','))

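It seems that fixed=FALSE should tell grepl that the pattern is a regular expression, but R does not like how I am trying to save the regular expression as: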

regex <- "http.*?1-\\d+,\\d+"

My data is organized in a data frame like this:

data <- read.table(text='"Content"     "date"   
 1     "a house a home https://www.foo.com"     "12/31/2013"
 2     "cabin ideas https://www.example.com in the woods"     "5/4/2013"
 3     "motel is a hotel"   "1/4/2013"', header=TRUE)

I would like the result to look like this:

                                           Content       date              ContentURL
1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
3                                 motel is a hotel   1/4/2013

Answer 1

A Hadleyverse solution (the stringr package) with a decent URL pattern:

library(stringr)

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

data$ContentURL <- str_extract(data$Content, url_pattern)

data

##                                            Content       date              ContentURL
## 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
## 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
## 3                                 motel is a hotel   1/4/2013                    <NA>

If Content contains more than one URL per row, you can use str_extract_all instead, but that will require some extra processing on your end afterwards.
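As a minimal sketch of that extra processing, assuming a comma-separated string per row is the desired result: str_extract_all returns a list with one character vector per input string, and each vector can be collapsed with paste. The sample vector `content` and the name `content_url` are made up for illustration; the pattern is the one from this answer.

```r
library(stringr)

# Same pattern as above, repeated so the sketch is self-contained
url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

# Hypothetical sample data: the first string contains two URLs
content <- c("see https://www.foo.com and https://www.example.com",
             "motel is a hotel")

# str_extract_all() returns a list with one character vector per input string
matches <- str_extract_all(content, url_pattern)

# Collapse each vector into a single comma-separated string
content_url <- vapply(matches, paste, character(1), collapse = ",")
content_url
```

Note that a row without any match ends up as an empty string here rather than NA, because pasting an empty vector with collapse yields "".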

Answer 2

Here is one approach using the qdapRegex library:

library(qdapRegex)
data[["url"]] <- unlist(rm_url(data[["Content"]], extract=TRUE))
data

##                                            Content       date                     url
## 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
## 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
## 3                                 motel is a hotel   1/4/2013                    <NA>

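To see the regular expression the function uses (qdapRegex aims to help people analyze and learn regular expressions), you can use the grab function with the function name prefixed with @: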

grab("@rm_url")

## [1] "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

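grepl gives you a logical output: yes, this string contains a match, or no, it does not. grep gives you the indexes or the values, but the values are the whole strings, not the substrings you want.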

So, to pass this regular expression along to base R or to the stringi package (qdapRegex wraps stringi for extraction), you could do the following:

regmatches(data[["Content"]], gregexpr(grab("@rm_url"), data[["Content"]], perl = TRUE))

library(stringi)
stri_extract(data[["Content"]], regex=grab("@rm_url"))

I am sure there is a stringr approach too, but I am not familiar with that package.
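To make the difference between grepl, grep and the base R extraction above concrete, here is a minimal sketch that uses base R only. The vector `content` mirrors the question's Content column, and the simplified pattern `http[^ ]*` is an assumption for illustration, not the full pattern returned by grab.

```r
# Hypothetical sample mirroring the question's Content column
content <- c("a house a home https://www.foo.com",
             "cabin ideas https://www.example.com in the woods",
             "motel is a hotel")

# Simplified pattern assumed for illustration: "http" followed by non-spaces
pattern <- "http[^ ]*"

grepl(pattern, content)               # logical: does each string contain a match?
grep(pattern, content)                # indexes of the strings that match
grep(pattern, content, value = TRUE)  # the whole matching strings, not the URLs

# gregexpr() locates every match; regmatches() pulls out the matched substrings
found <- regmatches(content, gregexpr(pattern, content, perl = TRUE))

# One comma-separated string per row, NA where nothing matched
content_url <- vapply(
  found,
  function(x) if (length(x) == 0) NA_character_ else paste(x, collapse = ","),
  character(1)
)
content_url
```

Because regmatches returns a list with one element per row, the final step is needed before the result can be assigned to a data frame column.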

Answer 3

Split on spaces, then find "http":

data$ContentURL <- unlist(sapply(strsplit(as.character(data$Content), split = " "),
                                 function(i){
                                   x <- i[ grepl("http", i)]
                                   if(length(x) == 0) x <- NA
                                   x
                                 }))


data
#                                            Content       date              ContentURL
# 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
# 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
# 3                                 motel is a hotel   1/4/2013                    <NA>

Answer 4

You can use the unglue package:

library(unglue)
unglue_unnest(data,Content, "{=.*?}{url=http[^ ]*}{=.*?}",remove = FALSE)
#>                                            Content       date                       url
#> 1               a house a home https://www.f00.com 12/31/2013 1     https://www.f00.com
#> 2 cabin ideas https://www.example.com in the woods   5/4/2013 2 https://www.example.com
#> 3                                 motel is a hotel   1/4/2013 3                    <NA>

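{=.*?} matches anything and is not assigned to an extracted column, which is why the lhs of = is empty.
{url=http[^ ]*} matches anything that starts with http and is followed by non-space characters; since the lhs is url, the match is extracted into the url column.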

PS: due to SO limitations, I manually changed foo to f00 in the answer.

Extract URLs with regex into a new data frame column
