Extracting elements from a string

Time: 2014-10-27 23:29:16

Tags: r stringr

Suppose I have the following data sets, whose column names are structured as shown below.

df1 = data.frame(Date=c(rnorm(5)),  
                 "United States) New York (NY" = c(rnorm(5)), 
                 "United States) Chicago (Illinois" = c(rnorm(5)),
                 "United States) Denver (Colorado" = c(rnorm(5)),
                 "United States) Seattle (Washington" = c(rnorm(5)),
                 "United States) Minneapolis (Minnesota" = c(rnorm(5)), check.names=FALSE)
df1

df2 = data.frame(Date=c(rnorm(5)),
                 "New York (New York, United States)" = c(rnorm(5)),
                 "Phoenix (Arizona, United States)" = c(rnorm(5)),
                 "Chicago (Illinois, United States)" = c(rnorm(5)),
                 "Los Angeles (California, United States)" = c(rnorm(5)), check.names=FALSE)
df2

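As you can see, each column represents a city, but the structure of the column names is unmanageable. I was wondering if anyone could help me figure out how to extract the city name from each column-name string.

I could build a dictionary of every city and do string matching, but I am not that lucky. I also assume there is a way to do this with str_split, but I have not figured it out yet.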

sapply(str_split(names(df1),")"), 2)

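Of course, I am sure there is also a gsub solution, but I am not very comfortable with regular expressions.

Ultimately, I just want the actual city names as the column names.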

New York, Chicago, Denver, Seattle, Minneapolis

1 answer:

Answer 0 (score: 3)

You can use gsub. For the first data frame, try

gsub(".*[)] (.*) [(].*", "\\1", names(df1)[-1])
# [1] "New York"    "Chicago"     "Denver"      "Seattle"     "Minneapolis"
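
The split-based idea from the question can also be made to work for this first format. The following is only a sketch in base R, not part of the original answer: it uses strsplit with fixed = TRUE so that ")" is treated as a literal character rather than a regular expression, and the variable names (nms1, parts, after_paren, cities1) are made up for illustration. The names are copied from df1 so the block runs on its own.

```r
# Column names in the first format, copied from df1 in the question
nms1 <- c("United States) New York (NY",
          "United States) Chicago (Illinois",
          "United States) Denver (Colorado",
          "United States) Seattle (Washington",
          "United States) Minneapolis (Minnesota")

# Split on a literal ")" (fixed = TRUE means no regex interpretation)
parts <- strsplit(nms1, ")", fixed = TRUE)

# Keep the second piece of each split, e.g. " New York (NY"
after_paren <- vapply(parts, `[`, character(1), 2)

# Drop everything from " (" onward, then trim the leading space
cities1 <- trimws(sub(" [(].*$", "", after_paren))
cities1
```

With stringr, wrapping the pattern as fixed(")") in str_split should play the same role as fixed = TRUE here.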

For the second data frame, a slight tweak of the first regular expression will work

gsub("(.*) [(].*", "\\1", names(df2)[-1])
# [1] "New York"    "Phoenix"     "Chicago"     "Los Angeles"

To handle both sets of names with a single expression:

nms <- c(names(df1)[-1], names(df2)[-1])
gsub("(.*[)] |)(.*) [(].*", "\\2", nms)
# [1] "New York"    "Chicago"     "Denver"      "Seattle"     "Minneapolis"
# [6] "New York"    "Phoenix"     "Chicago"     "Los Angeles"
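
Since the stated goal is to end up with the city names as the actual column names, here is a minimal sketch of writing the result back. It assumes the first column (Date) should be left untouched, and it wraps the combined pattern in a helper called city_name, which is a hypothetical name introduced here for illustration. The data frames are cut-down versions of df1 and df2 from the question.

```r
# Hypothetical helper wrapping the combined pattern shown above
city_name <- function(x) gsub("(.*[)] |)(.*) [(].*", "\\2", x)

# Cut-down versions of the question's data frames
df1 <- data.frame(Date = rnorm(5),
                  "United States) New York (NY" = rnorm(5),
                  "United States) Chicago (Illinois" = rnorm(5),
                  check.names = FALSE)
df2 <- data.frame(Date = rnorm(5),
                  "Phoenix (Arizona, United States)" = rnorm(5),
                  "Los Angeles (California, United States)" = rnorm(5),
                  check.names = FALSE)

# Rename every column except the first (Date)
names(df1)[-1] <- city_name(names(df1)[-1])
names(df2)[-1] <- city_name(names(df2)[-1])

names(df1)
names(df2)
```

Names that contain no " (" at all are returned unchanged by gsub, so the helper should be harmless on columns that are already clean.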