Extract the last two word strings, separated by the last comma

Time: 2018-04-20 03:30:15

Tags: r regex

Consider the data frame df, which contains the column location:

df <- structure(
  list(location = c("International Society for Horticultural Science (ISHS), Leuven, Belgium",
                    "International Society for Horticultural Science (ISHS), Leuven, Belgium",
                    "White House, Jodhpur, India", "Crop Science Society of the Philippines, College, Philippines",
                    "Crop Science Society of the Philippines, College, Philippines",
                    "Institute of Forest Science, Kangwon National University, Kangwon, Korea Republic")), 
  .Names = "location", 
  row.names = c(NA, -6L), 
  class = "data.frame")

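I am trying to extract address from location. address should contain the last two word strings, that is, the ones separated by the last comma. How can I do this? I have been trying to learn regular expressions, but my knowledge is not up to scratch. This is what I tried: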

library(tidyverse)
df %>% mutate(address = str_extract(location, "[:alpha:]+$")) %>% select(address)

This outputs:

#       address
# 1     Belgium
# 2     Belgium
# 3       India
# 4 Philippines
# 5 Philippines
# 6    Republic

This is the output I want:

#                   address
# 1         Leuven, Belgium
# 2         Leuven, Belgium
# 3          Jodhpur, India
# 4    College, Philippines
# 5    College, Philippines
# 6 Kangwon, Korea Republic

3 answers:

Answer 0: (score: 1)

This might work.

Answer 1: (score: 1)

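Like you, my knowledge of regular expressions is not up to scratch either. So, after trying to work out different ways to do this with a regex, I gave up and used the traditional approach.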

sapply(strsplit(df$location, ","), function(x) paste0(tail(x, 2), collapse = ","))

#[1] " Leuven, Belgium"         " Leuven, Belgium"        
#[3] " Jodhpur, India"          " College, Philippines"   
#[5] " College, Philippines"    " Kangwon, Korea Republic"

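We split location on "," and use tail to select the last two pieces, then paste to join them again with "," and get the desired output. Note that each result keeps the space that followed the comma in the original string, so it starts with a leading space.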
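If the leading space is unwanted, the same split-and-rejoin idea can be sketched with trimws applied to each piece before rejoining. This is only an illustration in base R: the helper name last_two is made up here, and the two input strings are copied from the question's data.

```r
# Two rows copied from the question's data
location <- c("White House, Jodhpur, India",
              "Institute of Forest Science, Kangwon National University, Kangwon, Korea Republic")

# last_two() is a hypothetical helper name, not part of any package.
# It splits on commas, trims whitespace from each piece,
# keeps the last two pieces, and rejoins them with ", ".
last_two <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)
  vapply(parts,
         function(p) paste(utils::tail(trimws(p), 2), collapse = ", "),
         character(1))
}

last_two(location)
# [1] "Jodhpur, India"          "Kangwon, Korea Republic"
```

Using vapply with character(1) rather than sapply guarantees a character vector back, which is convenient if the result is assigned to a data frame column.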

I finally found the time to get a regex working.

library(stringi)
stri_extract(df$location, regex = "[^,]+,[^,]+$")

#[1] " Leuven, Belgium"         " Leuven, Belgium"        
#[3] " Jodhpur, India"          " College, Philippines"   
#[5] " College, Philippines"    " Kangwon, Korea Republic"
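The pattern reads as: a run of non-comma characters, one comma, another run of non-comma characters, then the end of the string. A space is itself a non-comma character, which is why the leading space survives in the output above. For anyone without stringi, roughly the same extraction can be sketched in base R with regexpr and regmatches, with trimws added here as an assumption about the desired result:

```r
# Two rows copied from the question's data
location <- c("White House, Jodhpur, India",
              "Crop Science Society of the Philippines, College, Philippines")

# regexpr() finds the first match of the pattern in each string;
# regmatches() extracts the matched text.
# Caveat: regmatches() silently drops elements that have no match,
# so this sketch assumes every string contains at least one comma.
m <- regexpr("[^,]+,[^,]+$", location)
address <- trimws(regmatches(location, m))

address
# [1] "Jodhpur, India"       "College, Philippines"
```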

Answer 2: (score: 1)

This works:

df %>%
  mutate(address = str_extract(location, "([[:alpha:]]+ ?)+, ([[:alpha:]]+ ?)+$"))

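The pattern [[:alpha:]]+ ? matches a run of letters optionally followed by a single space. Wrapping it in parentheses and appending + matches one or more repetitions of that whole pattern, that is, one or more words.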
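Because [[:alpha:]] matches letters only, this pattern relies on the last two pieces containing nothing but letters and single spaces. The sketch below tries the same pattern in base R (regexpr with perl = TRUE rather than str_extract, which is a substitution made here to avoid extra packages). The first string is from the question. The second is a made-up example, not part of the original data, included to show what happens when a piece contains punctuation.

```r
# First string is from the question; the second is a hypothetical example
x <- c("White House, Jodhpur, India",
       "Some Org, St. John's, Canada")

pattern <- "([[:alpha:]]+ ?)+, ([[:alpha:]]+ ?)+$"

# Extract the first (leftmost) match of the pattern from each string
m <- regexpr(pattern, x, perl = TRUE)
result <- regmatches(x, m)

result
# [1] "Jodhpur, India" "s, Canada"
```

The period and apostrophe in the second string interrupt the run of letters, so the match can only start at the final "s". For data like that, a pattern built on [^,] is the more forgiving choice.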
