Regex to split city, state

Date: 2014-08-14 03:17:29

Tags: regex r

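I have a list of city, state data in a data frame. I only need to extract the state abbreviation and store it in a new column named state. From visual inspection, the state is always the last two characters of the string, and both are uppercase. The city, state data looks like this: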

test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")

I tried the following:

pattern <- "[, (A-Z){2}]"
strsplit(test, pattern)

The output is:

[[1]]
[1] "Anchorage, "

[[2]]
[1] "New York City, "

[[3]]
[1] "Some Place, Another Place, "

EDIT: I used another regular expression:

pattern2 <- "([a-z, ])"
sp <- strsplit(test, pattern2)

I got these results:

[[1]]
 [1] "A"  ""   ""   ""   ""   ""   ""   ""   ""   ""   "AK"

[[2]]
 [1] "N"  ""   ""   "Y"  ""   ""   ""   "C"  ""   ""   ""   ""   "NY"

[[3]]
 [1] "S"  ""   ""   ""   "P"  ""   ""   ""   ""   ""   "A"  ""   ""   ""   ""   ""   ""  
[18] "P"  ""   ""   ""   ""   ""   "LA"

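So the abbreviation is there, but when I try to extract it with sapply(), I don't know how to get the last element of each list entry. I know how to get the first: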

sapply(sp, "[[", 1)

6 answers:

Answer 0 (score: 4)

I'm not sure you really need a regular expression here. If you always just want the last two characters of the string, use

substring(test, nchar(test)-1, nchar(test))
[1] "AK" "NY" "LA"

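If you really insist on a regular expression, at least consider using regexec rather than strsplit, since you are not interested in splitting; you only want to extract the state.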

m <- regexec("[A-Z]+$", test)
unlist(regmatches(test,m))
# [1] "AK" "NY" "LA"
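
Since the question asks for a new column named state, here is a minimal sketch of how the substring approach might be applied to a data frame. The data frame `df` and its column name `location` are assumptions made for illustration; they do not appear in the question.

```r
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")

# Hypothetical data frame; the column name `location` is an assumption.
df <- data.frame(location = test, stringsAsFactors = FALSE)

# Take the last two characters of each string.
n <- nchar(df$location)
df$state <- substring(df$location, n - 1, n)

# Optional sanity check: blank out rows that do not end in two capital letters.
bad <- !grepl("[A-Z]{2}$", df$location)
df$state[bad] <- NA_character_
```

The sanity check is optional, but it guards against rows where the "last two characters" assumption from the visual inspection does not hold.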

Answer 1 (score: 1)

This works:

regmatches(test, gregexpr("(?<=[,][\\s+])([A-Z]{2})", test, perl = TRUE))

## [[1]]
## [1] "AK"
## 
## [[2]]
## [1] "NY"
## 
## [[3]]
## [1] "LA"

Explanation courtesy of: http://liveforfaith.com/re/explain.pl

(?<=                     look behind to see if there is:
  [,]                      any character of: ','
  [\\s+]                    any character of: whitespace (\n, \r,
                           \t, \f, and " "), '+'
)                        end of look-behind
(                        group and capture to \1:
  [A-Z]{2}                 any character of: 'A' to 'Z' (2 times)
)                        end of \1
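
The gregexpr call returns a list because it collects every match in each string, and the unanchored pattern would also pick up two capitals after an earlier comma (for example the first two letters of an all-caps place name). As a sketch, assuming each string ends with exactly one state abbreviation, the same lookbehind idea can be anchored to the end of the string and used with regexpr to get a plain character vector:

```r
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")

# Lookbehind for ", " followed by two capitals at the very end of the string.
m <- regexpr("(?<=, )[A-Z]{2}$", test, perl = TRUE)

# regmatches() with a regexpr() result returns a character vector.
state <- regmatches(test, m)
```

Note that regmatches silently drops elements with no match, so the result can be shorter than the input if some strings do not follow the pattern.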

Answer 2 (score: 1)

Try:

tt = strsplit(test, ', ')

tt
[[1]]
[1] "Anchorage" "AK"      

[[2]]
[1] "New York City" "NY"          

[[3]]
[1] "Some Place"     "Another Place" "LA"           


z = list()

for(i in tt) z[length(z)+1] = i[length(i)]


z
[[1]]
[1] "AK"

[[2]]
[1] "NY"

[[3]]
[1] "LA"
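
The loop above picks the last element of each split. The same idea can be sketched without a loop, which also addresses the question about getting the last element of each list entry. This is one possible way to write it, not the only one; `tail(p, 1)` would work in place of `p[length(p)]`.

```r
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")

# Split on the literal separator ", " (fixed = TRUE avoids regex interpretation).
parts <- strsplit(test, ", ", fixed = TRUE)

# Take the last element of each piece; vapply enforces one string per element.
state <- vapply(parts, function(p) p[length(p)], character(1))
```

Unlike the loop, this yields a character vector rather than a list, so it can be assigned directly to a data frame column.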

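Answer 3 (score: 0)

I think you have the meanings of '[]' and '()' reversed. '()' groups a sequence of characters; '[]' matches any single character in the class. What you need is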

"(, [A-Z]{2})".
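
To illustrate the difference between brackets and parentheses, here is a small sketch. Inside '[]', parentheses, braces and digits are ordinary characters, so the pattern from the question is a single character class. The second part shows one way a real capture group could be used with sub to keep only the state; the exact pattern is an assumption about the intent, not taken from the answer.

```r
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")

# Each of these single characters is a member of the bracketed class,
# because "(", "{" and "2" are literal inside [...].
in_class <- grepl("[, (A-Z){2}]", c(",", "(", "{", "2"))

# With a capture group: keep only the two capitals after the last ", ".
state <- sub("^.*, ([A-Z]{2})$", "\\1", test)
```

Because `.*` is greedy, the group captures the abbreviation after the last comma, which matters for entries such as "Some Place, Another Place, LA".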

Answer 4 (score: 0)

 library(stringr)
 str_extract(test, perl('[A-Z]+(?=\\b$)'))
 #[1] "AK" "NY" "LA"

Answer 5 (score: 0)

Here is a regex for the same:

Regex

(?'state'\w{2})(?=")

Test string

"Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA"

Result

  • Match 1
    • state [12-14] AK
  • Match 2
    • state [33-35] NY
  • Match 3
    • state [66-68] LA


If needed, you can remove the named capture to make it shorter

For example

(\w{2})(?=")
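
The lookahead for a closing quote works on the raw test string shown above, but the elements of an R character vector do not contain the surrounding quotes. A minimal sketch of how the pattern might be adapted in R, assuming the state is always at the end of each string, replaces the lookahead with the end-of-string anchor:

```r
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")

# (?=") is replaced by $, since the quotes are not part of the R strings.
m <- regexpr("\\w{2}$", test, perl = TRUE)
state <- regmatches(test, m)
```

Keep in mind that `\w` also matches lowercase letters, digits and underscores; `[A-Z]{2}$` is stricter if only capital letters should qualify.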