Question

My data frame looks like this. It is a sample that follows a mostly uniform pattern, but the full data set is not very consistent:

locationid      address     
1073744023  525 East 68th Street, New York, NY      10065, USA
1073744022  270 Park Avenue, New York, NY 10017, USA      
1073744025  Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA 
1073744024  1251 Avenue of the Americas, New York, NY 10020, USA
1073744021  1301 Avenue of the Americas, New York, NY 10019, USA 
1073744026  44 West 45th Street, New York, NY 10036, USA

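I need to extract the city and country names from the address. I tried the following:

1) strsplit: this gave me a list, but I could not work out how to access the last or third element of each entry.

2) Regular expressions: finding the country is easy

str_sub(str_extract(address, "\\d{5},\\s.*"),8,11)

but for the city

str_sub(str_extract(address, ",\\s.+,\\s.+\\d{5}"),3,comma_pos)

I cannot compute comma_pos, because that leads me back to the same problem. I believe there is a more efficient way to solve this with either of the approaches above.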

Answer 1

Try this code:

library(gsubfn)

cn <- c("Id", "Address", "City", "State", "Zip", "Country")

pat <- "(\\d+) (.+), (.+), (..) (\\d+), (.+)"
read.pattern(text = Lines, pattern = pat, col.names = cn, as.is = TRUE)

This gives the following data.frame, from which the components are easy to pick out:

          Id                                  Address     City State   Zip Country
1 1073744023                     525 East 68th Street New York    NY 10065     USA
2 1073744022                          270 Park Avenue New York    NY 10017     USA
3 1073744025 Rockefeller Center, 50 Rockefeller Plaza New York    NY 10020     USA
4 1073744024              1251 Avenue of the Americas New York    NY 10020     USA
5 1073744021              1301 Avenue of the Americas New York    NY 10019     USA
6 1073744026                      44 West 45th Street New York    NY 10036     USA

Explanation: it uses this pattern (inside quotes, the backslashes must be doubled):

(\d+) (.+), (.+), (..) (\d+), (.+)

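The original answer visualized this pattern with a Debuggex railroad diagram (image and demo link not reproduced here). In words, the pattern reads:

"(\\d+)" - one or more digits (the Id), followed by
" " - a space, followed by
"(.+)" - any non-empty string (the Address), followed by
", " - a comma and a space, followed by
"(.+)" - any non-empty string (the City), followed by
", " - a comma and a space, followed by
"(..)" - two characters (the State), followed by
" " - a space, followed by
"(\\d+)" - one or more digits (the Zip), followed by
", " - a comma and a space, followed by
"(.+)" - any non-empty string (the Country)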

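It works because regular expression quantifiers are greedy: each (.+) first tries to match the longest possible string, and backtracks to a shorter one whenever the rest of the pattern fails to match. That is why an address that itself contains a comma still ends up whole in the Address column.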
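To see the greedy behaviour in isolation, here is a minimal base R sketch (not part of the original answer) that applies the same pattern to the Rockefeller Center line. It assumes R 3.3.0 or later, where regexec accepts perl = TRUE. The lazy variant pat_lazy is a hypothetical pattern added only for contrast.

```r
x <- "1073744025 Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA"

# Same pattern as in the answer, anchored at both ends
pat_greedy <- "^(\\d+) (.+), (.+), (..) (\\d+), (.+)$"
# Hypothetical contrast: the Address group made lazy with +?
pat_lazy <- "^(\\d+) (.+?), (.+), (..) (\\d+), (.+)$"

# regmatches() on a regexec() result returns the full match followed by the groups
greedy <- regmatches(x, regexec(pat_greedy, x, perl = TRUE))[[1]]
lazy <- regmatches(x, regexec(pat_lazy, x, perl = TRUE))[[1]]

greedy[3:4]  # Address and City with the greedy Address group
lazy[3:4]    # Address and City with the lazy Address group
```

With the greedy group the Address keeps "Rockefeller Center, 50 Rockefeller Plaza" and the City is "New York". With the lazy group the Address stops at "Rockefeller Center" and the rest spills into the City.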

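The advantage of this approach is that the regular expression is simple and direct, and the whole code is concise, since a single read.pattern statement does all of the work.

Note: we used this for Lines: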

Lines <- "1073744023 525 East 68th Street, New York, NY 10065, USA
1073744022 270 Park Avenue, New York, NY 10017, USA
1073744025 Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA
1073744024 1251 Avenue of the Americas, New York, NY 10020, USA
1073744021 1301 Avenue of the Americas, New York, NY 10019, USA
1073744026 44 West 45th Street, New York, NY 10036, USA"

Answer 2

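Split the data (here data is the character vector of addresses)

ss <- strsplit(data, ",")

then

n <- sapply(ss, length)

gives the number of elements in each entry (so you can work backwards from the end). Then

mapply("[[", ss, n)

gives you the last element. Alternatively,

sapply(ss, tail, 1)

also gets the last element.

To get the second-to-last element (or, more generally, to count from the end) you need

sapply(ss, function(x) tail(x, 2)[1])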
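Putting these pieces together for the question's data, a minimal sketch might look like the following. It assumes every address ends with "city, state zip, country", so the country is the last piece and the city is the third from the end. The two-element address vector is just a subset of the sample data, and trimws requires R 3.2.0 or later.

```r
address <- c(
  "525 East 68th Street, New York, NY      10065, USA",
  "Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA"
)

# Split on a comma plus any following whitespace, after trimming stray padding
parts <- strsplit(trimws(address), ",\\s*")

# Country is the last piece; city is the third piece counted from the end
country <- vapply(parts, function(p) tail(p, 1), character(1))
city <- vapply(parts, function(p) tail(p, 3)[1], character(1))

data.frame(city = city, country = country)
```

Counting from the end is what makes this robust to addresses with an extra comma at the front, such as the Rockefeller Center line.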

Answer 3

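Here is an approach using the tidyr package. Personally, I would just use tidyr's extract to split the whole thing into its various elements. This uses a regular expression, but in a different way than you asked for.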

library(tidyr)

extract(x, address, c("address", "city", "state", "zip", "country"), 
    "([^,]+),\\s([^,]+),\\s+([A-Z]+)\\s+(\\d+),\\s+([A-Z]+)")

##   locationid                       address     city state   zip country
## 1 1073744023          525 East 68th Street New York    NY 10065     USA
## 2 1073744022               270 Park Avenue New York    NY 10017     USA
## 3 1073744025          50 Rockefeller Plaza New York    NY 10020     USA
## 4 1073744024   1251 Avenue of the Americas New York    NY 10020     USA
## 5 1073744021   1301 Avenue of the Americas New York    NY 10019     USA
## 6 1073744026           44 West 45th Street New York    NY 10036     USA

The original answer also included a visual explanation of the regular expression, generated at http://www.regexper.com/ (image not reproduced here).

Answer 4

I think you want something like this.

> x <- "1073744026 44 West 45th Street, New York, NY 10036, USA"
> regmatches(x, gregexpr('^[^,]+, *\\K[^,]+', x, perl=T))[[1]]
[1] "New York"
> regmatches(x, gregexpr('^[^,]+, *[^,]+, *[^,]+, *\\K[^\n,]+', x, perl=T))[[1]]
[1] "USA"

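Regular expression explanation:

^ asserts that we are at the start of the string.
[^,]+ matches any character except , one or more times. If your data frame contains empty fields, change it to [^,]*.
, matches a literal ,
<space>* matches zero or more spaces.
\K discards the characters matched so far from the reported match. Only the characters matched by the pattern after \K appear in the output.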
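The patterns above count comma-separated fields from the start, so an address with an extra comma (the Rockefeller Center line) would return the wrong field. As a hedged variation on the same \K idea, the sketch below keys on the "state zip" part instead. The patterns city_pat and country_pat are my own assumptions, not part of the original answer, and they expect a two-letter state followed by a five-digit ZIP.

```r
address <- trimws(c(
  "525 East 68th Street, New York, NY      10065, USA",
  "Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA",
  "44 West 45th Street, New York, NY 10036, USA"
))

# City: the field right before ", ST 12345"; \K drops the leading comma and spaces
city_pat <- ",\\s*\\K[^,]+(?=,\\s*[A-Z]{2}\\s+\\d{5})"
# Country: the final field, right after the five-digit ZIP and its comma
country_pat <- "\\d{5},\\s*\\K[^,]+$"

city <- regmatches(address, regexpr(city_pat, address, perl = TRUE))
country <- regmatches(address, regexpr(country_pat, address, perl = TRUE))
```

Note that regmatches silently drops elements that do not match, so on messier data it is worth checking that length(city) equals length(address) before binding the result back to the data frame.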

Answer 5

How about this pattern:

,\s(?<city>[^,]+?),\s(?<shortCity>[^,]+?)(?i:\d{5},)(?<country>\s.*)

This pattern matches with these three groups:

"group": "city", "value": "New York"
"group": "shortCity", "value": "NY"
"group": "country", "value": "USA"
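The answer does not say how to run this pattern in R, so here is one possible sketch using base R with perl = TRUE (named groups need the PCRE engine, and regexec accepts perl = TRUE from R 3.3.0 onwards). The groups are read by position, and trimws is applied because shortCity picks up the spaces before the ZIP and country picks up the space after the comma.

```r
x <- "1073744025 Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA"

# The pattern from the answer, with backslashes doubled for an R string
pat <- ",\\s(?<city>[^,]+?),\\s(?<shortCity>[^,]+?)(?i:\\d{5},)(?<country>\\s.*)"

# Element 1 is the full match; elements 2 to 4 are the three capture groups
m <- regmatches(x, regexec(pat, x, perl = TRUE))[[1]]

city <- m[2]
short_city <- trimws(m[3])  # raw value has trailing space(s) before the ZIP
country <- trimws(m[4])     # raw value has a leading space
```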

Answer 6

Constructing the regular expression with rex may make this type of task simpler.

x <- data.frame(
  locationid = c(
    1073744023,
    1073744022,
    1073744025,
    1073744024,
    1073744021,
    1073744026
    ),
  address = c(
    '525 East 68th Street, New York, NY      10065, USA',
    '270 Park Avenue, New York, NY 10017, USA',
    'Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA',
    '1251 Avenue of the Americas, New York, NY 10020, USA',
    '1301 Avenue of the Americas, New York, NY 10019, USA',
    '44 West 45th Street, New York, NY 10036, USA'
    ))

library(rex)

sep <- rex(",", spaces)

re <-
  rex(
    capture(name = "address",
      except_some_of(",")
    ),
    sep,
    capture(name = "city",
      except_some_of(",")
    ),
    sep,
    capture(name = "state",
      uppers
    ),
    spaces,
    capture(name = "zip",
      some_of(digit, "-")
    ),
    sep,
    capture(name = "country",
      something
    ))

re_matches(x$address, re)
#>                      address     city state   zip country
#>1        525 East 68th Street New York    NY 10065     USA
#>2             270 Park Avenue New York    NY 10017     USA
#>3        50 Rockefeller Plaza New York    NY 10020     USA
#>4 1251 Avenue of the Americas New York    NY 10020     USA
#>5 1301 Avenue of the Americas New York    NY 10019     USA
#>6         44 West 45th Street New York    NY 10036     USA

This regular expression also handles 9-digit ZIP codes (12345-1234) and countries other than the USA.
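For readers without rex installed, the same idea can be approximated with a hand-written base R pattern. This is my own rough equivalent, not the exact expression rex generates, and the test address below is made up purely to exercise a ZIP+4 code and a non-US country.

```r
# Hypothetical address with a 9-digit ZIP and a country other than the USA
x <- "10 Example Road, Sampletown, ZZ 12345-6789, Canada"

# Hand-written approximation: address, city, state, zip (digits or hyphens), country
pat <- "([^,]+),\\s+([^,]+),\\s+([A-Z]+)\\s+([0-9-]+),\\s+(.+)"

# Drop the full match (element 1) and name the remaining capture groups
m <- regmatches(x, regexec(pat, x, perl = TRUE))[[1]]
fields <- setNames(m[-1], c("address", "city", "state", "zip", "country"))
fields
```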
