Grouping states and cities into countries in R

Time: 2015-07-30 23:46:44

Tags: r

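In R, I ran some code that gives me a two-column data frame: one column with cities, states, and countries, and one with the corresponding counts.

I ran summary() on the column and converted the result to a data frame.

I am struggling to merge all the states into a single country. For example, in the output below, I want to combine all US states and cities into one country, "United States". Can I use grep() to find patterns and then use some package to group them together? Please advise how to do this.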

location<-summary(pind$userLocation)
location<-as.data.frame(location)
location

Data:

                     location
                       271286
null                    58145
Texas                    1027
United States             900
USA                       866
Paris                     755
California                590
Canada                    535
Florida                   438
New York                  392
Australia                 379
London                    375
Ohio                      373
Michigan                  354
Chicago, IL               335
Los Angeles, CA           323
Chicago                   299
Colorado                  275
New York, NY              275
North Carolina            271
Minnesota                 259
Seattle, WA               254
Los Angeles               249
Indiana                   247
Virginia                  244
Wisconsin                 231
Arizona                   224
Atlanta, GA               221
Dallas, TX                220
Oregon                    218
Georgia                   204
Houston, TX               200
Oklahoma                  200
Utah                      198
Austin, TX                190
Pennsylvania              189
Illinois                  187
San Diego, CA             184
Tennessee                 182
UK                        182
Missouri                  181
Kentucky                  173
San Francisco, CA         172
Louisiana                 167
NYC                       167
Alabama                   163
Nashville, TN             157
Iowa                      149
Boston, MA                148
Kansas                    145
Southern California       144
Denver, CO                142
New Jersey                140
Sydney, Australia         138
South Carolina            134
Washington, DC            133
Maryland                  128
Arkansas                  127
Portland, OR              126
Phoenix, AZ               125
Atlanta                   124
London, UK                124
Melbourne, Australia      123
Ontario, Canada           121
Seattle                   121
Washington                121
Las Vegas, NV             116
New Zealand               116
United Kingdom            116
Brooklyn, NY              115
CA                        110
Minneapolis, MN           109
Houston, Texas            105
NC                        104
New York City             103
Toronto                   103
Austin, Texas             101
Charlotte, NC             101
South Africa              100
Pittsburgh, PA             98
San Francisco              98
Vancouver, BC              95
Germany                    94
Phoenix, Arizona           92
Barcelona                  89
Dallas, Texas              89
Portland, Oregon           89
England                    88
Idaho                      86
.                          83
San Diego                  83
West Virginia              83
Nevada                     82
The Netherlands            81
France                     79
Raleigh, NC                78
Kansas City, MO            76
Massachusetts              75
US                         75

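2 answers:

Answer 0 (score: 2)

Since your data is not extensive, this can be done manually quite easily. I went through each record, determined which country it belongs to, and added a new column holding the result. Once you have the country, you can use aggregate() to get the sums: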

location <- data.frame(location=c(271286,58145,1027,900,866,755,590,535,438,392,379,375,373,354,335,323,299,275,275,271,259,254,249,247,244,231,224,221,220,218,204,200,200,198,190,189,187,184,182,182,181,173,172,167,167,163,157,149,148,145,144,142,140,138,134,133,128,127,126,125,124,124,123,121,121,121,116,116,116,115,110,109,105,104,103,103,101,101,100,98,98,95,94,92,89,89,89,88,86,83,83,83,82,81,79,78,76,75,75),row.names=c('','null','Texas','United States','USA','Paris','California','Canada','Florida','New York','Australia','London','Ohio','Michigan','Chicago, IL','Los Angeles, CA','Chicago','Colorado','New York, NY','North Carolina','Minnesota','Seattle, WA','Los Angeles','Indiana','Virginia','Wisconsin','Arizona','Atlanta, GA','Dallas, TX','Oregon','Georgia','Houston, TX','Oklahoma','Utah','Austin, TX','Pennsylvania','Illinois','San Diego, CA','Tennessee','UK','Missouri','Kentucky','San Francisco, CA','Louisiana','NYC','Alabama','Nashville, TN','Iowa','Boston, MA','Kansas','Southern California','Denver, CO','New Jersey','Sydney, Australia','South Carolina','Washington, DC','Maryland','Arkansas','Portland, OR','Phoenix, AZ','Atlanta','London, UK','Melbourne, Australia','Ontario, Canada','Seattle','Washington','Las Vegas, NV','New Zealand','United Kingdom','Brooklyn, NY','CA','Minneapolis, MN','Houston, Texas','NC','New York City','Toronto','Austin, Texas','Charlotte, NC','South Africa','Pittsburgh, PA','San Francisco','Vancouver, BC','Germany','Phoenix, Arizona','Barcelona','Dallas, Texas','Portland, Oregon','England','Idaho','.','San Diego','West Virginia','Nevada','The Netherlands','France','Raleigh, NC','Kansas City, MO','Massachusetts','US'));
location$country <- factor(c(NA,NA,'United States','United States','United States','France','United States','Canada','United States','United States','Australia','United Kingdom','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United Kingdom','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','Australia','United States','United States','United States','United States','United States','United States','United States','United Kingdom','Australia','Canada','United States','United States','United States','New Zealand','United Kingdom','United States','Canada','United States','United States','United States','United States','Canada','United States','United States','South Africa','United States','United States','Canada','Germany','United States','Spain','United States','United States','United Kingdom','United States',NA,'United States','United States','United States','Netherlands','France','United States','United States','United States','United States'));
aggregate(location~country,location,sum);
##           country location
## 1       Australia      640
## 2          Canada      964
## 3          France      834
## 4         Germany       94
## 5     Netherlands       81
## 6     New Zealand      116
## 7    South Africa      100
## 8           Spain       89
## 9  United Kingdom      885
## 10  United States    15964

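I used NA where the country could not be determined from the location name alone; this applies to the three records named '', 'null', and '.'. Because aggregate() ignores records whose grouping value is NA, these records are not included in the result.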
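To connect this with the grep() idea from the question, here is a minimal sketch of how part of the manual labelling might be automated. It is not what the answer above did. The helper name `guess_country` and its small alias table are hypothetical; `state.name` and `state.abb` are datasets that ship with base R. Ambiguous strings such as a bare "CA" (California or Canada) are deliberately left as NA, and a name like "Georgia" is treated as a US state, which may be wrong for some users.

```r
# Hypothetical helper: guess the country from a free-text location string.
# It recognises only US states (via the built-in state.name / state.abb)
# and a small hand-written alias table; everything else becomes NA.
guess_country <- function(loc) {
  aliases <- c("USA" = "United States", "US" = "United States",
               "United States" = "United States",
               "UK" = "United Kingdom", "England" = "United Kingdom",
               "United Kingdom" = "United Kingdom",
               "Canada" = "Canada", "Australia" = "Australia")
  loc <- trimws(loc)
  # The last comma-separated piece is usually the strongest hint,
  # e.g. "TX" in "Dallas, TX" or "Australia" in "Sydney, Australia".
  last <- trimws(sub("^.*,", "", loc))
  country <- unname(aliases[last])
  # Full state names always count; two-letter codes only after a comma,
  # because a bare "CA" could also mean Canada.
  is_state <- last %in% state.name | (grepl(",", loc) & last %in% state.abb)
  country[is.na(country) & is_state] <- "United States"
  country
}

# Toy subset of the question's data (location strings and their counts).
locs   <- c("Texas", "Dallas, TX", "USA", "London, UK",
            "Sydney, Australia", "CA", "null")
counts <- c(1027, 220, 866, 124, 138, 110, 58145)

df <- data.frame(location = counts, country = guess_country(locs))
totals <- aggregate(location ~ country, df, sum)
print(totals)
```

Rows whose country is still NA after this step ("CA" and "null" in the toy data) are dropped by aggregate(), just like the three records described above, so they would still need a manual decision.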

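Answer 1 (score: 0)

I am not sure I understand the question, but I will give it a try.

You want to identify, for each location string, the country it belongs to, then group the strings by country and perform some operation per country group?

If so, what comes to mind is the geocode() function in ggmap, which uses the Google Maps API. This only makes sense if you are not making too many queries.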

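library(dplyr)
library(ggmap)
# Note: recent ggmap versions need a Google API key, set via register_google().

# Return the country that Google's geocoder reports for one location string.
MyGeoCode <- function(Location){
  return(geocode(Location,output = "more")$country)
}

# In the question's data frame the location strings are the row names,
# and the counts are in the `location` column.
location$country <- sapply(rownames(location),MyGeoCode)

location <- location %>% group_by(country) %>% summarise(TotalPerCountry=sum(location,na.rm = TRUE))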

This example assumes you want to sum the numeric column for each country; other operations follow the same pattern.
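Because the number of queries matters, one way to keep it down when starting from a raw column with repeated values is to look up each distinct string only once and then map the answers back to every row. The sketch below illustrates that idea offline: `lookup_country` is a hypothetical stand-in for a real geocoding call and makes no network request.

```r
# Hypothetical stand-in for a geocoding call; it only knows three cities.
lookup_country <- function(loc) {
  known <- c("Paris" = "France", "Toronto" = "Canada", "Barcelona" = "Spain")
  unname(known[loc])
}

# Raw location strings with repeats, as they might appear before summary().
locs <- c("Paris", "Toronto", "Paris", "Barcelona", "Toronto", "Paris")

# Query each distinct string once ...
uniq    <- unique(locs)
answers <- vapply(uniq, lookup_country, character(1))

# ... then map the answers back to every original row.
country <- unname(answers[match(locs, uniq)])
print(data.frame(location = locs, country = country))
```

Here six rows need only three lookups. If the strings already come from summary() of a factor, as in the question, they are unique to begin with and this step saves nothing.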