Question

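I have been given a set of country groups, and I am trying to obtain a set of mutually exclusive regions so that I can compare them. The problem is that my data contains several groups, many of which overlap. How can I get a set of groups that contains all countries but whose members do not overlap?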

For example, suppose this is the list of countries in the world:

World <- c("Angola", "France", "Germany", "Australia", "New Zealand")

And suppose these are my groups:

df <- data.frame(group = c("Africa", "Western Europe", "Europe", "Europe", "Oceania", "Oceania", "Commonwealth Countries"), 
           element = c("Angola", "France", "Germany", "France", "Australia", "New Zealand", "Australia"))

                   group     element
1                 Africa      Angola
2         Western Europe      France
3                 Europe     Germany
4                 Europe      France
5                Oceania   Australia
6                Oceania New Zealand
7 Commonwealth Countries   Australia

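How can I remove the overlapping groups (Western Europe, in this example) to get a set of groups that contains all countries, like the following: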

df_solved <- data.frame(group = c("Africa", "Europe", "Europe", "Oceania", "Oceania"),
                        element = c("Angola", "France", "Germany", "Australia", "New Zealand"))

    group     element
1  Africa      Angola
2  Europe      France
3  Europe     Germany
4 Oceania   Australia
5 Oceania New Zealand

Answer 1

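One possible rule is to minimize the number of groups, e.g., by associating each element with the group that contains the most elements.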

library(data.table)
setDT(df)[, n.elements := .N, by = group][
  order(-n.elements), .(group = group[1L]), by = element]

       element   group
1:     Germany  Europe
2:      France  Europe
3:   Australia Oceania
4: New Zealand Oceania
5:      Angola  Africa

Explanation

setDT(df)[, n.elements := .N, by = group][]

returns

                    group     element n.elements
1:                 Africa      Angola          1
2:         Western Europe      France          1
3:                 Europe     Germany          2
4:                 Europe      France          2
5:                Oceania   Australia          2
6:                Oceania New Zealand          2
7: Commonwealth Countries   Australia          1

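Now the rows are ordered by decreasing number of elements and, for each country, the first row is picked, i.e., the "largest" group. This should return one group per country, as requested. In case of ties, i.e., when a country belongs to several groups with the same number of elements, you can add extra criteria when ordering, e.g., the length of the group name, or simply alphabetical order.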
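
To make the tie-breaking idea concrete, here is a minimal base R sketch of the same rule with an alphabetical tie-break. The "Eurozone" rows are not in the question; they are added here only to create a tie with "Europe", and alphabetical order is just one assumed choice of secondary criterion.

```r
# Example data from the question plus a hypothetical "Eurozone" group,
# added only so that France and Germany sit in two groups of equal size.
df <- data.frame(
  group = c("Africa", "Western Europe", "Europe", "Europe", "Oceania",
            "Oceania", "Commonwealth Countries", "Eurozone", "Eurozone"),
  element = c("Angola", "France", "Germany", "France", "Australia",
              "New Zealand", "Australia", "France", "Germany"),
  stringsAsFactors = FALSE
)

# Size of the group each row belongs to.
df$n.elements <- ave(seq_along(df$group), df$group, FUN = length)

# Largest group first; ties are broken by group name (assumed rule).
sorted <- df[order(-df$n.elements, df$group), ]

# Keep the first, i.e. highest-ranked, row for each element.
picked <- sorted[!duplicated(sorted$element), c("element", "group")]
picked
```

With the secondary key in place, the result no longer depends on the order in which the rows happen to be listed.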

Answer 2

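1) If you simply want to eliminate duplicate elements, use !duplicated(...) as shown. This keeps the first row listed for each element. No packages are used.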

subset(df, !duplicated(element))

giving:

    group     element
1  Africa      Angola
2  Europe      France
3  Europe     Germany
5 Oceania   Australia
6 Oceania New Zealand
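
Whichever approach is used, it may be worth checking that the result really contains every country exactly once. The helper below is a hypothetical name introduced for this sketch, not part of any package, and it only checks the elements, not whether each group was kept whole.

```r
World <- c("Angola", "France", "Germany", "Australia", "New Zealand")
df <- data.frame(
  group = c("Africa", "Western Europe", "Europe", "Europe",
            "Oceania", "Oceania", "Commonwealth Countries"),
  element = c("Angola", "France", "Germany", "France",
              "Australia", "New Zealand", "Australia"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when every country in `world` appears
# exactly once in x$element and no other country appears.
is_partition <- function(x, world) {
  !anyDuplicated(x$element) && setequal(x$element, world)
}

res <- subset(df, !duplicated(element))
is_partition(res, World)  # TRUE
is_partition(df, World)   # FALSE, France and Australia are repeated
```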

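2) Set partitioning. If each group must be either entirely in or entirely out, and each element may appear only once, then this is a set partitioning problem: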

library(lpSolve)
const.mat <- with(df, table(element, group))
obj <- rep(1L, ncol(const.mat))
res <- lp("min", obj, const.mat, "=", 1L, all.bin = TRUE)
# index the column names directly so that a single selected group also works
subset(df, group %in% colnames(const.mat)[res$solution == 1])

giving:

    group     element
1  Africa      Angola
2  Europe      France
3  Europe     Germany
5 Oceania   Australia
6 Oceania New Zealand
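
For a handful of groups, the partitioning constraint can also be illustrated without a solver. The following is a brute-force sketch in base R, not the lpSolve formulation above: it tries subsets of groups in increasing size and stops at the first one in which every element is covered exactly once. It is only practical for small inputs, since the number of subsets grows exponentially.

```r
df <- data.frame(
  group = c("Africa", "Western Europe", "Europe", "Europe",
            "Oceania", "Oceania", "Commonwealth Countries"),
  element = c("Angola", "France", "Germany", "France",
              "Australia", "New Zealand", "Australia"),
  stringsAsFactors = FALSE
)

# Incidence matrix: rows are elements, columns are groups.
const.mat <- with(df, table(element, group))
groups <- colnames(const.mat)

best <- NULL
# Subsets are tried smallest first, so the first exact partition is minimal.
for (k in seq_along(groups)) {
  for (idx in combn(length(groups), k, simplify = FALSE)) {
    covered <- rowSums(const.mat[, idx, drop = FALSE])
    if (all(covered == 1)) {  # each element in exactly one chosen group
      best <- groups[idx]
      break
    }
  }
  if (!is.null(best)) break
}
best  # "Africa" "Europe" "Oceania"
```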

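3) Set covering. Of course, there may be no exact set partition, so we could instead consider the set covering problem (the same code, except that "=" is replaced by ">=" in the lp line):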

library(lpSolve)

const.mat <- with(df, table(element, group))
obj <- rep(1L, ncol(const.mat))
res <- lp("min", obj, const.mat, ">=", 1L, all.bin = TRUE)
# index the column names directly so that a single selected group also works
subset(df, group %in% colnames(const.mat)[res$solution == 1])

giving:

    group     element
1  Africa      Angola
2  Europe      France
3  Europe     Germany
5 Oceania   Australia
6 Oceania New Zealand

We could then optionally apply (1) to remove any duplicates in the cover.
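
Here is a small sketch of a case where no exact partition exists, so a cover followed by (1) is needed. The data (groups "A" and "B", elements "x", "y", "z") is made up for illustration, and the cover is found by brute force in base R rather than with lpSolve.

```r
# Hypothetical data: A = {x, y} and B = {y, z} overlap on y,
# so no choice of whole groups covers every element exactly once.
df2 <- data.frame(
  group = c("A", "A", "B", "B"),
  element = c("x", "y", "y", "z"),
  stringsAsFactors = FALSE
)
const.mat2 <- with(df2, table(element, group))
groups2 <- colnames(const.mat2)

cover <- NULL
# Smallest subsets first; accept the first one covering every element.
for (k in seq_along(groups2)) {
  for (idx in combn(length(groups2), k, simplify = FALSE)) {
    if (all(rowSums(const.mat2[, idx, drop = FALSE]) >= 1)) {
      cover <- groups2[idx]
      break
    }
  }
  if (!is.null(cover)) break
}

# Keep the rows of the cover, then drop repeated elements as in (1).
in_cover <- subset(df2, group %in% cover)
deduped <- subset(in_cover, !duplicated(element))
deduped
```

Here both groups are needed for the cover, and the deduplication step assigns "y" to "A" simply because that row comes first.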

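4) Non-dominated groups. Another approach is to remove any group whose elements form a strict subset of some other group's elements. For example, every element of Western Europe is in Europe, and Europe has more elements than Western Europe, so the elements of Western Europe are a strict subset of the elements of Europe and we remove Western Europe. Using const.mat from above: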

# returns TRUE if jth column of const.mat is dominated by some other column
is_dom_fun <- function(j) any(apply(const.mat[, j] <= const.mat[, -j], 2, all) & 
                            sum(const.mat[, j]) < colSums(const.mat[, -j]))

is_dom <- sapply(seq_len(ncol(const.mat)), is_dom_fun) 
subset(df, group %in% colnames(const.mat)[!is_dom])

giving:

    group     element
1  Africa      Angola
3  Europe     Germany
4  Europe      France
5 Oceania   Australia
6 Oceania New Zealand

If any duplicates remain, we can use (1) to remove them.
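
The same dominance test can be written with plain set operations instead of the incidence matrix, which some may find easier to read. This is a sketch of an equivalent formulation, and the helper name is_strict_subset is made up for the example.

```r
df <- data.frame(
  group = c("Africa", "Western Europe", "Europe", "Europe",
            "Oceania", "Oceania", "Commonwealth Countries"),
  element = c("Angola", "France", "Germany", "France",
              "Australia", "New Zealand", "Australia"),
  stringsAsFactors = FALSE
)

# Named list mapping each group to its elements.
members <- split(df$element, df$group)

# Hypothetical helper: TRUE if set a is a strict subset of set b.
is_strict_subset <- function(a, b) {
  all(a %in% b) && length(unique(a)) < length(unique(b))
}

# A group is dominated if it is a strict subset of some other group.
dominated <- vapply(names(members), function(g) {
  others <- members[names(members) != g]
  any(vapply(others, function(o) is_strict_subset(members[[g]], o), logical(1)))
}, logical(1))

subset(df, group %in% names(dominated)[!dominated])
```

On the example data this flags "Western Europe" (inside "Europe") and "Commonwealth Countries" (inside "Oceania") as dominated.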

Answer 3

library(dplyr)
df %>% distinct(element, .keep_all=TRUE)

    group     element
1  Africa      Angola
2  Europe      France
3  Europe     Germany
4 Oceania   Australia
5 Oceania New Zealand

Hat tip to Axeman for beating me to this answer.

Update

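Your question is not well defined. Why is 'Europe' preferred over 'Western Europe'? In other words, each country is assigned to several groups, and you want to reduce that to one group per country. How do you decide which group?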

Here is one way, in which we always prefer the largest group:

groups <- df %>% count(group)
df %>% inner_join(groups, by='group') %>%
  arrange(desc(n)) %>% distinct(element, .keep_all=TRUE)

    group     element n
1  Europe      France 2
2  Europe     Germany 2
3 Oceania   Australia 2
4 Oceania New Zealand 2
5  Africa      Angola 1

Answer 4

Here is an option with data.table:

library(data.table)
setDT(df)[, head(.SD, 1), element]

or with unique:

unique(setDT(df), by = 'element')
#    group     element
#1:  Africa      Angola
#2:  Europe      France
#3:  Europe     Germany
#4: Oceania   Australia
#5: Oceania New Zealand

The only package used is data.table.

Answer 5

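A completely different approach is to ignore the given groups and simply look up the country names in a catalogue of UN regions, such as those available in the countrycode or ISOcodes packages.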

The countrycode package seems to offer the simpler interface, and it also warns about country names that cannot be found in its database:

# given country names - note the deliberately misspelled last entry 
World <- c("Angola", "France", "Germany", "Australia", "New Zealand", "New Sealand")
# regions
countrycode::countrycode(World, "country.name.en", "region")

[1] "Middle Africa"             "Western Europe"            "Western Europe"            "Australia and New Zealand"
[5] "Australia and New Zealand" NA                         
Warning message:
In countrycode::countrycode(World, "country.name.en", "region") :
  Some values were not matched unambiguously: New Sealand

# continents
countrycode::countrycode(World, "country.name.en", "continent")

[1] "Africa"  "Europe"  "Europe"  "Oceania" "Oceania" NA       
Warning message:
In countrycode::countrycode(World, "country.name.en", "continent") :
  Some values were not matched unambiguously: New Sealand
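
The reason a lookup gives mutually exclusive groups is that each country name maps to exactly one value. The sketch below shows how such a lookup could be turned back into a group/element data frame. To stay self-contained it uses a small hand-written named vector (an assumption for illustration) as a stand-in for the package database rather than calling countrycode itself.

```r
# Given country names, including the deliberately misspelled last entry.
World <- c("Angola", "France", "Germany", "Australia",
           "New Zealand", "New Sealand")

# Hand-written stand-in for a package lookup table (illustration only).
continent_of <- c(Angola = "Africa", France = "Europe", Germany = "Europe",
                  Australia = "Oceania", "New Zealand" = "Oceania")

# One lookup per country; names missing from the table become NA.
looked_up <- data.frame(
  group = unname(continent_of[World]),
  element = World,
  stringsAsFactors = FALSE
)

# Report the unmatched names and keep the rest.
unmatched <- looked_up$element[is.na(looked_up$group)]
df_exclusive <- looked_up[!is.na(looked_up$group), ]
df_exclusive
```

Because every element gets at most one group, the resulting groups cannot overlap, although they will generally differ from the groups supplied in the original data.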
