Compare two datasets and create a new data frame from their union

Time: 2018-03-08 00:50:12

Tags: r

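I have two data frames, df1 and df2, and I want to create a new data frame by taking the union of the two. If a particular column has the value 1 in either data frame, the new data frame should have the value 1 in that column. The expected result is shown last, after the two inputs: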

    df1 = data.frame( V1 = letters[1:5], V2 = c("0","1","1","0","1"), V3 = c("0","0","0","0","1"), V4 =c("1","1","1","1","1"), V5 = c("0","0","0","0","1"),V6 =c("1","1","1","0","0"))

    df2 = data.frame( V1 = letters[1:5], V2 = c("1","1","1","0","0"), V3 = c("1","0","0","0","1"), V4 =c("0","0","1","0","1"), V5 = c("1","0","0","0","1"))

    result = data.frame( V1 = letters[1:5], V2 = c("1","1","1","0","1"), V3 = c("1","0","0","0","1"), V4 =c("1","1","1","1","1"), V5 = c("1","0","0","0","1"),V6 =c("1","1","1","0","0"))

1 answer:

Answer 0 (score: 3)

This is my first attempt; I'm sure it can be improved, though:

library(tidyverse)

set.seed(345)

df1 <- tibble(
  V1 = letters[1:5],
  V2 = sample(c(0,1), 5, replace = TRUE),
  V3 = sample(c(0,1), 5, replace = TRUE)
)

df2 <- tibble(
  V1 = letters[1:5],
  V2 = sample(c(0,1), 5, replace = TRUE),
  V3 = sample(c(0,1), 5, replace = TRUE)
)

df1

# A tibble: 5 x 3
     V1    V2    V3
  <chr> <dbl> <dbl>
1     a     0     1
2     b     0     0
3     c     0     1
4     d     1     0
5     e     0     0

df2

# A tibble: 5 x 3
     V1    V2    V3
  <chr> <dbl> <dbl>
1     a     0     0
2     b     1     1
3     c     0     0
4     d     1     1
5     e     1     1

Draft solution:

result <- df1 %>% 
  left_join(df2, by = "V1") %>% 
  rowwise() %>% 
  mutate(
    V2 = max(V2.x, V2.y),
    V3 = max(V3.x, V3.y)
  ) %>% 
  select(V1, V2, V3)

result

# A tibble: 5 x 3
     V1    V2    V3
  <chr> <dbl> <dbl>
1     a     0     1
2     b     1     1
3     c     0     1
4     d     1     1
5     e     1     1

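Obviously, this becomes a less-than-ideal answer if you have a large number of variables, because every column has to be written out by hand.

Update

Here is how to make the solution more general, so that it works for any number of columns: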

df1 %>% 
  select(V1) %>% 
  bind_cols( 
    map2_df(
      .x = df1[-1],
      .y = df2[-1], 
      .f = ~ map2_dbl(.x, .y, max)
    )
  )
# A tibble: 5 x 3
     V1    V2    V3
  <chr> <dbl> <dbl>
1     a     0     1
2     b     1     1
3     c     0     1
4     d     1     1
5     e     1     1

Here is how it works:

We can give map2_dbl a pair of vectors and find the maximum of the two at each position, like this:

map2_dbl(
  .x = c(0, 0, 0, 1, 0), 
  .y = c(0, 1, 0, 1, 1), 
  .f = max
)

[1] 0 1 0 1 1
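
As a side note, base R's pmax computes the same position-by-position maximum in a single vectorised call. The snippet below is only a minimal sketch of that alternative, not part of the original answer; the variable names are made up for illustration.

```r
# Element-wise maximum of two equal-length vectors using base R.
# x and y are the same illustrative vectors passed to map2_dbl above.
x <- c(0, 0, 0, 1, 0)
y <- c(0, 1, 0, 1, 1)

# pmax compares the vectors position by position and keeps the larger value.
elementwise_max <- pmax(x, y)
print(elementwise_max)
```

Because pmax is vectorised, it avoids calling max once per element, which may matter for long columns.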

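This becomes the innermost part of the solution. Now we just need to figure out how to pass every pair of variables from the two data frames, one pair at a time, to the map2_dbl call above. This silly example shows how that works: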

map2(
  .x = df1[-1], 
  .y = df2[-1], 
  .f = function(x = .x, y = .y) {
    cat("x = ", x, "y = ", y, "\n")
  }
)

x =  0 0 0 1 0 y =  0 1 0 1 1 
x =  1 0 1 0 0 y =  0 1 0 1 1 
$V2
NULL

$V3
NULL

Note that on the first pass, x = df1$V2 and y = df2$V2. On the second pass, x = df1$V3 and y = df2$V3. That is exactly what we want.

We can get to the final solution in three steps:
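
One caveat: the pairing is by position, not by column name, so it relies on both data frames listing their columns in the same order. The sketch below uses hypothetical data in which df2 has its columns swapped, and indexes both data frames by name before pairing them. It uses base R's Map and pmax as stand-ins for map2 and max; this is an assumption made for illustration, not the answer's own code.

```r
# Hypothetical data: df2 has the same columns as df1, but in a different order.
df1 <- data.frame(V1 = letters[1:3], V2 = c(0, 1, 0), V3 = c(1, 0, 0))
df2 <- data.frame(V1 = letters[1:3], V3 = c(0, 0, 1), V2 = c(1, 0, 0))

# Every column except the key column V1.
value_cols <- setdiff(names(df1), "V1")

# Selecting by name puts both data frames in the same column order,
# so V2 is paired with V2 and V3 with V3.
# Map() walks both lists in parallel, much like purrr::map2().
paired_max <- Map(pmax, df1[value_cols], df2[value_cols])

result <- data.frame(V1 = df1$V1, paired_max)
print(result)
```

Without the selection by name, df1$V2 would have been compared with df2$V3.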

x1 <- df1 %>% 
  select(V1)

x2 <- map2_df(
  .x = df1[-1], 
  .y = df2[-1], 
  .f = function(x = .x, y = .y) {
    map2_dbl(x, y, max)
  }
)

bind_cols(x1, x2)

# A tibble: 5 x 3
     V1    V2    V3
  <chr> <dbl> <dbl>
1     a     0     1
2     b     1     1
3     c     0     1
4     d     1     1
5     e     1     1

Alternatively, we can combine these steps into a single pipeline:

df1 %>% 
  select(V1) %>% 
  bind_cols( 
    map2_df(
      .x = df1[-1],
      .y = df2[-1], 
      .f = ~ map2_dbl(.x, .y, max)
    )
  )
# A tibble: 5 x 3
     V1    V2    V3
  <chr> <dbl> <dbl>
1     a     0     1
2     b     1     1
3     c     0     1
4     d     1     1
5     e     1     1
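
The data in the question differs from the example above in two ways: the values are stored as character strings, and V6 exists only in df1. The following is a rough base R sketch of one way to handle both. It assumes that the rows of df1 and df2 are in the same order, and that a column missing from one data frame should count as all zeros. The helper get_numeric is a made-up name for illustration.

```r
# Data from the question; stringsAsFactors = FALSE keeps "0"/"1" as strings.
df1 <- data.frame(V1 = letters[1:5],
                  V2 = c("0","1","1","0","1"), V3 = c("0","0","0","0","1"),
                  V4 = c("1","1","1","1","1"), V5 = c("0","0","0","0","1"),
                  V6 = c("1","1","1","0","0"), stringsAsFactors = FALSE)
df2 <- data.frame(V1 = letters[1:5],
                  V2 = c("1","1","1","0","0"), V3 = c("1","0","0","0","1"),
                  V4 = c("0","0","1","0","1"), V5 = c("1","0","0","0","1"),
                  stringsAsFactors = FALSE)

# All value columns that appear in either data frame (everything except V1).
value_cols <- setdiff(union(names(df1), names(df2)), "V1")

# Hypothetical helper: the column as numeric, or zeros if the column is absent.
get_numeric <- function(df, col) {
  if (col %in% names(df)) as.numeric(df[[col]]) else rep(0, nrow(df))
}

# Element-wise maximum per column; rows are assumed to be in the same order.
merged <- lapply(value_cols, function(col) {
  pmax(get_numeric(df1, col), get_numeric(df2, col))
})
names(merged) <- value_cols

result <- data.frame(V1 = df1$V1, merged, stringsAsFactors = FALSE)
print(result)
```

The result holds numeric 0/1 values rather than the character strings used in the question; wrap the pmax call in as.character if the original type is needed.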