Left join data.tables on matching or NA columns

Asked: 2019-06-12 14:53:04

Tags: r data.table

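I have a number of tables that need to be joined. However, in some cells the value is NA, and such a cell needs to match every possible value.

In SQL, this might look something like: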

SELECT * FROM A
LEFT JOIN B
ON (A.KEY1 = B.KEY1 OR B.KEY1 IS NULL)
AND (A.KEY2 = B.KEY2 OR B.KEY2 IS NULL) -- repeated for every other column

I can work around this by performing many joins, for example:

B[A, on = .(Key1, Key2, Key3), Var := i.Var]
B[A[is.na(Key2), ], on = .(Key1, Key3), Var := i.Var]
B[A[is.na(Key3), ], on = .(Key1, Key2), Var := i.Var]
B[A[is.na(Key2) & is.na(Key3), ], on = .(Key1), Var := i.Var]
B[A[is.na(Key1), ], on = .(Key2, Key3), Var := i.Var]
B[A[is.na(Key1) & is.na(Key2), ], on = .(Key3), Var := i.Var]
B[A[is.na(Key1) & is.na(Key3), ], on = .(Key2), Var := i.Var]

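However, this doesn't seem like the best approach, especially as the number of columns grows. The example above already needs 7 update joins for just 3 columns.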
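To make the growth concrete, here is a minimal base R sketch that enumerates the key subsets behind those joins. The names `keys` and `subsets` are only illustrative; one join is needed per non-empty subset, so the count is 2^k - 1 for k key columns.

```r
keys <- c("Key1", "Key2", "Key3")

# Every non-empty subset of the key columns corresponds to one update join.
subsets <- unlist(
  lapply(seq_along(keys), function(m) combn(keys, m, simplify = FALSE)),
  recursive = FALSE
)

length(subsets)       # number of joins for 3 key columns
2^length(keys) - 1    # the same count from the closed form
```

With 10 key columns the same formula gives 1023 joins, which is why this workaround does not scale.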

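For example, suppose I have a table that maps a description of a person (the city they live in, their hair colour, their height) to a name:

Observed data: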

a <- data.table(id = c(1, 2, 3),
                city = c("city1", "city2", "city2"),
                height = c("tall", "tall", "short"),
                hair = c("black", "black", "blonde"))
   id  city height   hair
1:  1 city1   tall  black
2:  2 city2   tall  black
3:  3 city2  short blonde

Table to match against:

b <- data.table(city = c("city1", "city1", "city2", "city2"),
            height = c("tall", "tall", "short", "tall"),
            hair = c("black", "blonde", "blonde", "black"),
            name = c("dave", "harry", "jack", "william"))
    city height   hair    name
1: city1   tall  black    dave
2: city1   tall blonde   harry
3: city2  short blonde    jack
4: city2   tall  black william

Joining them:

b[a, on = .(city, height, hair), .(id, city, height, hair, name)]
       id  city height   hair    name
    1:  1 city1   tall  black    dave
    2:  2 city2   tall  black william
    3:  3 city2  short blonde    jack

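This is as expected. What I need is for the join to keep working when some fields in the lookup table are missing, for example: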

        city height   hair    name
    1: city1     NA  black    dave
    2: city1     NA blonde   harry
    3: city2  short     NA    jack
    4: city2   tall  black william

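It should still produce the same output.

Is there an efficient way to do this within the data.table framework?

Thanks.

Edit:

To make this a bit clearer, if table b is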

    b <- data.table(city = c("city1", "city1", "city2", "city2"),
                    height = c(NA, "tall", "short", "tall"),
                    hair = c("black", "blonde", "blonde", "black"),
                    name = c("dave", "harry", "jack", "william"))

then the join only yields:

       id  city height   hair    name
    1:  1 city1   tall  black      NA
    2:  2 city2   tall  black william
    3:  3 city2  short blonde    jack

when it should yield:

       id  city height   hair    name
    1:  1 city1   tall  black    dave
    2:  2 city2   tall  black william
    3:  3 city2  short blonde    jack

NA should be treated as a "wildcard" that matches any value.

EDIT2:

A second workaround I found is to build the Cartesian product of the two tables first:

    ab <- a[, as.list(b), by = .(id, i.city = city, i.height = height, i.hair = hair)]

       id i.city i.height i.hair  city height   hair    name
     1:  1  city1     tall  black city1     NA  black    dave
     2:  1  city1     tall  black city1   tall blonde   harry
     3:  1  city1     tall  black city2  short blonde    jack
     4:  1  city1     tall  black city2   tall  black william
     5:  2  city2     tall  black city1     NA  black    dave
     6:  2  city2     tall  black city1   tall blonde   harry
     7:  2  city2     tall  black city2  short blonde    jack
     8:  2  city2     tall  black city2   tall  black william
     9:  3  city2    short blonde city1     NA  black    dave
    10:  3  city2    short blonde city1   tall blonde   harry
    11:  3  city2    short blonde city2  short blonde    jack
    12:  3  city2    short blonde city2   tall  black william

Then I apply my conditions as a filter:

    ab[(i.city == city | is.na(city)) 
       & (i.height == height | is.na(height)) 
       & (i.hair == hair | is.na(hair))]

     id i.city i.height i.hair  city height   hair    name
    1:  1  city1     tall  black city1     NA  black    dave
    2:  2  city2     tall  black city2   tall  black william
    3:  3  city2    short blonde city2  short blonde    jack

With large data sets, though, I'm not sure a Cartesian join like this is the best approach.
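One possible middle ground between the many update joins and the full Cartesian product is to group the rows of b by which key columns are NA, and run one ordinary join per NA pattern that actually occurs. The following is a minimal sketch, not a benchmarked solution. The helper `wildcard_join` and its arguments are hypothetical names, and it assumes every row of b has at least one non-NA key.

```r
library(data.table)

a <- data.table(id = c(1, 2, 3),
                city = c("city1", "city2", "city2"),
                height = c("tall", "tall", "short"),
                hair = c("black", "black", "blonde"))

b <- data.table(city = c("city1", "city1", "city2", "city2"),
                height = c(NA, "tall", "short", "tall"),
                hair = c("black", "blonde", "blonde", "black"),
                name = c("dave", "harry", "jack", "william"))

# Hypothetical helper: left join a to b, treating NA in b's keys as a wildcard.
wildcard_join <- function(a, b, keys) {
  payload <- setdiff(names(b), keys)
  # Label each row of b by its NA pattern, e.g. "FALSE TRUE FALSE".
  pattern <- do.call(paste, lapply(keys, function(k) is.na(b[[k]])))
  pieces <- lapply(split(seq_len(nrow(b)), pattern), function(rows) {
    sub <- b[rows]
    # Within one pattern a key column is either all NA or has no NA.
    on_cols <- keys[!vapply(keys, function(k) anyNA(sub[[k]]), logical(1))]
    # Assumption: at least one key is present in every row of b.
    stopifnot(length(on_cols) > 0L)
    # Inner join on the keys that are present; wildcard columns are ignored.
    merge(a, sub[, c(on_cols, payload), with = FALSE],
          by = on_cols, allow.cartesian = TRUE)
  })
  matched <- rbindlist(pieces, use.names = TRUE)
  # Restore left-join semantics: keep rows of a that matched nothing.
  merge(a, matched, by = names(a), all.x = TRUE)
}

result <- wildcard_join(a, b, keys = c("city", "height", "hair"))
result
```

The number of joins here equals the number of distinct NA patterns present in b, which is often far smaller than 2^k - 1, and no intermediate table holds the full product of a and b.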

1 answer:

Answer 0: (score: 1)

The least inefficient approach I could think of is to simply expand b so that an ordinary join can be done afterwards.

library(data.table)

a <- data.table(id = c(1, 2, 3),
                city = c("city1", "city2", "city2"),
                height = c("tall", "tall", "short"),
                hair = c("black", "black", "blonde"))

# One-row table whose cells are list columns holding the unique values of each column of a
a_unique <- a[, lapply(.SD, function(x) { list(unique(x)) })]

b <- data.table(city = c("city1", "city1", "city2", "city2"),
                height = c(NA, "tall", "short", NA),
                hair = c("black", NA, "blonde", NA),
                name = c("dave", "harry", "jack", "william"))

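# Flatten a matrix row by row and drop the NAs,
# leaving exactly one non-NA value per row
harmonize <- function(mat) {
  ans <- as.vector(t(mat))
  ans[!is.na(ans)]
}

# Replace every NA in each of `cols` with all values observed in a, one column at a time
expand_recursively <- function(dt, cols) {
  if (length(cols) == 0L) return(dt)

  current <- cols[1L]
  next_cols <- cols[-1L]
  not_current <- setdiff(names(dt), current)

  # Lookup table: an NA key (of the right type) paired with every candidate value
  na_class <- class(a_unique[[current]][[1L]])
  expanded <- data.table(as(NA, na_class), all = a_unique[[current]][[1L]])
  setnames(expanded, c(current, "all"))

  # Rows of dt with NA in `current` match every row of `expanded`;
  # the other rows match nothing and keep their own value
  next_dt <- expanded[dt,
                      c(list(harmonize(as.matrix(.SD))), mget(not_current)),
                      on = current,
                      .SDcols = c(current, "all"),
                      allow.cartesian = TRUE]

  setnames(next_dt, "V1", current)
  expand_recursively(next_dt, next_cols)
}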

b_expanded <- expand_recursively(b, intersect(names(a), names(b)))
setcolorder(b_expanded, names(b))

b
    city height   hair    name
1: city1   <NA>  black    dave
2: city1   tall   <NA>   harry
3: city2  short blonde    jack
4: city2   <NA>   <NA> william

b_expanded
    city height   hair    name
1: city1   tall  black    dave
2: city1  short  black    dave
3: city1   tall  black   harry
4: city1   tall blonde   harry
5: city2  short blonde    jack
6: city2   tall  black william
7: city2   tall blonde william
8: city2  short  black william
9: city2  short blonde william

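I think the potentially problematic part is computing a_unique. If you already know the values that can be used for matching, perhaps you can specify them directly in expand_recursively.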
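As a rough illustration of that last suggestion, the sketch below passes the candidate values in explicitly instead of deriving them from a_unique. It is a simplified loop-based variant, not the recursive function above. The helper `expand_wildcards` and its `values` argument are hypothetical names, and it assumes each wildcard column already has the same type as the supplied values.

```r
library(data.table)

a <- data.table(id = c(1, 2, 3),
                city = c("city1", "city2", "city2"),
                height = c("tall", "tall", "short"),
                hair = c("black", "black", "blonde"))

b <- data.table(city = c("city1", "city1", "city2", "city2"),
                height = c(NA, "tall", "short", NA),
                hair = c("black", NA, "blonde", NA),
                name = c("dave", "harry", "jack", "william"))

# Hypothetical helper: `values` is a named list mapping a column name to the
# candidate values that an NA in that column should stand for.
expand_wildcards <- function(dt, values) {
  for (col in names(values)) {
    is_wild <- is.na(dt[[col]])
    if (!any(is_wild)) next
    wild <- dt[is_wild]
    # Repeat each wildcard row once per candidate value ...
    filled <- wild[rep(seq_len(nrow(wild)), each = length(values[[col]]))]
    # ... and overwrite the NA with the candidates (assumes matching types).
    set(filled, j = col, value = rep(values[[col]], times = nrow(wild)))
    dt <- rbind(dt[!is_wild], filled)
  }
  dt
}

b_expanded <- expand_wildcards(b, list(height = c("tall", "short"),
                                       hair = c("black", "blonde")))

# An ordinary join now works, because no key in b_expanded is NA.
joined <- b_expanded[a, on = .(city, height, hair),
                     .(id, city, height, hair, name)]
joined
```

Note that with this b several rows of a match more than one name (for instance, both dave and harry fit city1, tall, black), so the join returns more rows than a has. The row order of b_expanded also differs from the recursive version, because the expanded rows are appended at the end.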