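Merging data frames with a non-unique, probabilistic key

Time: 2015-12-15 17:11:37

Tags: r merge

The goal is to merge df2 into df1, where the key values in df2 are not unique but come in groups, and each row within a group carries a probability value. A simple example: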

df1
#    key
#1     A
#2     B
#3     C
#4     C
#5     A
#6     A
#7     D

df2
#     key  code prob
#1      A     1 0.75
#2      A     2 0.25
#3      B     1 0.95
#4      B     2 0.05
#5      C     1 0.20
#6      C     2 0.25
#7      C     3 0.55
#8      D     1 0.33
#9      D     2 0.33
#10     D     3 0.33

The expected result would look something like the following, with code assigned according to the probabilities in df2:

#    key code
#1     A    1
#2     B    1
#3     C    3
#4     C    3
#5     A    2
#6     A    1
#7     D    2

df1 <- structure(list(key = structure(c(1L, 2L, 3L, 3L, 1L, 1L, 4L), .Label = c("A", 
"B", "C", "D"), class = "factor")), .Names = "key", class = "data.frame", row.names = c(NA, 
-7L))

df2 <- structure(list(key = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 3L, 
4L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"), 
    code = c(1L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 3L), prob = c(0.75, 
    0.25, 0.95, 0.05, 0.2, 0.25, 0.55, 0.33, 0.33, 0.33)), .Names = c("key", 
"code", "prob"), class = "data.frame", row.names = c(NA, -10L
))

Data: the structure() calls above reproduce df1 and df2.
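One detail worth checking before sampling: the probabilities for key D sum to 0.99, not 1. The sketch below, in base R, rebuilds only the key and prob columns of df2 from the question and sums prob within each key. The names prob_sums and prob_norm are introduced here for illustration. The mismatch should not affect the sampling itself, since both base R's sample() and dplyr's sample_n() treat the weights as relative and rescale them.

```r
# key and prob columns of df2, copied from the question
df2 <- data.frame(
  key  = rep(c("A", "B", "C", "D"), times = c(2, 2, 3, 3)),
  prob = c(0.75, 0.25, 0.95, 0.05, 0.20, 0.25, 0.55, 0.33, 0.33, 0.33)
)

# Sum of prob within each key; D comes out as 0.99
prob_sums <- tapply(df2$prob, df2$key, sum)
print(prob_sums)

# Optional: rescale so that every group sums to exactly 1
df2$prob_norm <- df2$prob / ave(df2$prob, df2$key, FUN = sum)
```

Normalizing explicitly is only needed if the probabilities are reused elsewhere, for example when reporting them.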

3 Answers:

Answer 0: (score: 2)

I'm pretty sure you just want:

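library(dplyr)

# Draw one code per key, weighted by prob, then join the result onto df1.
# Note: all df1 rows that share a key receive the same sampled code.
df2 %>%
  group_by(key) %>%
  sample_n(1, weight = prob) %>%
  right_join(df1, by = "key")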

Answer 1: (score: 2)

Use apply over each row of df1 and sample from the codes available in df2 for the current value of key, weighted by prob:

df1$code = apply(df1, 1, function(x) {
  sample(df2$code[df2$key==x["key"]], 1, prob=df2$prob[df2$key==x["key"]])
})
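The same row-wise sampling can be written without apply, which coerces the data frame to a character matrix first. The following is a minimal base R sketch, not the answerer's code: it rebuilds df1 and df2 from the question, and the helper name draw_code is hypothetical. It samples a row index rather than the code value itself, because sample(x, 1) treats a single number x as the range 1:x, which would give wrong results for a key that has only one code.

```r
df1 <- data.frame(key = c("A", "B", "C", "C", "A", "A", "D"))
df2 <- data.frame(
  key  = rep(c("A", "B", "C", "D"), times = c(2, 2, 3, 3)),
  code = c(1L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 3L),
  prob = c(0.75, 0.25, 0.95, 0.05, 0.20, 0.25, 0.55, 0.33, 0.33, 0.33)
)

# Hypothetical helper: draw one code for a single key, weighted by prob.
draw_code <- function(k) {
  rows <- which(df2$key == k)
  if (length(rows) == 0L) return(NA_integer_)  # key not present in df2
  # Sample a position within the group, then map it back to a df2 row
  pick <- rows[sample.int(length(rows), 1L, prob = df2$prob[rows])]
  df2$code[pick]
}

set.seed(1)  # only to make the draw repeatable
df1$code <- vapply(as.character(df1$key), draw_code, integer(1),
                   USE.NAMES = FALSE)
print(df1)
```

Each row of df1 gets its own independent draw, so repeated keys such as A can end up with different codes.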

Answer 2: (score: 1)

I think this is what you want.

library(dplyr)
df1$id <- seq(nrow(df1))
df3 <- merge(df1, df2, by = "key", all.x = TRUE)
df3 %>% group_by(id) %>% sample_n(1, weight = prob)

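I generated an id variable for df1 and merged df1 with all the possible codes in df2. Then dplyr::sample_n does the weighted sampling within each id. A typical result would be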

Source: local data frame [7 x 4]
Groups: id

  key id code prob
1   A  1    1 0.75
2   B  2    1 0.95
3   C  3    3 0.55
4   C  4    3 0.55
5   A  5    1 0.75
6   A  6    1 0.75
7   D  7    1 0.33
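Because the result is random, a single run like the one above says little about whether the weights are respected. One way to check is to repeat the draw many times for one key and compare the observed frequencies with prob. The sketch below uses base R and the key C probabilities from the question; the 10,000 draws and the fixed seed are arbitrary choices made for illustration.

```r
set.seed(42)  # arbitrary seed, only for repeatability

codes <- c(1L, 2L, 3L)         # codes available for key C
prob  <- c(0.20, 0.25, 0.55)   # their probabilities in df2

# Many independent weighted draws for the same key
draws <- sample(codes, 10000, replace = TRUE, prob = prob)

# Observed share of each code, in the same order as `codes`
freq <- as.numeric(table(factor(draws, levels = codes))) / length(draws)
print(round(freq, 2))
```

The observed shares should land close to 0.20, 0.25 and 0.55, with small deviations from run to run.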