Question

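I need to merge two data frames, each containing one observation per study subject. Unfortunately, the data capture system lets end users enter some variables on two screens (for example, gender is captured at two time points and should not change). There are no database-side checks to confirm that the data are consistent between screens, so we are checking this in post-processing.

What I'd like to do is use the built-in R merge() function with the all=TRUE option, so that I get two rows wherever the shared variables don't match, and then have a separate column in the resulting data frame that tells me where each row came from (X or Y in the merge). As far as I can tell, merge() has nothing like this, so I'm trying to write my own wrapper around merge() to do it.

Example: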

example_df1 <- data.frame(subject_id=c(101,102,103,104,105),
                          gender=c("M","F","M","M","F"),
                          weight=c(120,130,110,114,144),
                          score=c(10,12,11,13,11))

example_df2 <- data.frame(subject_id=c(101,102,103,104,105),
                          gender=c("M","M","M","M","F"),
                          weight=c(120,130,110,117,144),
                          site1=c(13,18,23,12,4),
                          site2=c(3,7,8,11,0),
                          site3=c(31,28,12,29,40))

merge(x=example_df1,y=example_df2,all=TRUE)

  subject_id gender weight score site1 site2 site3
1        101      M    120    10    13     3    31
2        102      F    130    12    NA    NA    NA
3        102      M    130    NA    18     7    28
4        103      M    110    11    23     8    12
5        104      M    114    13    NA    NA    NA
6        104      M    117    NA    12    11    29
7        105      F    144    11     4     0    40

Desired output:

  subject_id gender weight score site1 site2 site3 rowsource
1        101      M    120    10    13     3    31      both
2        102      F    130    12    NA    NA    NA         x
3        102      M    130    NA    18     7    28         y
4        103      M    110    11    23     8    12      both
5        104      M    114    13    NA    NA    NA         x
6        104      M    117    NA    12    11    29         y
7        105      F    144    11     4     0    40      both

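Because of the regulatory environment around the project, I need to implement the solution in base R, without any special packages. My initial thought was to use intersect to find the variables common to example_df1 and example_df2, and then somehow compare each row of the merged result (on those common variables) against both example_df1 and example_df2 to work out where the row came from. That seems really clunky, so I'd appreciate suggestions on how to do this kind of task more efficiently. Thanks!

Edited to add: if R always put the X rows above the Y rows in this kind of merge, I suppose that would work too, but I'd feel better about something more robust than that.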

Answer 1

I would add another column before merging, to make life easier:

example_df1$source <- "X"
example_df2$source <- "Y"
Merged <- merge(x = example_df1, y = example_df2, all = TRUE,
                by = c("subject_id", "gender", "weight"))
Merged$rowSource <- apply(Merged[c("source.x", "source.y")], 1,
                          function(x) paste(na.omit(x), collapse = ""))
Merged
#   subject_id gender weight score source.x site1 site2 site3 source.y rowSource
# 1        101      M    120    10        X    13     3    31        Y        XY
# 2        102      F    130    12        X    NA    NA    NA     <NA>         X
# 3        102      M    130    NA     <NA>    18     7    28        Y         Y
# 4        103      M    110    11        X    23     8    12        Y        XY
# 5        104      M    114    13        X    NA    NA    NA     <NA>         X
# 6        104      M    117    NA     <NA>    12    11    29        Y         Y
# 7        105      F    144    11        X     4     0    40        Y        XY

From there, it should be easy to change "XY" to "both" if that's what you'd prefer in your output, and you can then drop the "source.x" and "source.y" columns.
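A minimal sketch of that cleanup step, assuming the Merged data frame is built exactly as above. The example data and the merge are repeated here only so the snippet runs on its own.

```r
# Rebuild the example data so this snippet is self-contained
example_df1 <- data.frame(subject_id = c(101, 102, 103, 104, 105),
                          gender = c("M", "F", "M", "M", "F"),
                          weight = c(120, 130, 110, 114, 144),
                          score = c(10, 12, 11, 13, 11),
                          stringsAsFactors = FALSE)
example_df2 <- data.frame(subject_id = c(101, 102, 103, 104, 105),
                          gender = c("M", "M", "M", "M", "F"),
                          weight = c(120, 130, 110, 117, 144),
                          site1 = c(13, 18, 23, 12, 4),
                          site2 = c(3, 7, 8, 11, 0),
                          site3 = c(31, 28, 12, 29, 40),
                          stringsAsFactors = FALSE)

# Tag each input, merge, and combine the tags (same steps as above)
example_df1$source <- "X"
example_df2$source <- "Y"
Merged <- merge(x = example_df1, y = example_df2, all = TRUE,
                by = c("subject_id", "gender", "weight"))
Merged$rowSource <- apply(Merged[c("source.x", "source.y")], 1,
                          function(x) paste(na.omit(x), collapse = ""))

# Relabel rows found in both inputs, then drop the helper columns
Merged$rowSource[Merged$rowSource == "XY"] <- "both"
Merged$source.x <- NULL
Merged$source.y <- NULL
```

Assigning NULL to a column is the base R way to remove it, so nothing beyond base R is needed here.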

Answer 2

This does everything in a single merge step and does not modify the original data.frames:

mm<-transform(merge(
    x=cbind(example_df1,source="x"),
    y=cbind(example_df2,source="y"),
    all=TRUE, by=intersect(names(example_df1), names(example_df2))),
    source=ifelse(!is.na(source.x) & !is.na(source.y), "both", 
        ifelse(!is.na(source.x), "x", "y")),
    source.x=NULL,
    source.y=NULL
)
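As a usage sketch, the source column can then be used to pull out the subjects whose shared variables disagree between the two inputs, which is the check the question is after. The name needs_review is a hypothetical helper variable, not part of the original answer; the data and merge are repeated so the snippet runs on its own.

```r
# Example data from the question
example_df1 <- data.frame(subject_id = c(101, 102, 103, 104, 105),
                          gender = c("M", "F", "M", "M", "F"),
                          weight = c(120, 130, 110, 114, 144),
                          score = c(10, 12, 11, 13, 11))
example_df2 <- data.frame(subject_id = c(101, 102, 103, 104, 105),
                          gender = c("M", "M", "M", "M", "F"),
                          weight = c(120, 130, 110, 117, 144),
                          site1 = c(13, 18, 23, 12, 4),
                          site2 = c(3, 7, 8, 11, 0),
                          site3 = c(31, 28, 12, 29, 40))

# Single-step merge with a source column, as in the answer above
mm <- transform(merge(
    x = cbind(example_df1, source = "x"),
    y = cbind(example_df2, source = "y"),
    all = TRUE, by = intersect(names(example_df1), names(example_df2))),
    source = ifelse(!is.na(source.x) & !is.na(source.y), "both",
        ifelse(!is.na(source.x), "x", "y")),
    source.x = NULL,
    source.y = NULL
)

# Subjects with at least one row that did not match across both inputs
# (needs_review is a hypothetical name used for illustration)
needs_review <- sort(unique(mm$subject_id[mm$source != "both"]))
needs_review
# 102 104
```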

Answer 3

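Thanks again for the answers. Once I saw the solution that simply uses cbind() to attach a source variable to each data frame, it was easy. I wrote a simple function to do it, which I'm sharing here.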

merge_with_source <- function(x,y,name.x="X",name.y="Y") {

    # Find the variables that the two data frames have in common
    merge.names <- intersect(names(x),names(y))

    # Next, attach a column to each data frame with the chosen name
    x.df <- cbind(x,datsrc=name.x)
    y.df <- cbind(y,datsrc=name.y)

    # Create a merged data frame on the common names
    merged.df <- merge(x=x.df,
                       y=y.df,
                       all=TRUE,
                       by=merge.names)

    # Eliminate NAs from the data source column
    merged.df[is.na(merged.df$datsrc.x),"datsrc.x"] <- ""
    merged.df[is.na(merged.df$datsrc.y),"datsrc.y"] <- ""

    # Paste the data source columns together to get a single variable
    # Then, note those that are "Both" by replacing the mangled name
    merged.df$datsrc <- paste(merged.df$datsrc.x,merged.df$datsrc.y,sep="")
    merged.df[merged.df$datsrc==paste(name.x,name.y,sep=""),"datsrc"] <- "Both"

    # Remove the data frame-specific variables
    merged.df$datsrc.x <- NULL
    merged.df$datsrc.y <- NULL

    return(merged.df)
}
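A short usage sketch follows. To keep it runnable on its own, it uses a condensed variant of the function above (same arguments and the same "Both" label, but with ifelse() in place of the paste-and-replace step). The labels "screen1" and "screen2" are hypothetical names chosen for illustration.

```r
# Condensed variant of merge_with_source() so the snippet is self-contained
merge_with_source <- function(x, y, name.x = "X", name.y = "Y") {
    merge.names <- intersect(names(x), names(y))
    merged.df <- merge(x = cbind(x, datsrc = name.x),
                       y = cbind(y, datsrc = name.y),
                       all = TRUE, by = merge.names)
    # as.character() guards against factor columns in R versions before 4.0
    src.x <- as.character(merged.df$datsrc.x)
    src.y <- as.character(merged.df$datsrc.y)
    merged.df$datsrc <- ifelse(is.na(src.x), src.y,
                               ifelse(is.na(src.y), src.x, "Both"))
    merged.df$datsrc.x <- NULL
    merged.df$datsrc.y <- NULL
    merged.df
}

example_df1 <- data.frame(subject_id = c(101, 102, 103, 104, 105),
                          gender = c("M", "F", "M", "M", "F"),
                          weight = c(120, 130, 110, 114, 144),
                          score = c(10, 12, 11, 13, 11))
example_df2 <- data.frame(subject_id = c(101, 102, 103, 104, 105),
                          gender = c("M", "M", "M", "M", "F"),
                          weight = c(120, 130, 110, 117, 144),
                          site1 = c(13, 18, 23, 12, 4),
                          site2 = c(3, 7, 8, 11, 0),
                          site3 = c(31, 28, 12, 29, 40))

# Label each input with a hypothetical screen name
checked <- merge_with_source(example_df1, example_df2,
                             name.x = "screen1", name.y = "screen2")

# Show only the rows that came from a single input
checked[checked$datsrc != "Both", c("subject_id", "gender", "weight", "datsrc")]
```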

Merging two R data frames and identifying the source of each row