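Comparing groups of items across several data frames, row by row

Time: 2015-03-20 13:50:49

Tags: r

I would like to compare several data frames. Here are two example data sets:

data1: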

> dput(data1)
structure(list(cluster = c(1, 1, 2, 3, 3, 4, 5, 6, 6, 6, 6, 6, 
6, 6, 7, 8, 9, 10, 11, 11, 11, 11, 12, 12, 12, 13, 13, 13, 13, 
14, 15, 15), description = c("BTB", "BTB", "CVA", "BAS", "TRK", 
"EXT", "LRA", "CAW", "CAW", "CAW", "CAW", "CAW", "TTE", "TTE", 
"MYU", "MTQ", "PLI", "KQA", "STG", "STG", "ATF", "ATF", "REW", 
"REW", "REW", "KIR", "KIR", "ROR", "ROR", "FRQ", "QEQ", "QEQ"
)), .Names = c("cluster", "description"), row.names = c("Mazda RX4", 
"Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout", 
"Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280", 
"Merc 280C", "Merc 450SE", "Merc 450SL", "Merc 450SLC", "Cadillac Fleetwood", 
"Lincoln Continental", "Chrysler Imperial", "Fiat 128", "Honda Civic", 
"Toyota Corolla", "Toyota Corona", "Dodge Challenger", "AMC Javelin", 
"Lotus Europa", "Ford Pantera L", "Ferrari Dino", "Maserati Bora", 
"Volvo 142E", "Volvo 144", "Chrysler", "Ford 131", "Ford 144"
), class = "data.frame")

data2:

> dput(data2)
structure(list(cluster = c(3, 4, 5, 5, 5, 6, 6, 3, 3, 6, 7, 8, 
9, 10, 11, 11, 11, 11, 12, 12, 12, 13, 14, 13, 11, 14, 15, 15, 
1, 1, 2, 2), description = c("TRK", "EXT", "LRA", "CAW", "CAW", 
"CAW", "CAW", "CAW", "TTE", "TTE", "MYU", "MTQ", "PLI", "KQA", 
"STG", "STG", "ATF", "ATF", "REW", "REW", "REW", "KIR", "KIR", 
"ROR", "ROR", "FRQ", "QEQ", "QEQ", "BTB", "BTB", "CVA", "BAS"
)), .Names = c("cluster", "description"), row.names = c("Hornet Sportabout", 
"Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280", 
"Merc 280C", "Merc 450SE", "Merc 450SL", "Merc 450SLC", "Cadillac Fleetwood", 
"Lincoln Continental", "Chrysler Imperial", "Fiat 128", "Honda Civic", 
"Toyota Corolla", "Toyota Corona", "Dodge Challenger", "AMC Javelin", 
"Lotus Europa", "Ford Pantera L", "Ferrari Dino", "Maserati Bora", 
"Volvo 142E", "Volvo 144", "Chrysler", "Ford 131", "Ford 144", 
"Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive"), class = "data.frame")

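So both data sets contain the same row.names and descriptions, but in a different order. I would like to compare which cars are found together in the same cluster. Take "Merc 240D" as an example.

In data1 it belongs to cluster 6, together with: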

            cluster description
Merc 240D         6         CAW
Merc 230          6         CAW
Merc 280          6         CAW
Merc 280C         6         CAW
Merc 450SE        6         CAW
Merc 450SL        6         TTE
Merc 450SLC       6         TTE

Now let's turn to data2. Here "Merc 240D" belongs to cluster 5, together with:

Duster 360                5         LRA
Merc 240D                 5         CAW
Merc 230                  5         CAW

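This time there are only three cars in the cluster, and only one car, "Merc 230", shares a cluster with "Merc 240D" in both data sets.

I would like to run this analysis for every row (car) in my data sets: find the cluster it belongs to and the cars that share that cluster, then compare the result with the other data sets.

The problem is that I have 20 data sets to compare. I believe a loop is necessary!

As output, I would like a table like this (just an example):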

            nr_partners  name of partners     Description  Descr_partners
Merc 240D   3            Merc1, Merc2, Merc3  CAW          CAW, TTE, TTE

Is something like this possible? Thanks in advance for your help!

1 Answer:

Answer 0 (score: 1)

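If you only want to return the example output table for each table, you can use aggregate and merge. Here is how to do it for the model names; you can adapt it to the other information: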

    # first make a column aggregating all the partners for each cluster
    pasteAlphabetical <- function(vectNames){
        return(paste(sort(vectNames),collapse=","))
    }
    byCluster <- aggregate(row.names(data1),by=list(cluster=data1$cluster),FUN=pasteAlphabetical)

    # then you can attribute this to each row; merge() drops the row names,
    # so keep the car names in a column and leave data1 itself untouched
    data1Partners <- merge(cbind(names=row.names(data1),data1),byCluster,by="cluster")
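
The table layout asked for in the question (number of partners, partner names, description, partner descriptions) could be built for a single data frame along these lines. This is only a sketch: partnerTable is a hypothetical helper name, and toy1 is a four-car subset of data1 from the question, used so that the example runs on its own.

```r
# Toy subset of data1 from the question (four cars only)
toy1 <- data.frame(cluster = c(5, 6, 6, 6),
                   description = c("LRA", "CAW", "CAW", "CAW"),
                   row.names = c("Duster 360", "Merc 240D", "Merc 230", "Merc 280"),
                   stringsAsFactors = FALSE)

# partnerTable() is a hypothetical helper, not part of any package
partnerTable <- function(dat) {
    cars <- row.names(dat)
    idx <- seq_along(cars)
    # for each car, the positions of the other cars in the same cluster
    partnerIdx <- lapply(idx, function(i) which(dat$cluster == dat$cluster[i] & idx != i))
    data.frame(
        nr_partners = vapply(partnerIdx, length, integer(1)),
        partners = vapply(partnerIdx, function(p) paste(cars[p], collapse = ", "), character(1)),
        description = dat$description,
        descr_partners = vapply(partnerIdx, function(p) paste(dat$description[p], collapse = ", "),
                                character(1)),
        row.names = cars,
        stringsAsFactors = FALSE
    )
}

res1 <- partnerTable(toy1)
print(res1)
```

Unlike the per-cluster aggregate, this version leaves the car itself out of its own partner list, which seems to be what the example table in the question expects.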

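But if you want to see which models are in the same cluster across several tables, you need to merge the cluster assignments from all the tables, then aggregate the models that always end up in the same clusters: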

    # get the clusters in each table for each car
    SummarizeClusters <- function(datas){
        summaryDat <- NULL
        for(id in seq_along(datas)){
            # tag the columns with the table index, so that merging more than
            # two tables does not produce duplicated column names
            dat <- datas[[id]][,c("cluster","description"),drop=FALSE]
            names(dat) <- paste(names(dat),id,sep=".")
            dat$names <- row.names(datas[[id]])
            summaryDat <- if(is.null(summaryDat)) dat else merge(summaryDat,dat,by="names",all=TRUE)
        }

        return(summaryDat)
    }
    datas <- list(data1,data2)
    sumDat <- SummarizeClusters(datas)

    clusterCols <- names(sumDat)[grep("cluster",names(sumDat))] # get cluster column names

    # and then aggregate models that have clusters in common
    # (drop=FALSE keeps `by` a list even when there is a single cluster column)
    alwaysSameClusters<-aggregate(sumDat$names,
            by=sumDat[,clusterCols,drop=FALSE],FUN=pasteAlphabetical)

This gives you the list of models that are always grouped together in the same cluster.
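
For the per-car view described in the question (which partners of "Merc 240D" stay with it in every table), one possible sketch is to intersect the partner sets table by table. partnersOf and commonPartners are hypothetical helper names, and toy1 and toy2 are four-car subsets of data1 and data2. The sketch also assumes the cars of interest are the row names of the first table.

```r
# Toy subsets of data1 and data2 from the question (four cars only)
toy1 <- data.frame(cluster = c(5, 6, 6, 6),
                   description = c("LRA", "CAW", "CAW", "CAW"),
                   row.names = c("Duster 360", "Merc 240D", "Merc 230", "Merc 280"),
                   stringsAsFactors = FALSE)
toy2 <- data.frame(cluster = c(5, 5, 5, 6),
                   description = c("LRA", "CAW", "CAW", "CAW"),
                   row.names = c("Duster 360", "Merc 240D", "Merc 230", "Merc 280"),
                   stringsAsFactors = FALSE)

# hypothetical helper: the other cars sharing a cluster with `car` in one table
partnersOf <- function(dat, car) {
    cars <- row.names(dat)
    i <- match(car, cars)
    if (is.na(i)) return(character(0))  # car missing from this table: no partners
    cars[dat$cluster == dat$cluster[i] & cars != car]
}

# hypothetical helper: partners that share a cluster with each car in every table
commonPartners <- function(datas) {
    cars <- row.names(datas[[1]])
    common <- lapply(cars, function(car) {
        # intersect the partner sets of `car` over all tables
        Reduce(intersect, lapply(datas, partnersOf, car = car))
    })
    data.frame(nr_partners = vapply(common, length, integer(1)),
               partners = vapply(common, paste, character(1), collapse = ", "),
               row.names = cars,
               stringsAsFactors = FALSE)
}

resCommon <- commonPartners(list(toy1, toy2))
print(resCommon)
```

With 20 tables, the same call should work unchanged as long as all of them are passed in the list, since Reduce applies intersect across however many partner sets it is given.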

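I am not sure exactly what you want to do, but this should give you the principle to follow, including for a large number of data sets.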