Question

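I have a question similar to this one: Remove duplicate rows in R

In my case, I want to keep all the columns (unlike the suggestion there, which uses the unique function on the first 3 columns). I want to consider only 2 columns of the data frame and keep just 1 row whenever the "values" in those two columns are the same.

The data looks like this: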

structure(list(P1 = structure(c(1L, 1L, 3L, 3L, 5L, 5L, 5L, 5L, 
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L), .Label = c("Apple", 
"Grape", "Orange", "Peach", "Tomato"), class = "factor"), P2 = structure(c(4L, 
4L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
1L, 6L, 6L), .Label = c("Banana", "Cucumber", "Lemon", "Orange", 
"Potato", "Tomato"), class = "factor"), P1_location_subacon = structure(c(NA, 
NA, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = c("Fridge", "Table"), class = "factor"), 
    P1_location_all_predictors = structure(c(2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L), .Label = c("Table,Desk,Bag,Fridge,Bed,Shelf,Chair", 
    "Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge"), class = "factor"), 
    P2_location_subacon = structure(c(1L, 1L, 1L, 1L, NA, NA, 
    NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Fridge", 
    "Shelf"), class = "factor"), P2_location_all_predictors = structure(c(3L, 
    3L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L), .Label = c("Shelf,Fridge", "Shelf,Fridge,Bed", 
    "Table,Shelf,Fridge"), class = "factor")), .Names = c("P1", 
"P2", "P1_location_subacon", "P1_location_all_predictors", "P2_location_subacon", 
"P2_location_all_predictors"), row.names = c(NA, -20L), class = "data.frame")

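The columns that matter are P1 and P2. Among rows that have the same fruit/vegetable, I want to keep only one. (Keep in mind that the fruit/vegetable must match in both columns.)

Example:

Before: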

       P1       P2 P1_location_subacon            P1_location_all_predictors P2_location_subacon P2_location_all_predictors
1   Apple   Orange                <NA>       Table,Shelf,Cupboard,Bed,Fridge              Fridge         Table,Shelf,Fridge
2   Apple   Orange                <NA>       Table,Shelf,Cupboard,Bed,Fridge              Fridge         Table,Shelf,Fridge
3  Orange    Lemon              Fridge                    Table,Shelf,Fridge              Fridge           Shelf,Fridge,Bed
4  Orange    Lemon              Fridge                    Table,Shelf,Fridge              Fridge           Shelf,Fridge,Bed
5  Tomato   Potato              Fridge                    Table,Shelf,Fridge                <NA>               Shelf,Fridge
6  Tomato   Potato              Fridge                    Table,Shelf,Fridge                <NA>               Shelf,Fridge
7  Tomato   Potato              Fridge                    Table,Shelf,Fridge                <NA>               Shelf,Fridge
8  Tomato   Potato              Fridge                    Table,Shelf,Fridge                <NA>               Shelf,Fridge

After:

       P1       P2 P1_location_subacon            P1_location_all_predictors P2_location_subacon P2_location_all_predictors
1   Apple   Orange                <NA>       Table,Shelf,Cupboard,Bed,Fridge              Fridge         Table,Shelf,Fridge
4  Orange    Lemon              Fridge                    Table,Shelf,Fridge              Fridge           Shelf,Fridge,Bed
5  Tomato   Potato              Fridge                    Table,Shelf,Fridge                <NA>               Shelf,Fridge

It does not matter which row is kept. It can be chosen at random.

Answer 1

Just use duplicated() on the subset of columns that you want to be unique, and use the result to subset the main data.frame. For example

dd[ !duplicated(dd[,c("P1","P2")]) , ]
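To make the one-liner above easier to follow, here is a minimal sketch on a small toy data frame. The data and the names `is_dup`, `first_kept` and `last_kept` are made up for illustration and do not come from the question. It shows that duplicated() only flags the second and later occurrences of a (P1, P2) pair, so negating it keeps the first row of each pair together with all of its columns.

```r
# Toy data (hypothetical), mirroring the P1/P2 layout of the question.
dd <- data.frame(
  P1  = c("Apple", "Apple", "Orange", "Orange", "Tomato", "Tomato"),
  P2  = c("Orange", "Orange", "Lemon", "Lemon", "Potato", "Banana"),
  loc = c("Table", "Shelf", "Fridge", "Fridge", "Shelf", "Bed"),
  stringsAsFactors = FALSE
)

# duplicated() marks every row whose (P1, P2) pair already appeared earlier.
is_dup <- duplicated(dd[, c("P1", "P2")])

# Keep the first row of each pair; all columns are retained.
first_kept <- dd[!is_dup, ]

# fromLast = TRUE keeps the last row of each pair instead.
last_kept <- dd[!duplicated(dd[, c("P1", "P2")], fromLast = TRUE), ]

print(first_kept)
```

Since the question says it does not matter which row survives, either variant should do. The other columns (here `loc`) are never compared, so rows that differ only there still collapse into one.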

Answer 2

If dt is your data frame -

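library(data.table)
setDT(dt)  # convert dt to a data.table by reference

# Flag = offset of each row from the first row of its (P1, P2) group
dtFiltered = dt[,
   Flag := .I - min(.I),
   by = list(P1, P2)
][
   Flag == 0  # keep only the first row of each group
]
# drop the helper column again
dtFiltered = dtFiltered[,
  Flag := NULL
]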

Thanks to Frank for pointing out that I had missed P2.
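For readers without data.table, the same "offset from the first row of the group" idea can be sketched in base R. This is only an illustration under stated assumptions: the toy data, the names `key`, `flag` and `filtered`, and the "|" separator (assumed never to occur inside P1 or P2) are all hypothetical and not part of the answer above.

```r
# Toy data (hypothetical), same P1/P2 layout as in the question.
df <- data.frame(
  P1 = c("Apple", "Apple", "Tomato", "Tomato", "Tomato"),
  P2 = c("Orange", "Orange", "Potato", "Potato", "Banana"),
  stringsAsFactors = FALSE
)

# One key per (P1, P2) pair; "|" is assumed not to occur in the data.
key <- paste(df$P1, df$P2, sep = "|")

# Offset of each row from the first row of its group (0 = first row),
# the base R counterpart of Flag := .I - min(.I) grouped by P1 and P2.
flag <- ave(seq_len(nrow(df)), key, FUN = function(i) i - min(i))

# Keep only the rows with offset 0, i.e. one row per pair.
filtered <- df[flag == 0, ]
print(filtered)
```

Building a single pasted key avoids empty groups that would arise from passing P1 and P2 to ave() separately. In data.table itself, unique(dt, by = c("P1", "P2")) is a shorter way to get one row per pair.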

Answer 3

Try this:

NSManagedObject

Remove "duplicate" rows from a data frame (they differ in some columns)
