Question

Suppose I have the following df:

df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))
> df
  col1 col2 col3
1    1    2 <NA>
2    3    4 <NA>
3    1    2    c

My goal is to remove all rows that are duplicated on col1 and col2, so that the longer row "survives". In this case, the first row should be removed. I tried

df[duplicated(df[, 1:2]), ]

but this only gives me the third row (rather than the third and second rows). How can I do this properly?

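Edit: the real df has 15 columns, of which the first 13 are used to identify duplicates. In the last two columns, about 2/3 of the rows are filled with NA (the first 13 columns contain no NA). My example df is therefore misleading, since there are two columns to exclude when identifying duplicates. Sorry about that.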

Answer 1

You can try this:

library(dplyr)
df %>% group_by(col1,col2) %>%
  slice(which.min(is.na(col3)))

Or this:

df %>%
  group_by(col1,col2) %>%
  arrange(col3) %>%
  slice(1)

# # A tibble: 2 x 3
# # Groups:   col1, col2 [2]
#    col1  col2   col3
#   <dbl> <dbl> <fctr>
# 1     1     2      c
# 2     3     4     NA

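General solution

In its most general form, this solution keeps only one row per value of col1; as the comment in the code notes, add col2 to the grouping variables if needed. It assumes all NAs are on the right.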

df %>% mutate(nna = df %>% is.na  %>% rowSums) %>%
  group_by(col1) %>%         # or group_by(col1,col2)
  slice(which.min(nna)) %>%
  select(-nna)
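
The same "count the NAs per row, keep the most complete row per key" idea can also be sketched in base R. This is only a rough illustration, not a tested solution for the real 15-column data: key_cols is a hypothetical vector naming the identifying columns, and col4 is an invented all-NA column standing in for a second non-key column.

```r
# Toy data: two key columns plus two columns that may contain NA.
# col4 is made up for illustration only.
df <- data.frame(
  col1 = c(1, 3, 1),
  col2 = c(2, 4, 2),
  col3 = c(NA, NA, "c"),
  col4 = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

key_cols  <- c("col1", "col2")     # hypothetical: columns that identify duplicates
n_missing <- rowSums(is.na(df))    # number of NAs in each row

# Put the most complete rows first, then keep the first row seen for each key.
ordered <- df[order(n_missing), ]
result  <- ordered[!duplicated(ordered[, key_cols]), ]
result
```

Because the rows are ranked by their NA count rather than by position, this variant should not depend on where in the row the NAs sit. For the real data, key_cols would presumably be something like names(df)[1:13].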

Answer 2

df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))
df <- df[order(df$col3),] 

duplicates <- duplicated(df[,1:2])
duplicates_sub <- subset(df , duplicates == FALSE)  

> duplicates_sub
  col1 col2 col3
3    1    2    c
2    3    4 <NA>

Edit: keep all non-NA rows

df <- data.frame(col1 = c(1, 3, 1,3, 1), col2 = c(2, 4, 2,4, 2), col3 = c("a", NA, "c",NA, "b"))
df <- df[order(df$col3),] 
duplicates <- duplicated(df[,1:2]) & is.na(df[,3])
duplicates_sub <- subset(df , duplicates == FALSE)  

> duplicates_sub
  col1 col2 col3
1    1    2    a
5    1    2    b
3    1    2    c
2    3    4 <NA>

Answer 3

You can sort the NAs to the top or bottom before dropping the dupes:

# in base, which puts NAs last
odf = df[do.call(order, df), ]
odf[!duplicated(odf[, c("col1", "col2")]), ]

#   col1 col2 col3
# 3    1    2    c
# 2    3    4 <NA>

# or with data.table, which puts NAs first
library(data.table)
DF = setorder(data.table(df))
unique(DF, by=c("col1", "col2"), fromLast=TRUE)

#    col1 col2 col3
# 1:    1    2    c
# 2:    3    4   NA
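
For comparison, the "NAs first, keep the last duplicate" variant can be approximated in base R as well, since order accepts na.last and duplicated accepts fromLast. The following is a minimal sketch on the example df from the question, not a drop-in replacement for the data.table call:

```r
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))

# Sort by every column, placing NAs first (na.last = FALSE).
odf <- df[do.call(order, c(as.list(df), list(na.last = FALSE))), ]

# Within each (col1, col2) group, keep the last row, i.e. the one sorted after the NAs.
res <- odf[!duplicated(odf[, c("col1", "col2")], fromLast = TRUE), ]
res
```

Passing the columns through do.call relies on no column being named like an argument of order (for example decreasing), which holds for this example.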

This approach does not carry over to dplyr, since dplyr offers neither a "sort by all columns" option in arrange nor a fromLast argument in distinct.

How do I remove duplicated rows (the shorter ones) based on certain columns?
