Question

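I have a data.frame with several columns and want to filter out low-frequency data based on combinations of variables. As an example, take a sex variable (male/female) and a cholesterol variable (high/low), which the code below calls Age. My data frame then looks like this: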

set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df


  index    Sex  Age
1      1   Male High
2      2 Female High
3      3   Male High
4      4 Female High
5      5 Female High
6      6   Male High
7      7 Female High
8      8 Female High
9      9 Female  Low
10    10   Male  Low
11    11 Female High
12    12   Male High
13    13 Female High
14    14 Female High
15    15   Male  Low
16    16 Female  Low
17    17   Male High
18    18   Male  Low
19    19   Male  Low
20    20 Female  Low

Now I want to keep only the Sex/Age combinations whose frequency is greater than 3:

table(df[,2:3])
        Age
Sex      High Low
  Female    8   3
  Male      5   4

In other words, I want to keep the indices for Female-High, Male-Low, and Male-High.

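Note: 1) my real data frame has more variables than the example above, 2) I do not want to use any third-party R package, and 3) I want it to be fast.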

Answer 1

Here is a simple approach in base R:

lvls <- interaction(df$Sex, df$Age)
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]

#   index    Sex  Age
#1      1   Male High
#2      2 Female High
#3      3   Male High
#4      4 Female High
#5      5 Female High
#6      6   Male High
#7      7 Female High
#8      8 Female High
#10    10   Male  Low
#11    11 Female High
#12    12   Male High
#13    13 Female High
#14    14 Female High
#15    15   Male  Low
#17    17   Male High
#18    18   Male  Low
#19    19   Male  Low

If you have more variables, you can store their names in a vector:

vars <- c("Age", "Sex") # add more
lvls <- interaction(df[, vars])
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]
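The same idea can be wrapped in a small function so the grouping columns and the threshold become arguments. This is only a sketch: `filter_by_freq` and its argument names are hypothetical, and the data frame is rebuilt by hand from the printed example so the result does not depend on the R version's random number generator.

```r
# Example data copied from the printed data frame in the question.
df <- data.frame(
  index = 1:20,
  Sex = c("Male", "Female", "Male", "Female", "Female", "Male", "Female",
          "Female", "Female", "Male", "Female", "Male", "Female", "Female",
          "Male", "Female", "Male", "Male", "Male", "Female"),
  Age = c(rep("High", 8), "Low", "Low", rep("High", 4), "Low", "Low",
          "High", "Low", "Low", "Low")
)

# Hypothetical helper (not part of base R): keep the rows whose combination
# of the columns in `vars` occurs more than `min_count` times.
filter_by_freq <- function(data, vars, min_count) {
  grp <- interaction(data[vars], drop = TRUE)  # one level per observed combination
  counts <- table(grp)                         # frequency of each combination
  data[grp %in% names(counts)[counts > min_count], , drop = FALSE]
}

res <- filter_by_freq(df, c("Sex", "Age"), 3)
res$index
```

With the example data this should drop rows 9, 16 and 20, the three Female/Low rows.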

Here is a second base R approach, using ave:

subset(df, ave(as.integer(factor(Sex)), Sex, Age, FUN = "length") > 3)
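To see why the ave call works, it may help to look at the intermediate vector. The sketch below is a slightly simplified variant of the one-liner (it passes row numbers instead of `as.integer(factor(Sex))`, which is an assumption of mine and not the answer's exact code): `ave` applies `length` within each Sex/Age group and returns the group size once per row.

```r
# Example data copied from the printed data frame in the question.
df <- data.frame(
  index = 1:20,
  Sex = c("Male", "Female", "Male", "Female", "Female", "Male", "Female",
          "Female", "Female", "Male", "Female", "Male", "Female", "Female",
          "Male", "Female", "Male", "Male", "Male", "Female"),
  Age = c(rep("High", 8), "Low", "Low", rep("High", 4), "Low", "Low",
          "High", "Low", "Low", "Low")
)

# For every row, the number of rows sharing its Sex/Age combination.
# The values passed as the first argument do not matter, only their grouping.
grp_size <- ave(seq_len(nrow(df)), df$Sex, df$Age, FUN = length)

# Keep rows whose group has more than 3 members.
kept <- df[grp_size > 3, ]
kept$index
```

Because the result is a plain logical index, the original row order is preserved.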

Answer 2

OK, here is a base R option:

set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df

merge(
    df
    , aggregate(rep(1, nrow(df)), by = df[,c("Sex", "Age")], sum)
    , by = c("Sex", "Age")
)

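The aggregate call sums the 1s within each Sex/Age combination, which gives the number of rows per combination; merge then attaches that count to every row of df.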
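The merged result still contains every row, so a filtering step is needed to answer the question. A minimal sketch of that last step, assuming the count column gets the default name `x` that aggregate assigns to an unnamed vector:

```r
# Example data copied from the printed data frame in the question.
df <- data.frame(
  index = 1:20,
  Sex = c("Male", "Female", "Male", "Female", "Female", "Male", "Female",
          "Female", "Female", "Male", "Female", "Male", "Female", "Female",
          "Male", "Female", "Male", "Male", "Male", "Female"),
  Age = c(rep("High", 8), "Low", "Low", rep("High", 4), "Low", "Low",
          "High", "Low", "Low", "Low")
)

# Count rows per Sex/Age combination; the count lands in a column named "x".
counts <- aggregate(rep(1, nrow(df)), by = df[, c("Sex", "Age")], sum)

# Attach the count to every row, then keep rows from groups larger than 3.
merged <- merge(df, counts, by = c("Sex", "Age"))
res <- merged[merged$x > 3, c("index", "Sex", "Age")]

# merge() sorts by the key columns, so restore the original row order.
res <- res[order(res$index), ]
res$index
```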

Answer 3

We can do this with data.table, which should also be efficient:

library(data.table)
setDT(df)[, .SD[.N > 3], .(Sex, Age)]

A variant based on .I (row indices) instead of .SD is also possible.

Answer 4

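This answer uses dplyr.

It is not a base R solution, even though the OP asked for one. I thought it might be useful to future users who do not have that restriction.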

Answer 5

vars     <- c("Sex","Age")
max_freq <- 3
new_df   <- merge(df, subset(as.data.frame(table(df[,vars])),Freq>max_freq)[1:2])

new_df
#       Sex  Age index
# 1  Female High     2
# 2  Female High     7
# 3  Female High    14
# 4  Female High    11
# 5  Female High     5
# 6  Female High     4
# 7  Female High    13
# 8  Female High     8
# 9    Male High     6
# 10   Male High     3
# 11   Male High     1
# 12   Male High    17
# 13   Male High    12
# 14   Male  Low    10
# 15   Male  Low    15
# 16   Male  Low    18
# 17   Male  Low    19
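As the output shows, merge moves the key columns to the front and sorts by them. If the original column and row order matter, one extra line can restore them. The sketch below repeats the answer's steps with the intermediate frequency table made explicit; the names `freq` and `keep` are mine, and the threshold 3 is written inline.

```r
# Example data copied from the printed data frame in the question.
df <- data.frame(
  index = 1:20,
  Sex = c("Male", "Female", "Male", "Female", "Female", "Male", "Female",
          "Female", "Female", "Male", "Female", "Male", "Female", "Female",
          "Male", "Female", "Male", "Male", "Male", "Female"),
  Age = c(rep("High", 8), "Low", "Low", rep("High", 4), "Low", "Low",
          "High", "Low", "Low", "Low")
)

vars <- c("Sex", "Age")

# One row per combination, with its frequency in a column named Freq.
freq <- as.data.frame(table(df[, vars]))

# Combinations that occur more than 3 times (key columns only).
keep <- subset(freq, Freq > 3)[vars]

# Inner join on the shared columns, then restore row and column order.
new_df <- merge(df, keep)
new_df <- new_df[order(new_df$index), names(df)]
new_df$index
```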

Efficient way to filter low-frequency data in a data frame in R
