Subset groups with a condition on the number of NAs

Time: 2016-11-25 21:19:26

Tags: r

Part of my data looks like this:

      CUSIP  yearmon datafqtr PRIMEXCH       date   PRC  VOL       RET
1: 00003210 Nov 1970  1970 Q4        A 1970-11-16 9.875 3400 -0.091954
2: 00003210 Nov 1970  1970 Q4        A 1970-11-17 8.750 4100 -0.113924
3: 00003210 Nov 1970  1970 Q4        A 1970-11-18 9.125 5400  0.042857
4: 00003210 Nov 1970  1970 Q4        A 1970-11-19 9.375 3600  0.027397
5: 00003210 Nov 1970  1970 Q4        A 1970-11-20 9.625 3100  0.026667
6: 00003210 Nov 1970  1970 Q4        A 1970-11-23 9.250 1500 -0.038961
   SHROUT NUMTRD    vwretd   ceqq        S          A           A0
1:   2655     NA -0.001385 10.544 24558.75 0.05144521 2.094781e-06
2:   2655     NA  0.000824 10.544 24558.75 0.05144521 2.094781e-06
3:   2655     NA -0.007519 10.544 24558.75 0.05144521 2.094781e-06
4:   2655     NA  0.001180 10.544 24558.75 0.05144521 2.094781e-06
5:   2655     NA  0.009683 10.544 24558.75 0.05144521 2.094781e-06
6:   2655     NA  0.006372 10.544 24558.75 0.05144521 2.094781e-06
        Aplus     Aminus Aplus.market Aminus.market          BTM
1: 0.03421433 0.06293247   0.05269694    0.04643831 0.0004293378
2: 0.03421433 0.06293247   0.05269694    0.04643831 0.0004293378
3: 0.03421433 0.06293247   0.05269694    0.04643831 0.0004293378
4: 0.03421433 0.06293247   0.05269694    0.04643831 0.0004293378
5: 0.03421433 0.06293247   0.05269694    0.04643831 0.0004293378
6: 0.03421433 0.06293247   0.05269694    0.04643831 0.0004293378
    RET.month MOM1 MOM2 MOM3 MOM4
1: -0.1724146   NA   NA   NA   NA
2: -0.1724146   NA   NA   NA   NA
3: -0.1724146   NA   NA   NA   NA
4: -0.1724146   NA   NA   NA   NA
5: -0.1724146   NA   NA   NA   NA
6: -0.1724146   NA   NA   NA   NA

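Each combination of CUSIP and yearmon forms a separate group, and the observations are at daily frequency. I want to keep all observations from groups that have no more than 5 missing values in the variable VOL. That is, if a particular CUSIP in a particular month (yearmon) has more than 5 missing values in VOL, all of that CUSIP's observations for that month should be dropped from the data.

1 answer:

Answer 0: (score: 1)

I offer both dplyr and base-R approaches.

dplyr

I'll give an example using dplyr, though this is easily done with other data.frame-management methods (e.g., base R, data.table).

Since your data isn't usable (to me), I'll make some: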

library(dplyr)

n <- 50
set.seed(42)
# tibble() replaces the deprecated data_frame()
dat <- tibble(
  CUSIP = sample(c("0001", "0002"), size = n, replace = TRUE),
  yearmon = sample(c("Nov 1970", "Dec 1970"), size = n, replace = TRUE),
  VOL = sample(10000, size = n, replace = TRUE)
)
# set roughly 20% of VOL to NA
dat$VOL <- ifelse(runif(n) < 0.2, NA, dat$VOL)
str(dat)
# (output as originally posted; sampled values vary with the R version)
# Classes 'tbl_df', 'tbl' and 'data.frame': 50 obs. of  3 variables:
#  $ CUSIP  : chr  "0002" "0002" "0001" "0002" ...
#  $ yearmon: chr  "Nov 1970" "Nov 1970" "Nov 1970" "Dec 1970" ...
#  $ VOL    : int  6263 2172 2166 3890 9425 9627 NA NA NA 23 ...

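This includes two CUSIPs and two yearmons, each combination with a variable number of missing VOL values. (This data produces one group with exactly 5 NAs and none with more, so I'll take some liberties here and say you want "no more than 4 NAs." This liberty only simplifies a contrived example; it has no bearing on how the code performs with your actual data.)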

# demonstrate at least one group with >= 5 NAs
dat %>%
  arrange(CUSIP, yearmon) %>%
  group_by(CUSIP, yearmon) %>%
  summarize(n = sum(is.na(VOL)))
# Source: local data frame [4 x 3]
# Groups: CUSIP [?]
#   CUSIP  yearmon     n
#   <chr>    <chr> <int>
# 1  0001 Dec 1970     4
# 2  0001 Nov 1970     2
# 3  0002 Dec 1970     5
# 4  0002 Nov 1970     4

By your logic, we should completely drop the data for CUSIP 0002 in Dec 1970.

# same code with the new filter added
dat %>%
  arrange(CUSIP, yearmon) %>%
  group_by(CUSIP, yearmon) %>%
  filter(sum(is.na(VOL)) < 5) %>%
  summarize(n = sum(is.na(VOL)))
# Source: local data frame [3 x 3]
# Groups: CUSIP [?]
#   CUSIP  yearmon     n
#   <chr>    <chr> <int>
# 1  0001 Dec 1970     4
# 2  0001 Nov 1970     2
# 3  0002 Nov 1970     4

This code is for demonstration only; the code you actually use should be as simple as:

VOL_NA_limit <- 5
newdat <- dat %>%
  group_by(CUSIP, yearmon) %>%
  filter(sum(is.na(VOL)) <= VOL_NA_limit)
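
To make the grouped filter easier to follow, here is a minimal sketch on a hand-made table. The names toy, limit, and kept are hypothetical and not part of the original answer. Inside a grouped filter(), the condition evaluates to a single TRUE or FALSE per group, so each group is kept or dropped as a whole.

```r
library(dplyr)

# Hypothetical toy data: group A has 2 NAs in VOL, group B has 1
toy <- tibble(
  CUSIP   = c("A", "A", "A", "B", "B", "B"),
  yearmon = "Nov 1970",
  VOL     = c(100, NA, NA, 200, NA, 300)
)

limit <- 1  # assumed threshold for this illustration only

kept <- toy %>%
  group_by(CUSIP, yearmon) %>%
  # one logical value per group, recycled to every row of that group
  filter(sum(is.na(VOL)) <= limit) %>%
  ungroup()

print(kept)
```

Group A exceeds the limit and disappears entirely; group B keeps all three rows, including the one where VOL is NA.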

Base R

If you would rather not use dplyr, the same thing can be done with by and rbind:

# keep a group only if it has at most VOL_NA_limit missing VOL values
do.call("rbind", by(dat, list(dat$CUSIP, dat$yearmon), function(df) {
  if (sum(is.na(df$VOL)) <= VOL_NA_limit) df else NULL
}))

Or with split and Filter:

# same threshold test, applied to each piece produced by split()
do.call("rbind",
        Filter(function(df) sum(is.na(df$VOL)) <= VOL_NA_limit,
               split(dat, list(dat$CUSIP, dat$yearmon))))
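
A third base-R variant, which is not part of the original answer, is sketched below on hypothetical toy data (the names toy, limit, na_per_group, and kept are assumptions for illustration). It uses ave() to attach each group's NA count to every row, then subsets with an ordinary logical index, so the original row order is preserved.

```r
# Hypothetical toy data: group A has 2 NAs in VOL, group B has 1
toy <- data.frame(
  CUSIP   = c("A", "A", "A", "B", "B", "B"),
  yearmon = "Nov 1970",
  VOL     = c(100, NA, NA, 200, NA, 300),
  stringsAsFactors = FALSE
)

limit <- 1  # assumed threshold for this illustration only

# per-row count of missing VOL values within each CUSIP/yearmon group
na_per_group <- ave(as.integer(is.na(toy$VOL)),
                    toy$CUSIP, toy$yearmon, FUN = sum)

# keep only rows whose group is within the limit
kept <- toy[na_per_group <= limit, ]
print(kept)
```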

Both the by and the split approaches are faster than the dplyr approach, but not impressively so.
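
The data in the question prints like a data.table, so a data.table version may also be useful. The following is a sketch only, on hypothetical toy data (toy, limit, and kept are assumed names), and it makes no claim about speed relative to the approaches above. Returning .SD for groups that pass the test, and nothing otherwise, drops the failing groups.

```r
library(data.table)

# Hypothetical toy data: group A has 2 NAs in VOL, group B has 1
toy <- data.table(
  CUSIP   = c("A", "A", "A", "B", "B", "B"),
  yearmon = "Nov 1970",
  VOL     = c(100, NA, NA, 200, NA, 300)
)

limit <- 1  # assumed threshold for this illustration only

# for each CUSIP/yearmon group, return its rows (.SD) only when the
# number of missing VOL values is within the limit
kept <- toy[, if (sum(is.na(VOL)) <= limit) .SD, by = .(CUSIP, yearmon)]
print(kept)
```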
