Sampling n rows with two grouping variables

Time: 2017-09-06 19:45:47

Tags: r dplyr conditional

I have a dataset that looks like this:

library(dplyr)  # provides %>%, group_by() and sample_n() used below

d=data.frame(ID = 1:7,
             Group1=c('A','C','B','C','C','A','B'),
             Group2=c('B','A','C','B','B','B','D'))

 ID Group1 Group2
  1      A      B
  2      C      A
  3      B      C
  4      C      B
  5      C      B
  6      A      B
  7      B      D

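I need to randomly sample one case per Group1 value. Group1 has three types (A, B, and C), and I need to draw one row from each type.

At the same time, no Group2 value may appear more than once in the sample.

For example, if I sample based on Group1 only:

dsample = d %>% group_by(Group1) %>% sample_n(size=1)

then the sample might look like this: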

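 ID Group1 Group2
  1      A      B
  7      B      D
  4      C      B

Here B appears twice in the sample's Group2 column. To avoid repeating a Group2 type while still sampling by Group1, the draw for group C should pick ID = 2, so that the sample looks like this:

 ID Group1 Group2
  1      A      B
  7      B      D
  2      C      A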
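The constraint described above can be written as a one-line check. This is only a sketch: `has_unique_group2` is a hypothetical helper name that does not appear in the original question, and the two samples are built by hand from the IDs shown in the tables.

```r
# Data from the question
d <- data.frame(ID = 1:7,
                Group1 = c('A','C','B','C','C','A','B'),
                Group2 = c('B','A','C','B','B','B','D'))

# Hypothetical helper: TRUE when no Group2 value occurs more than once
has_unique_group2 <- function(s) anyDuplicated(s$Group2) == 0

bad  <- d[d$ID %in% c(1, 7, 4), ]  # the sample in which B is repeated
good <- d[d$ID %in% c(1, 7, 2), ]  # the desired sample

has_unique_group2(bad)   # FALSE
has_unique_group2(good)  # TRUE
```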

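3 answers:

Answer 0: (score: 1)

One possible approach: keep resampling until you get the desired result (or until you have failed often enough to conclude that the desired result cannot be reached):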

library(dplyr)

# data
d=data.frame(ID = rep(1:7,1), 
             Group1=c('A','C','B','C','C','A','B'),
             Group2=c('B','A','C','B','B','B','D'))

# first attempt
dsample = d %>% group_by(Group1) %>% sample_n(size=1)

# if first attempt doesn't work, try again & again (I put an upper limit at 100 runs)
i = 1
while(length(unique(dsample$Group2)) < nrow(dsample) && i < 100){
  dsample = d %>% group_by(Group1) %>% sample_n(size=1)
  i = i + 1
}

> dsample
# A tibble: 3 x 3
# Groups:   Group1 [3]
     ID Group1 Group2
  <int> <fctr> <fctr>
1     1      A      B
2     3      B      C
3     2      C      A

If the desired unique combination cannot be obtained:

# example where "A" & "B" in Group 1 both have only "A" as Group2 values
d2=data.frame(ID = rep(1:7,1), 
             Group1=c('A','C','B','C','C','A','B'),
             Group2=c('A','A','A','C','B','A','A'))

# same code as before
d2sample = d2 %>% group_by(Group1) %>% sample_n(size=1)

i = 1
while(length(unique(d2sample$Group2)) < nrow(d2sample) && i < 100){
  d2sample = d2 %>% group_by(Group1) %>% sample_n(size=1)
  i = i + 1
}

# fail after 100 rounds of resampling
> d2sample
# A tibble: 3 x 3
# Groups:   Group1 [3]
     ID Group1 Group2
  <int> <fctr> <fctr>
1     6      A      A
2     7      B      A
3     5      C      B
> i
[1] 100
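The retry loop above can also be wrapped in a function that reports failure explicitly rather than leaving an invalid sample in place. The following is a minimal sketch, not the answerer's code: the names `sample_one_per_group`, `sample_unique_group2`, and `max_tries` are assumptions, and it uses base R in place of `group_by()` and `sample_n()` so that it runs without dplyr.

```r
d <- data.frame(ID = 1:7,
                Group1 = c('A','C','B','C','C','A','B'),
                Group2 = c('B','A','C','B','B','B','D'))
d2 <- data.frame(ID = 1:7,
                 Group1 = c('A','C','B','C','C','A','B'),
                 Group2 = c('A','A','A','C','B','A','A'))

# Hypothetical helper: draw one random row per Group1 value (base R)
sample_one_per_group <- function(df) {
  idx <- vapply(split(seq_len(nrow(df)), df$Group1, drop = TRUE),
                function(rows) rows[sample.int(length(rows), 1)],
                integer(1))
  df[idx, ]
}

# Hypothetical wrapper: retry up to max_tries times and
# return NULL if no sample with unique Group2 values was found
sample_unique_group2 <- function(df, max_tries = 100) {
  for (i in seq_len(max_tries)) {
    s <- sample_one_per_group(df)
    if (anyDuplicated(s$Group2) == 0) return(s)
  }
  NULL
}

set.seed(1)
sample_unique_group2(d)    # a 3-row sample with distinct Group2 values
sample_unique_group2(d2)   # NULL: groups A and B only ever offer Group2 = "A"
```

Returning `NULL` lets the caller tell "no valid sample found" apart from a valid result without inspecting the loop counter.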

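Answer 1: (score: 1)

My first thought was a loop, but then I realized we can approach the sampling from a different angle. A better solution is to sample one row at a time, and then draw each following row from a pool that contains only rows whose Group1 and Group2 values have not been sampled yet. This should be much faster: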

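f <- function(){
  # first row: drawn from the full data (requires dplyr for sample_n)
  x <- sample_n(d,1)
  # later rows: drawn only from rows whose Group1 and Group2 are both still unused
  x <- rbind(x,sample_n(d[which(!d$Group1 %in% x$Group1 & !d$Group2 %in% x$Group2),],1))
  x <- rbind(x,sample_n(d[which(!d$Group1 %in% x$Group1 & !d$Group2 %in% x$Group2),],1))
  print(x)
}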

f()

  ID Group1 Group2
6  6      A      B
2  2      C      A
3  3      B      C

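If you know that at least 2 distinct valid samples exist, this gives a random, non-repeating output on each run, as long as the filtered pool does not run empty partway through; if it does, the function has to be rerun.

If anyone has a suggestion for repeating the step inside the function more concisely, feel free to let me know. Overall, though, this approach seems likely to be the most efficient.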
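One way to repeat the step more concisely is a loop that shrinks the pool after every draw. The sketch below is an assumption-laden variant, not the answerer's code: `sample_sequential` and `max_restarts` are hypothetical names, and it uses base R instead of `sample_n()`. It also restarts when the pool runs empty before every Group1 value is covered. With the question's data that happens, for example, after drawing ID 3 (B, C) and then ID 4 (C, B), because both rows of group A have Group2 = B.

```r
d <- data.frame(ID = 1:7,
                Group1 = c('A','C','B','C','C','A','B'),
                Group2 = c('B','A','C','B','B','B','D'))

# Hypothetical helper: draw rows one at a time, removing every row that
# shares a Group1 or Group2 value with the row just drawn; restart on dead ends
sample_sequential <- function(df, max_restarts = 100) {
  n_groups <- length(unique(df$Group1))
  for (attempt in seq_len(max_restarts)) {
    picked <- df[0, ]   # empty data frame with the same columns
    pool <- df
    while (nrow(picked) < n_groups && nrow(pool) > 0) {
      row <- pool[sample.int(nrow(pool), 1), ]
      picked <- rbind(picked, row)
      pool <- pool[!(pool$Group1 %in% row$Group1) &
                   !(pool$Group2 %in% row$Group2), ]
    }
    if (nrow(picked) == n_groups) return(picked)  # every Group1 value covered
  }
  NULL  # no valid sample found within max_restarts attempts
}

set.seed(1)
sample_sequential(d)
```

The loop adapts to any number of Group1 values, so the `rbind()` line does not have to be copied once per group.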

Answer 2: (score: 0)

Try this recursive function.

Your data

d=data.frame(ID = rep(1:7,1), 
             Group1=c('A','C','B','C','C','A','B'),
             Group2=c('B','A','C','B','B','B','D'))

dsample = d %>% group_by(Group1) %>% sample_n(size=1)

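Function

# ans:        Group2 values chosen so far
# allowed:    candidate Group2 values
# restricted: Group1 value of each sampled row; the value chosen at a
#             given position must also differ from it
# counter:    current position; end: number of values to choose
myfun <- function(ans, allowed, restricted, counter, end) {
  allowed <- setdiff(allowed, ans)                    # drop values already used
  allowed1 <- setdiff(allowed, restricted[counter])   # also drop this position's Group1 value
  if (length(allowed) == 0 || counter > end) {
    if (length(ans) < end) {
      ans <- c(ans, rep(NA, end - length(ans)))       # pad with NA if candidates ran out
    }
    return(ans)
  } else {
    counter <- counter + 1
    ans <- c(ans, sample(allowed1, 1))
    myfun(ans, allowed, restricted, counter, end)
  }
}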

Output of 10 trials

replicate(10,myfun(ans=NULL, unique(d$Group2), dsample$Group1, 1, nrow(dsample)))

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "D"  "C"  "B"  "C"  "B"  "B"  "B"  "B"  "D"  "D"  
[2,] "C"  "D"  "D"  "A"  "D"  "C"  "C"  "A"  "A"  "C"  
[3,] "A"  "B"  "A"  "D"  "A"  "A"  "D"  "D"  "B"  "B"

Note that the output of each replicate is organized by column.
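The column layout comes from `replicate()`, which binds equal-length results into a matrix with one column per trial. As a small illustration (using a plain `sample()` call as a stand-in for `myfun`, which is an assumption made only to keep the sketch self-contained), `t()` turns the result into one trial per row:

```r
set.seed(1)
# Stand-in for myfun: each trial draws 3 distinct labels
trials <- replicate(10, sample(c("A", "B", "C", "D"), 3))

dim(trials)          # 3 rows, 10 columns: one column per trial
by_row <- t(trials)  # transpose: one trial per row
dim(by_row)          # 10 rows, 3 columns
```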