随机抽样的字符向量,没有相互加前缀的元素

时间:2015-06-10 23:42:17

标签: r performance combinatorics

考虑一个字符向量pool,其元素是(零填充)二进制数,最多为max_len个数字。

max_len <- 4
pool <- unlist(lapply(seq_len(max_len), function(x) 
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))

pool
##  [1] "0"    "1"    "00"   "10"   "01"   "11"   "000"  "100"  "010"  "110" 
## [11] "001"  "101"  "011"  "111"  "0000" "1000" "0100" "1100" "0010" "1010"
## [21] "0110" "1110" "0001" "1001" "0101" "1101" "0011" "1011" "0111" "1111"

我想对这些元素的n进行抽样,其约束条件是任何采样元素都不是任何其他采样元素的prefixes(即如果我们采样1101 ,我们禁止111110,而如果我们采样1,我们会禁止以1开头的元素,例如10 },11100等。)

以下是我使用while的尝试,但当n很大(或接近2^max_len时)时,这种情况很慢。

set.seed(1)
n <- 10
chosen <- sample(pool, n)
while(any(rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1)) {
  prefixes <- rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1
  pool <- pool[rowSums(Vectorize(grepl, 'pattern')(
    paste0('^', chosen[!prefixes]), pool)) == 0]
  chosen <- c(chosen[!prefixes], sample(pool, sum(prefixes)))
}

chosen
## [1] "0100" "0101" "0001" "0011" "1000" "111"  "0000" "0110" "1100" "0111"

最初可以从pool中删除那些包含在pool中的元素不足以获取大小为n的总样本的元素,从而略微改善这一点。例如,当max_len = 4n > 9时,我们可以立即从0删除1pool,因为通过包含其中一个,最大样本将为9(或者0以及以1开头的八个4个字符元素,或1以及以0开头的八个4个字符元素。

基于这个逻辑,我们可以在获取初始样本之前省略pool中的元素,例如:

pool <- pool[
  nchar(pool) > tail(which(n > (2^max_len - rev(2^(0:max_len))[-1] + 1)), 1)]

有人能想到更好的方法吗?我觉得我忽略了一些更简单的东西。

修改

为了澄清我的意图,我将把游戏池描绘成一组分支,其中交汇点和提示是节点(pool的元素)。假设绘制了下图中的黄色节点(即010)。现在,从池中删除整个红色“分支”,它由节点0,01和010组成。这就是我的意思,禁止对我们样本中已经存在“前缀”节点的节点进行采样(以及我们的示例中已经前缀的那些节点)。

enter image description here

如果采样节点位于分支的中间位置(如下图中的01),则不允许所有红色节点(0,01,010和011),因为0前缀为01,01前缀为010和011。

enter image description here

我并不是要在每个交叉点(或者沿着分支在硬叉上翻转硬币)对 1 0进行采样 - 这两者都可以样本,只要:(1)节点的父母(或祖父母等)或子女(孙子等)尚未采样; (2)在对节点进行采样时,将剩余足够的节点以实现所需的大小为n的样本。

在上面的第二个图中,如果010是第一个选择,则黑色节点上的所有节点仍然(当前)有效,假设为n <= 4。例如,如果n==4我们接下来采样节点1(因此我们的选择现在包括01和1),我们随后将禁止节点00(由于上面的规则2)但仍然可以选择000和001,给我们我们的4元素样本。另一方面,如果n==5,则在此阶段不允许节点1。

8 个答案:

答案 0 :(得分:19)

简介

这是我们在另一个答案中实现的字符串算法的数字变体。它更快,不需要创建或排序池。

算法大纲

我们可以使用整数来表示二进制字符串,这极大地简化了池生成和顺序消除值的问题。例如,对于max_len==3,我们可以将数字1--(其中-表示填充)表示为十进制的4。此外,如果我们选择此号码,我们可以确定需要排除的号码是44 + 2 ^ x - 1之间的号码。此处x是填充元素的数量(在本例中为2),因此要消除的数字介于44 + 2 ^ 2 - 1之间(或4和{{1}之间},表示为7100110)。

为了完全匹配您的问题,我们需要一点皱纹,因为您处理的数字可能与二进制相同,因为算法的某些部分是不同的。例如,11110010-都是相同的数字,但需要在您的方案中区别对待。在1--世界中,我们有8个可能的数字,但有14个可能的表示形式:

max_len==3

所以0和4有三种可能的编码,2和6有两种,其他所有只有一种。我们需要生成一个整数池,表示具有多个表示的数字的更高选择概率,以及跟踪该数字包括多少空白的机制。我们可以通过在数字末尾添加几位来表示我们想要的权重。所以我们的数字变成了(我们在这里使用两位):

0 - 000: 0--, 00-
1 - 001:
2 - 010: 01-
3 - 011:
4 - 100: 1--, 10-
5 - 101:
6 - 110: 11-
7 - 111:

一些有用的属性:

  • jbaum | int | bin | bin.enc | int.enc 0-- | 0 | 000 | 00000 | 0 00- | 0 | 000 | 00001 | 1 000 | 0 | 000 | 00010 | 2 001 | 1 | 001 | 00100 | 3 01- | 2 | 010 | 01000 | 4 010 | 2 | 010 | 01001 | 5 011 | 3 | 011 | 01101 | 6 1-- | 4 | 100 | 10000 | 7 10- | 4 | 100 | 10001 | 8 100 | 4 | 100 | 10010 | 9 101 | 5 | 101 | 10100 | 10 11- | 6 | 110 | 11000 | 11 110 | 6 | 110 | 11001 | 12 111 | 7 | 111 | 11100 | 13 表示编码需要多少位(本例中为两个)
  • enc.bits告诉我们有多少数字是明确指定的
  • int.enc %% enc.bits返回int.enc %/% enc.bits
  • int返回int * 2 ^ enc.bits + explicitly.specified

请注意,int.enc在我们的实现中位于explicitly.specified0之间,因为始终至少指定了一位数字。我们现在有一个仅使用整数完全表示数据结构的编码。我们可以从整数中采样并使用正确的权重等来重现您期望的结果。这种方法的一个限制是我们在R中使用32位整数,并且我们必须为编码保留一些位,因此我们将自己限制为max_len - 1左右。如果你使用双精度浮点数指定的整数,你可能会变大,但我们在这里没有这样做。

避免重复选择

有两种粗略的方法可以确保我们不会选择相同的值两次

  1. 跟踪可供选择的值,并从
  2. 中随机抽样
  3. 从所有可能的值中随机抽样,然后检查该值是否先前已被选中,如果有,则再次采样
  4. 虽然第一个选项看起来最干净,但实际上它的计算成本非常高。它要求对每个拾取进行矢量扫描,以预先取消所选值的取消,或者创建包含非取消限定值的收缩矢量。缩小选项仅比矢量扫描更有效,如果矢量通过C代码通过引用缩小,但即使这样,它也需要重复转换矢量的潜在大部分,并且它需要C。

    这里我们使用方法#2。这允许我们随机地随机改变可能值的范围,然后按顺序选择每个值,检查它是否已被取消资格,如果有,则选择另一个,等等。这是有效的,因为检查是否一个是微不足道的由于我们的值编码,我们选择了一个值; 我们可以根据单独的值推断排序表中值的位置。因此,我们在排序表中记录每个值的状态,并可以通过直接索引访问更新或查找该状态(无需扫描)。

    实施例

    此算法在基础R中的实现可以 a gist 获得。这个特定的实现只能提取完整的绘图。以下是来自max_len==25池的8个8个元素的抽样示例:

    max_len==4

    我们最初还有两个依赖方法#1来避免重复的实现,一个在基础R中,一个在C中,但即使是C版本也不像新的基本R版本{{{ 1}}很大。这些函数确实实现了绘制不完整绘图的能力,因此我们在此提供它们以供参考:

    比较基准

    这是一组基准测试,比较了本Q / A中显示的几个功能。以毫秒为单位的时间。 # each column represents a draw from a `max_len==4` pool set.seed(6); replicate(10, sample0110b(4, 8)) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] "1000" "1" "0011" "0010" "100" "0011" "0" "011" "0100" "1011" [2,] "111" "0000" "1101" "0000" "0110" "0100" "1000" "00" "0101" "1001" [3,] "0011" "0110" "1001" "0100" "0000" "0101" "1101" "1111" "10" "1100" [4,] "0100" "0010" "0000" "0101" "1101" "101" "1011" "1101" "0110" "1101" [5,] "101" "0100" "1100" "1100" "0101" "1001" "1001" "1000" "1111" "1111" [6,] "110" "0111" "1011" "111" "1011" "110" "1111" "0100" "0011" "000" [7,] "0101" "0101" "111" "011" "1010" "1000" "1100" "101" "0001" "0101" [8,] "011" "0001" "01" "1010" "0011" "1110" "1110" "1001" "110" "1000" 版本是本答案中描述的版本。 n是最初的实现,brodie.b是原始实现的C.所有这些都强制要求完整的样本。 brodie是另一个答案中基于字符串的版本。

    brodie.C

    这对较大的游泳池来说相对较好

    brodie.str

    基准笔记:

    • frank,brodie(减去brodie.str)不需要任何预生成池,这将影响比较(见下文)
    • Josilber是LP版本
    • jbaum是OP示例
    • tensibai稍微修改为退出而不是在池为空时失败
    • 没有设置运行python,因此无法比较/帐户缓冲
    • size n jbaum josilber frank tensibai brodie.b brodie brodie.C brodie.str 1 4 10 11 1 3 1 1 1 1 0 2 4 50 - - - 1 - - - 1 3 4 100 - - - 1 - - - 0 4 4 256 - - - 1 - - - 1 5 4 1000 - - - 1 - - - 1 6 8 10 1 290 6 3 2 2 1 1 7 8 50 388 - 8 8 3 4 3 4 8 8 100 2,506 - 13 18 6 7 5 5 9 8 256 - - 22 27 13 14 12 6 10 8 1000 - - - 27 - - - 7 11 16 10 - - 615 688 31 61 19 424 12 16 50 - - 2,123 2,497 28 276 19 1,764 13 16 100 - - 4,202 4,807 30 451 23 3,166 14 16 256 - - 11,822 11,942 40 1,077 43 8,717 15 16 1000 - - 38,132 44,591 83 3,345 130 27,768 代表不可行的选择或合理的时间太慢

    时间不包括为大小system.time(sample0110b(18, 100000)) user system elapsed 8.441 0.079 8.527 -和{{1}分别绘制池0.82.5401毫秒}),这对于4816运行是必需的,或者对它们进行排序(jbaumjosilberbrodie.str毫秒大小为0.12.73700),除了平局之外,4也是必需的。是否要包含这些取决于您为特定池运行该函数的次数。此外,几乎可以肯定有更好的方法来生成/排序池。

    这是使用8进行三次运行的中间时间。代码为 available as a gist ,但请注意,您必须加载16brodie.strmicrobenchmark和{{1}事先确定函数。

答案 1 :(得分:15)

我发现这个问题很有趣所以我尝试用R中的技能非常低来获得这个(所以这可能会有所改进):

较新的编辑版本,感谢@Franck建议:

library(microbenchmark)
library(lineprof)

max_len <- 16
pool <- unlist(lapply(seq_len(max_len), function(x) 
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
n<-100

library(stringr)
tree_sample <- function(samples,pool) {
  results <- vector("integer",samples)
  # Will be used on a regular basis, compute it in advance
  PoolLen <- str_length(pool)
  # Make a mask vector based on the length of each entry of the pool
  masks <- strtoi(str_pad(str_pad("1",PoolLen,"right","1"),max_len,"right","0"),base=2)

  # Make an integer vector from "0" right padded orignal: for max_len=4 and pool entry "1" we get "1000" => 8
  # This will allow to find this entry as parent of 10 and 11 which become "1000" and "1100", as integer 8 and 12 respectively
  # once bitwise "anded" with the repective mask "1000" the first bit is striclty the same, so it's a parent.
  integerPool <- strtoi(str_pad(pool,max_len,"right","0"),base=2)

  # Create a vector to filter the available value to sample
  ok <- rep(TRUE,length(pool))

  #Precompute the result of the bitwise and betwwen our integer pool and the masks   
  MaskedPool <- bitwAnd(integerPool,masks)

  while(samples) {
    samp <- sample(pool[ok],1) # Get a sample
    results[samples] <- samp # Store it as result
    ok[pool == samp] <- FALSE # Remove it from available entries

    vsamp <- strtoi(str_pad(samp,max_len,"right","0"),base=2) # Get the integer value of the "0" right padded sample
    mlen <- str_length(samp) # Get sample len

    #Creation of unitary mask to remove childs of sample
    mask <- strtoi(paste0(rep(1:0,c(mlen,max_len-mlen)),collapse=""),base=2)

    # Get the result of bitwise And between the integerPool and the sample mask 
    FilterVec <- bitwAnd(integerPool,mask)

    # Get the bitwise and result of the sample and it's mask
    Childm <- bitwAnd(vsamp,mask)

    ok[FilterVec == Childm] <- FALSE  # Remove from available entries the childs of the sample
    ok[MaskedPool == bitwAnd(vsamp,masks)] <- FALSE # compare the sample with all the masks to remove parents matching

    samples <- samples -1
  }
  print(results)
}
microbenchmark(tree_sample(n,pool),times=10L)

主要思想是使用bitmask comparison来知道一个样本是否是另一个样本的父(公共位部分),如果是,则从池中抑制该元素。

我的机器上从长度为16的池中抽取100个样本需要1,4秒。

答案 2 :(得分:13)

您可以对池进行排序,以帮助确定取消资格的要素。例如,查看三元素排序池:

 [1] "0"   "00"  "000" "001" "01"  "010" "011" "1"   "10"  "100" "101" "11" 
[13] "110" "111"

我可以告诉我,我可以取消所选项目之后的任何内容,这些项目的字符数多于我的项目,而第一项的字符数相同或更少。例如,如果我选择&#34; 01&#34;,我可以立即看到接下来的两个项目(&#34; 010&#34;,&#34; 011&#34;)需要删除,但不能之后的那个因为&#34; 1&#34;人物较少。删除&#34; 0&#34;事后很容易。这是一个实现:

library(fastmatch)  # could use `match`, but we repeatedly search against same hash

# `pool` must be sorted!

sample01 <- function(pool, n) {
  picked <- logical(length(pool))
  chrs <- nchar(pool)
  pick.list <- character(n)
  pool.seq <- seq_along(pool)

  for(i in seq(n)) {
    # Make sure pool not exhausted

    left <- which(!picked)
    left.len <- length(left)
    if(!length(left)) break

    # Sample from pool

    seq.left <- seq.int(left)
    pool.left <- pool[left]
    chrs.left <- chrs[left]
    pick <- sample(length(pool.left), 1L)

    # Find all the elements with more characters that are disqualified
    # and store their indices in `valid` (bad name...)

    valid.tmp <- chrs.left > chrs.left[[pick]] & seq.left > pick
    first.invalid <- which(!valid.tmp & seq.left > pick)
    valid <- if(length(first.invalid)) {
      pick:(first.invalid[[1L]] - 1L)
    } else pick:left.len

    # Translate back to original pool indices since we're working on a 
    # subset in `pool.left`

    pool.seq.left <- pool.seq[left]
    pool.idx <- pool.seq.left[valid]
    val <- pool[[pool.idx[[1L]]]]

    # Record the picked value, and all the disqualifications

    pick.list[[i]] <- val
    picked[pool.idx] <- TRUE

    # Disqualify shorter matches

    to.rem <- vapply(
      seq.int(nchar(val) - 1), substr, character(1L), x=val, start=1L
    )
    to.rem.idx <- fmatch(to.rem, pool, nomatch=0)
    picked[to.rem.idx] <- TRUE  
  }
  pick.list  
}

一个制作排序池的函数(与你的代码完全一样,但返回已排序):

make_pool <- function(size)
  sort(
    unlist(
      lapply(
        seq_len(size), 
        function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x))) 
  ) ) )

然后,使用max_len 3池(用于直观地检查事物的行为符合预期):

pool3 <- make_pool(3)
set.seed(1)
sample01(pool3, 8)
# [1] "001" "1"   "010" "011" "000" ""    ""    ""   
sample01(pool3, 8)
# [1] "110" "111" "011" "10"  "00"  ""    ""    ""   
sample01(pool3, 8)
# [1] "000" "01"  "11"  "10"  "001" ""    ""    ""   
sample01(pool3, 8)
# [1] "011" "101" "111" "001" "110" "100" "000" "010"    

请注意,在最后一种情况下,我们得到了所有3位数的二进制组合(2 ^ 3),因为我们偶然会从3位数的二进制组合中取样。此外,只有3个大小的游泳池,有许多样本可以防止完整的8次抽奖;你可以用你的建议来解决这个问题,即消除阻止从游泳池中完全抽取的组合。

这很快。查看使用备用解决方案需要2秒的max_len==9示例:

pool9 <- make_pool(9)
microbenchmark(sample01(pool9, 4))
# Unit: microseconds
#                expr     min      lq  median      uq     max neval
#  sample01(pool9, 4) 493.107 565.015 571.624 593.791 983.663   100    

大约半毫秒。您也可以合理地尝试相当大的池:

pool16 <- make_pool(16)  # 131K entries
system.time(sample01(pool16, 100))
#  user  system elapsed 
# 3.407   0.146   3.552 

这不是很快,但我们正在谈论一个拥有130K物品的游泳池。还有可能进行额外的优化。

请注意,对于大型池,排序步骤相对较慢,但我没有计算它,因为您只需要执行一次,并且您可能想出一个合理的算法来生成池前排序

还有一种更快的整数到二进制的方法的可能性,我在一个现在已删除的答案中进行了探讨,但这需要更多的工作才能与您正在寻找的内容紧密相关。

答案 3 :(得分:13)

将ID映射到字符串。您可以将数字映射到0/1向量,如@BrodieG所述:

# some key objects

n_pool      = sum(2^(1:max_len))      # total number of indices
cuts        = cumsum(2^(1:max_len-1)) # new group starts
inds_by_g   = mapply(seq,cuts,cuts*2) # indices grouped by length

# the mapping to strings (one among many possibilities)

library(data.table)
get_01str <- function(id,max_len){
    cuts = cumsum(2^(1:max_len-1))
    g    = findInterval(id,cuts)
    gid  = id-cuts[g]+1

    data.table(g,gid)[,s:=
      do.call(paste,c(list(sep=""),lapply(
        seq(g[1]), 
        function(x) (gid-1) %/% 2^(x-1) %% 2
      )))
    ,by=g]$s      
} 

查找要删除的ID。我们将从采样池中依次删除id

 # the mapping from one index to indices of nixed strings

get_nixstrs <- function(g,gid,max_len){

    cuts         = cumsum(2^(1:max_len-1))
    gids_child   = {
      x = gid%%2^sequence(g-1)
      ifelse(x,x,2^sequence(g-1))
    }
    ids_child    = gids_child+cuts[sequence(g-1)]-1

    ids_parent   = if (g==max_len) gid+cuts[g]-1 else {

      gids_par       = vector(mode="list",max_len)
      gids_par[[g]]  = gid
      for (gg in seq(g,max_len-1)) 
        gids_par[[gg+1]] = c(gids_par[[gg]],gids_par[[gg]]+2^gg)

      unlist(mapply(`+`,gids_par,cuts-1))
    }

    c(ids_child,ids_parent)
}

索引按g分组,字符数为nchar(get_01str(id))。由于索引按g排序,g=findInterval(id,cuts)是更快的路线。

g中的索引,1 < g < max_len有一个大小为g-1的“子”索引和两个大小为g+1的父索引。对于每个子节点,我们获取其子节点,直到我们点击g==1;对于每个父节点,我们采用它们的父节点对,直到我们点击g==max_len

就组中的标识符gid而言,树的结构最简单。 gid映射到父母gidgid+2^g;并且反转这个映射可以找到孩子。

<强>取样

drawem <- function(n,max_len){
    cuts        = cumsum(2^(1:max_len-1))
    inds_by_g   = mapply(seq,cuts,cuts*2)

    oklens = (1:max_len)[ n <= 2^max_len*(1-2^(-(1:max_len)))+1 ]
    okinds = unlist(inds_by_g[oklens])

    mysamp = rep(0,n)
    for (i in 1:n){

        id        = if (length(okinds)==1) okinds else sample(okinds,1)
        g         = findInterval(id,cuts)
        gid       = id-cuts[g]+1
        nixed     = get_nixstrs(g,gid,max_len)

        # print(id); print(okinds); print(nixed)

        mysamp[i] = id
        okinds    = setdiff(okinds,nixed)
        if (!length(okinds)) break
    }

    res <- rep("",n)
    res[seq.int(i)] <- get_01str(mysamp[seq.int(i)],max_len)
    res
}

oklens部分整合了OP的想法,省略了保证不能进行采样的字符串。然而,即使这样做,我们也可能遵循一条不再有选择的采样路径。以OP max_len=4n=10的示例为例,我们知道我们必须放弃01,但如果我们的前四次抽奖是00会发生什么? ,011110?哦,我想我们运气不好。这就是您应该实际定义采样概率的原因。 (OP有另一个想法,用于在每个步骤确定哪些节点将导致不可能的状态,但这似乎是一个很高的顺序。)

<强>插图

# how the indices line up

n_pool = sum(2^(1:max_len)) 
pdt <- data.table(id=1:n_pool)
pdt[,g:=findInterval(id,cuts)]
pdt[,gid:=1:.N,by=g]
pdt[,s:=get_01str(id,max_len)]

# example run

set.seed(4); drawem(5,5)
# [1] "01100" "1"     "0001"  "0101"  "00101"

set.seed(4); drawem(8,4)
# [1] "1100" "0"    "111"  "101"  "1101" "100"  ""     ""  

基准(比@ BrodieG的答案更早)

require(rbenchmark)
max_len = 8
n = 8

benchmark(
      jos_lp     = {
        pool <- unlist(lapply(seq_len(max_len),
          function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
        sample.lp(pool, n)},
      bro_string = {pool <- make_pool(max_len);sample01(pool,n)},
      fra_num    = drawem(n,max_len),
      replications=5)[1:5]
#         test replications elapsed relative user.self
# 2 bro_string            5    0.05      2.5      0.05
# 3    fra_num            5    0.02      1.0      0.02
# 1     jos_lp            5    1.56     78.0      1.55

n = 12
max_len = 12
benchmark(
  bro_string={pool <- make_pool(max_len);sample01(pool,n)},
  fra_num=drawem(n,max_len),
  replications=5)[1:5]
#         test replications elapsed relative user.self
# 1 bro_string            5    0.54     6.75      0.51
# 2    fra_num            5    0.08     1.00      0.08

其他答案。还有另外两个答案:

jos_enum = {pool <- unlist(lapply(seq_len(max_len), 
    function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
  get.template(pool, n)}
bro_num  = sample011(max_len,n)    

我遗漏了@ josilber的枚举方法,因为它花了太长时间;和@ BrodieG的数字/索引方法,因为它当时不起作用,但现在确实如此。有关更多基准测试,请参阅@ BrodieG的更新答案。

速度与正确性。虽然@ josilber的答案要慢得多(对于枚举方法,显然会占用大量内存),但他们可以保证绘制一个大小为n的样本在第一次尝试。使用@ BrodieG的字符串方法或此答案,您将不得不一次又一次地重新取样,以期绘制完整的n。对于大max_len,我认为这不应该是一个问题。

此答案比bro_string缩放得更好,因为它不需要预先构建pool

答案 4 :(得分:13)

它是在python而不是r,但是jbaums说它没关系。

所以这是我的贡献,请参阅来源中的评论以解释关键部分 我仍在研究分析解决方案,以确定深度ct样本树的可能组合S的数量,因此我可以改进函数combs 。也许有人拥有它? 这真的是现在的瓶颈。

从我的笔记本电脑上的深度为16的树中抽取100个节点需要大约8毫秒。 这不是第一次,但是由于combBuffer被填充,所以你采样得越多就越快到某一点。

import random


class Tree(object):
    """
    :param level: The distance of this node from the root.
    :type level: int
    :param parent: This trees parent node
    :type parent: Tree
    :param isleft: Determines if this is a left or a right child node. Can be
                   omitted if this is the root node.
    :type isleft: bool

    A binary tree representing possible strings which match r'[01]{1,n}'. Its
    purpose is to be able to sample n of its nodes where none of the sampled
    nodes' ids is a prefix for another one.
    It is possible to change Tree.maxdepth and then reuse the root. All
    children are created ON DEMAND, which means everything is lazily evaluated.
    If the Tree gets too big anyway, you can call 'prune' on any node to delete
    its children.

        >>> t = Tree()
        >>> t.sample(8, toString=True, depth=3)
        ['111', '110', '101', '100', '011', '010', '001', '000']
        >>> Tree.maxdepth = 2
        >>> t.sample(4, toString=True)
        ['11', '10', '01', '00']
    """

    maxdepth = 10
    _combBuffer = {}

    def __init__(self, level=0, parent=None, isleft=None):
        self.parent = parent
        self.level = level
        self.isleft = isleft
        self._left = None
        self._right = None

    @classmethod
    def setMaxdepth(cls, depth):
        """
        :param depth: The new depth
        :type depth: int

        Sets the maxdepth of the Trees. This basically is the depth of the root
        node.
        """
        if cls.maxdepth == depth:
            return

        cls.maxdepth = depth

    @property
    def left(self):
        """This tree's left child, 'None' if this is a leave node"""
        if self.depth == 0:
            return None

        if self._left is None:
            self._left = Tree(self.level+1, self, True)
        return self._left

    @property
    def right(self):
        """This tree's right child, 'None' if this is a leave node"""
        if self.depth == 0:
            return None

        if self._right is None:
            self._right = Tree(self.level+1, self, False)
        return self._right

    @property
    def depth(self):
        """
        This tree's depth. (maxdepth-level)
        """
        return self.maxdepth-self.level

    @property
    def id(self):
        """
        This tree's id, string of '0's and '1's equal to the path from the root
        to this subtree. Where '1' means going left and '0' means going right.
        """
        # level 0 is the root node, it has no id
        if self.level == 0:
            return ''
        # This takes at most Tree.maxdepth recursions. Therefore
        # it is save to do it this way. We could also save each nodes
        # id once it is created to avoid recreating it every time, however
        # this won't save much time but use quite some space.
        return self.parent.id + ('1' if self.isleft else '0')

    @property
    def leaves(self):
        """
        The amount of leave nodes, this tree has. (2**depth)
        """
        return 2**self.depth

    def __str__(self):
        return self.id

    def __len__(self):
        return 2*self.leaves-1

    def prune(self):
        """
        Recursively prune this tree's children.
        """
        if self._left is not None:
            self._left.prune()
            self._left.parent = None
            self._left = None

        if self._right is not None:
            self._right.prune()
            self._right.parent = None
            self._right = None

    def combs(self, n):
        """
        :param n: The amount of samples to be taken from this tree
        :type n: int
        :returns: The amount of possible combinations to choose n samples from
                  this tree

        Determines recursively the amount of combinations of n nodes to be
        sampled from this tree.
        Subsequent calls with same n on trees with same depth will return the
        result from the previous computation rather than computing it again.

            >>> t = Tree()
            >>> Tree.maxdepth = 4
            >>> t.combs(16)
            1
            >>> Tree.maxdepth = 3
            >>> t.combs(6)
            58
        """

        # important for the amount of combinations is only n and the depth of
        # this tree
        key = (self.depth, n)

        # We use the dict to save computation time. Calling the function with
        # equal values on equal nodes just returns the alrady computed value if
        # possible.
        if key not in Tree._combBuffer:
            leaves = self.leaves

            if n < 0:
                N = 0
            elif n == 0 or self.depth == 0 or n == leaves:
                N = 1
            elif n == 1:
                return (2*leaves-1)
            else:
                if n > leaves/2:
                    # if n > leaves/2, at least n-leaves/2 have to stay on
                    # either side, otherweise the other one would have to
                    # sample more nodes than possible.
                    nMin = n-leaves/2
                else:
                    nMin = 0

                # The rest n-2*nMin is the amount of samples that are free to
                # fall on either side
                free = n-2*nMin

                N = 0
                # sum up the combinations of all possible splits
                for addLeft in range(0, free+1):
                    nLeft = nMin + addLeft
                    nRight = n - nLeft
                    N += self.left.combs(nLeft)*self.right.combs(nRight)

            Tree._combBuffer[key] = N
            return N
        return Tree._combBuffer[key]

    def sample(self, n, toString=False, depth=None):
        """
        :param n: How may samples to take from this tree
        :type n: int
        :param toString: If 'True' result will direclty be turned into a list
                         of strings
        :type toString: bool
        :param depth: If not None, will overwrite Tree.maxdepth
        :type depth: int or None
        :returns: List of n nodes sampled from this tree
        :throws ValueError: when n is invalid

        Takes n random samples from this tree where none of the sample's ids is
        a prefix for another one's.

        For an example see Tree's docstring.
        """
        if depth is not None:
            Tree.setMaxdepth(depth)

        if toString:
            return [str(e) for e in self.sample(n)]

        if n < 0:
            raise ValueError('Negative sample size is not possible!')

        if n == 0:
            return []

        leaves = self.leaves
        if n > leaves:
            raise ValueError(('Cannot sample {} nodes, with only {} ' +
                              'leaves!').format(n, leaves))

        # Only one sample to choose, that is nice! We are free to take any node
        # from this tree, including this very node itself.
        if n == 1 and self.level > 0:
            # This tree has 2*leaves-1 nodes, therefore
            # the probability that we keep the root node has to be
            # 1/(2*leaves-1) = P_root. Lets create a random number from the
            # interval [0, 2*leaves-1).
            # It will be 0 with probability 1/(2*leaves-1)
            P_root = random.randint(0, len(self)-1)
            if P_root == 0:
                return [self]
            else:
                # The probability to land here is 1-P_root

                # A child tree's size is (leaves-1) and since it obeys the same
                # rule as above, the probability for each of its nodes to
                # 'survive' is 1/(leaves-1) = P_child.
                # However all nodes must have equal probability, therefore to
                # make sure that their probability is also P_root we multiply
                # them by 1/2*(1-P_root). The latter is already done, the
                # former will be achieved by the next condition.
                # If we do everything right, this should hold:
                # 1/2 * (1-P_root) * P_child = P_root

                # Lets see...
                # 1/2 * (1-1/(2*leaves-1)) * (1/leaves-1)
                # (1-1/(2*leaves-1)) * (1/(2*(leaves-1)))
                # (1-1/(2*leaves-1)) * (1/(2*leaves-2))
                # (1/(2*leaves-2)) - 1/((2*leaves-2) * (2*leaves-1))
                # (2*leaves-1)/((2*leaves-2) * (2*leaves-1)) - 1/((2*leaves-2) * (2*leaves-1))
                # (2*leaves-2)/((2*leaves-2) * (2*leaves-1))
                # 1/(2*leaves-1)
                # There we go!
                if random.random() < 0.5:
                    return self.right.sample(1)
                else:
                    return self.left.sample(1)

        # Now comes the tricky part... n > 1 therefore we are NOT going to
        # sample this node. Its probability to be chosen is 0!
        # It HAS to be 0 since we are definitely sampling from one of its
        # children which means that this node will be blocked by those samples.
        # The difficult part now is to prove that the sampling the way we do it
        # is really random.

        if n > leaves/2:
            # if n > leaves/2, at least n-leaves/2 have to stay on either
            # side, otherweise the other one would have to sample more
            # nodes than possible.
            nMin = n-leaves/2
        else:
            nMin = 0
        # The rest n-2*nMin is the amount of samples that are free to fall
        # on either side
        free = n-2*nMin

        # Let's have a look at an example, suppose we were to distribute 5
        # samples among two children which have 4 leaves each.
        # Each child has to get at least 1 sample, so the free samples are 3.
        # There are 4 different ways to split the samples among the
        # children (left, right):
        # (1, 4), (2, 3), (3, 2), (4, 1)
        # The amount of unique sample combinations per child are
        # (7, 1), (11, 6), (6, 11), (1, 7)
        # The amount of total unique samples per possible split are
        #   7   ,   66  ,   66  ,    7
        # In case of the first and last split, all samples have a probability
        # of 1/7, this was already proven above.
        # Lets suppose we are good to go and the per sample probabilities for
        # the other two cases are (1/11, 1/6) and (1/6, 1/11), this way the
        # overall per sample probabilities for the splits would be:
        #  1/7  ,  1/66 , 1/66 , 1/7
        # If we used uniform random to determine the split, all splits would be
        # equally probable and therefore be multiplied with the same value (1/4)
        # But this would mean that NOT every sample is equally probable!
        # We need to know in advance how many sample combinations there will be
        # for a given split in order to find out the probability to choose it.
        # In fact, due to the restrictions, this becomes very nasty to
        # determine. So instead of solving it analytically, I do it numerically
        # with the method 'combs'. It gives me the amount of possible sample
        # combinations for a certain amount of samples and a given tree depth.
        # It will return 146 for this node and 7 for the outer and 66 for the
        # inner splits.
        # What we now do is, we take a number from [0, 146).
        # if it is smaller than 7, we sample from the first split,
        # if it is smaller than 7+66, we sample from the second split,
        # ...
        # This way we get the probabilities we need.

        r = random.randint(0, self.combs(n)-1)
        p = 0
        for addLeft in xrange(0, free+1):
            nLeft = nMin + addLeft
            nRight = n - nLeft

            p += (self.left.combs(nLeft) * self.right.combs(nRight))
            if r < p:
                return self.left.sample(nLeft) + self.right.sample(nRight)
        assert False, ('Something really strange happend, p did not sum up ' +
                       'to combs or r was too big')


def main():
    """
    Do a microbenchmark.
    """
    import timeit
    i = 1
    main.t = Tree()
    template = ' {:>2}  {:>5} {:>4}  {:<5}'
    print(template.format('i', 'depth', 'n', 'time (ms)'))
    N = 100
    for depth in [4, 8, 15, 16, 17, 18]:
        for n in [10, 50, 100, 150]:
            if n > 2**depth:
                time = '--'
            else:
                time = timeit.timeit(
                    'main.t.sample({}, depth={})'.format(n, depth), setup=
                    'from __main__ import main', number=N)*1000./N
            print(template.format(i, depth, n, time))
            i += 1


if __name__ == "__main__":
    main()

基准输出:

  i  depth    n  time (ms)
  1      4   10  0.182511806488
  2      4   50  --   
  3      4  100  --   
  4      4  150  --   
  5      8   10  0.397620201111
  6      8   50  1.66054964066
  7      8  100  2.90236949921
  8      8  150  3.48146915436
  9     15   10  0.804011821747
 10     15   50  3.7428188324
 11     15  100  7.34910964966
 12     15  150  10.8230614662
 13     16   10  0.804491043091
 14     16   50  3.66818904877
 15     16  100  7.09567070007
 16     16  150  10.404779911
 17     17   10  0.865840911865
 18     17   50  3.9999294281
 19     17  100  7.70257949829
 20     17  150  11.3758206367
 21     18   10  0.915451049805
 22     18   50  4.22935962677
 23     18  100  8.22361946106
 24     18  150  12.2081303596

来自深度为10的树的10个10号样本:

['1111010111', '1110111010', '1010111010', '011110010', '0111100001', '011101110', '01110010', '01001111', '0001000100', '000001010']
['110', '0110101110', '0110001100', '0011110', '0001111011', '0001100010', '0001100001', '0001100000', '0000011010', '0000001111']
['11010000', '1011111101', '1010001101', '1001110001', '1001100110', '10001110', '011111110', '011001100', '0101110000', '001110101']
['11111101', '110111', '110110111', '1101010101', '1101001011', '1001001100', '100100010', '0100001010', '0100000111', '0010010110']
['111101000', '1110111101', '1101101', '1101000000', '1011110001', '0111111101', '01101011', '011010011', '01100010', '0101100110']
['1111110001', '11000110', '1100010100', '101010000', '1010010001', '100011001', '100000110', '0100001111', '001101100', '0001101101']
['111110010', '1110100', '1101000011', '101101', '101000101', '1000001010', '0111100', '0101010011', '0101000110', '000100111']
['111100111', '1110001110', '1100111111', '1100110010', '11000110', '1011111111', '0111111', '0110000100', '0100011', '0010110111']
['1101011010', '1011111', '1011100100', '1010000010', '10010', '1000010100', '0111011111', '01010101', '001101', '000101100']
['111111110', '111101001', '1110111011', '111011011', '1001011101', '1000010100', '0111010101', '010100110', '0100001101', '0010000000']

答案 5 :(得分:11)

如果您不想生成所有可能元组的集合,然后随机采样(对于大输入大小,您可能会认为这是不可行的),另一个选项是使用整数编程绘制单个样本。基本上,您可以为pool中的每个元素分配一个随机值,然后选择具有最大值总和的可行元组。这应该给每个元组选择相同的概率,因为它们都是相同的大小,并且它们的值是随机选择的。模型的约束将确保不会选择任何不允许的元组对,并且选择了正确数量的元素。

以下是lpSolve包的解决方案:

library(lpSolve)
sample.lp <- function(pool, max_len) {
  pool <- sort(pool)
  pml <- max(nchar(pool))
  runs <- c(rev(cumsum(2^(seq(pml-1)))), 0)
  banned.from <- rep(seq(pool), runs[nchar(pool)])
  banned.to <- banned.from + unlist(lapply(runs[nchar(pool)], seq_len))
  banned.constr <- matrix(0, nrow=length(banned.from), ncol=length(pool))
  banned.constr[cbind(seq(banned.from), banned.from)] <- 1
  banned.constr[cbind(seq(banned.to), banned.to)] <- 1
  mod <- lp(direction="max",
            objective.in=runif(length(pool)),
            const.mat=rbind(banned.constr, rep(1, length(pool))),
            const.dir=c(rep("<=", length(banned.from)), "=="),
            const.rhs=c(rep(1, length(banned.from)), max_len),
            all.bin=TRUE)
  pool[which(mod$solution == 1)]
}
set.seed(144)
pool <- unlist(lapply(seq_len(4), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
sample.lp(pool, 4)
# [1] "0011" "010"  "1000" "1100"
sample.lp(pool, 8)
# [1] "0000" "0100" "0110" "1001" "1010" "1100" "1101" "1110"

这似乎可以扩展到相当大的池。例如,从510大小的池中获取长度为20的样本需要2秒多一点时间:

pool <- unlist(lapply(seq_len(8), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
length(pool)
# [1] 510
system.time(sample.lp(pool, 20))
#    user  system elapsed 
#   0.232   0.008   0.239 

如果您真的需要解决非常大的问题规模,那么您可以从lpSolve附带的非开源解算器转移到像gurobi或cplex这样的商业求解器(一般不是免费的,但是免费的学术用途)。

答案 6 :(得分:9)

一种方法是使用迭代方法简单地生成适当大小的所有可能元组:

  1. 构建大小为1的所有元组(pool中的所有元素)
  2. 使用pool
  3. 中的元素制作交叉产品
  4. 使用pool多次使用相同元素
  5. 删除任何元组
  6. 删除另一个元组的任何完全重复
  7. 删除任何不能一起使用的元组
  8. 冲洗并重复,直到获得适当的元组大小
  9. 对于给定的尺寸(pool长度为30,max_len 4),这是可以运行的:

    get.template <- function(pool, max_len) {
      banned <- which(outer(paste0('^', pool), pool, Vectorize(grepl)), arr.ind=T)
      banned <- banned[banned[,1] != banned[,2],]
      banned <- paste(banned[,1], banned[,2])
      vals <- matrix(seq(length(pool)))
      for (k in 2:max_len) {
        vals <- cbind(vals[rep(1:nrow(vals), each=length(pool)),],
                      rep(1:length(pool), nrow(vals)))
        # Can't sample same value more than once
        vals <- vals[apply(vals, 1, function(x) length(unique(x)) == length(x)),]
        # Sort rows to ensure unique only
        vals <- t(apply(vals, 1, sort))
        vals <- unique(vals)
        # Can't have banned pair
        combos <- combn(ncol(vals), 2)
        for (k in seq(ncol(combos))) {
            c1 <- combos[1,k]
            c2 <- combos[2,k]
            vals <- vals[!paste(vals[,c1], vals[,c2]) %in% banned,]
        }
      }
      return(matrix(pool[vals], nrow=nrow(vals)))
    }
    
    max_len <- 4
    pool <- unlist(lapply(seq_len(max_len), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
    system.time(template <- get.template(pool, 4))
    #   user  system elapsed 
    #  4.549   0.050   4.614 
    

    现在,您可以根据需要从template的行中进行多次采样(这将非常快),这与从定义的空间中随机采样相同。

答案 7 :(得分:9)

简介

我发现这个问题非常有趣,我觉得有必要考虑一下,最终提供我自己的答案。由于我到达的算法没有立即跟随问题描述,我将首先解释我是如何得到这个解决方案,然后在C ++中提供一个示例实现(我从未写过R)。

解决方案的开发

第一眼

阅读问题描述最初令人困惑,但是当我看到编辑树的图片时,我立即理解了问题,我的直觉表明二叉树也是一个解决方案:构建一个树(一个集合大小为1的树木,并在选择后消除树枝和祖先后将树木打成一堆较小的树木。

虽然这看起来很好,但选择过程和维护收集会很痛苦。不过,树似乎应该在任何解决方案中发挥重要作用。

修订版1

不要打破这棵树。而是在每个节点处具有布尔数据有效载荷,指示它是否已被消除。这只留下一棵保持形式的树。

但请注意,这不仅仅是任何二叉树,它实际上是一个深度为max_len-1的完整二叉树。

修订版2

完整的二叉树可以很好地表示为数组。典型的数组表示使用树的广度优先搜索,具有以下属性:

Let x be the array index.
x = 0 is the root of the entire tree
left_child(x) = 2x + 1
right_child(x) = 2x + 2
parent(x) = floor((n-1)/2)

在下图中,每个节点都标有其数组索引: breadth-first

作为一个数组,这占用更少的内存(没有更多的指针),使用的内存是连续的(适用于缓存),并且可以完全在堆栈而不是堆上(假设您的语言为您提供了选择)。当然,这里适用一些条件,特别是阵列的大小。我稍后会回来。

与修订版1中一样,存储在数组中的数据将是布尔值:true表示可用,false表示不可用。由于根节点实际上不是有效选择,因此应将索引0初始化为false。如何进行选择的问题仍然存在:

由于指标已经被删除,因此记录有多少被消除,以及剩下多少被删除是微不足道的。选择该范围内的随机数,然后遍历数组,直到看到许多指标设置为true(包括当前索引)。到达的指数是选择。 选择,直到选择了n个指示,或者没有任何内容可供选择。

这是一个完整的算法,但是在选择过程中还有改进的余地,并且还存在一些尚未解决的实际尺寸问题:数组大小为O(2 ^ N)。随着n变大,首先缓存效益消失,然后数据开始被分页到磁盘,并且在某些时候根本无法存储。

修订版3

我决定先解决更容易的问题:改进选择过程。

从左到右扫描阵列非常浪费。跟踪已被消除的范围而不是连续检查和发现几个故障可能更有效。然而,我们的树表示对此并不理想,因为每轮中将被淘汰的节点很少在阵列中是连续的。

通过重新排列数组映射到树的方式,可以更好地利用它。特别是,让我们使用预订深度优先搜索而不是广度优先搜索。为了做到这一点,树需要固定大小,这就是这个问题的情况。儿童节点和父节点的索引如何以数学方式连接也不太明显。

depth-first

通过使用这种安排,每个不是叶子的选择都可以保证消除一个连续的范围:它的子树。

修订版4

通过跟踪消除的范围,不再需要真/假数据,因此根本不需要数组或树。 在每次随机抽取时,消除的范围可用于快速找到要选择的节点。所有祖先和整个子树都被删除,可以表示为可以轻松与其他子集合的范围。

最后一项任务是将所选节点转换为OP所需的字符串表示形式。这很容易,因为这个二叉树仍然保持严格的顺序:从根遍历,所有元素&gt; =右边的孩子在右边,其他元素在左边。因此,搜索树将通过附加&#39; 0&#39;提供祖先列表和二进制字符串。向左走时;或者&#39; 1&#39;当正确行进时。

示例实施

#include <stdint.h>
#include <algorithm>
#include <cmath>
#include <list>
#include <deque>
#include <ctime>
#include <cstdlib>
#include <iostream>

/*
 * A range of values of the form (a, b), where a <= b, and is inclusive.
 * Ex (1,1) is the range from 1 to 1 (ie: just 1)
 */
class Range
{
private:
    friend bool operator< (const Range& lhs, const Range& rhs);
    friend std::ostream& operator<<(std::ostream& os, const Range& obj);

    int64_t m_start;
    int64_t m_end;

public:
    Range(int64_t start, int64_t end) : m_start(start), m_end(end) {}
    int64_t getStart() const { return m_start; }
    int64_t getEnd() const { return m_end; }
    int64_t size() const { return m_end - m_start + 1; }
    bool canMerge(const Range& other) const {
        return !((other.m_start > m_end + 1) || (m_start > other.m_end + 1));
    }
    int64_t merge(const Range& other) {
        int64_t change = 0;
        if (m_start > other.m_start) {
            change += m_start - other.m_start;
            m_start = other.m_start;
        }
        if (other.m_end > m_end) {
            change += other.m_end - m_end;
            m_end = other.m_end;
        }
        return change;
    }
};

inline bool operator< (const Range& lhs, const Range& rhs){return lhs.m_start < rhs.m_start;}
std::ostream& operator<<(std::ostream& os, const Range& obj) {
    os << '(' << obj.m_start << ',' << obj.m_end << ')';
    return os;
}

/*
 * Stuct to allow returning of multiple values
 */
struct NodeInfo {
    int64_t subTreeSize;
    int64_t depth;
    std::list<int64_t> ancestors;
    std::string representation;
};

/*
 * Collection of functions representing a complete binary tree
 * as an array created using pre-order depth-first search,
 * with 0 as the root.
 * Depth of the root is defined as 0.
 */
class Tree
{
private:
    int64_t m_depth;
public:
    Tree(int64_t depth) : m_depth(depth) {}
    int64_t size() const {
        return (int64_t(1) << (m_depth+1))-1;
    }
    int64_t getDepthOf(int64_t node) const{
        if (node == 0) { return 0; }
        int64_t searchDepth = m_depth;
        int64_t currentDepth = 1;
        while (true) {
            int64_t rightChild = int64_t(1) << searchDepth;
            if (node == 1 || node == rightChild) {
                break;
            } else if (node > rightChild) {
                node -= rightChild;
            } else {
                node -= 1;
            }
            currentDepth += 1;
            searchDepth -= 1;
        }
        return currentDepth;
    }
    int64_t getSubtreeSizeOf(int64_t node, int64_t nodeDepth = -1) const {
        if (node == 0) {
            return size();
        }
        if (nodeDepth == -1) {
            nodeDepth = getDepthOf(node);
        }
        return (int64_t(1) << (m_depth + 1 - nodeDepth)) - 1;
    }
    int64_t getLeftChildOf(int64_t node, int64_t nodeDepth = -1) const {
        if (nodeDepth == -1) {
            nodeDepth = getDepthOf(node);
        }
        if (nodeDepth == m_depth) { return -1; }
        return node + 1;
    }
    int64_t getRightChildOf(int64_t node, int64_t nodeDepth = -1) const {
        if (nodeDepth == -1) {
            nodeDepth = getDepthOf(node);
        }
        if (nodeDepth == m_depth) { return -1; }
        return node + 1 + ((getSubtreeSizeOf(node, nodeDepth) - 1) / 2);
    }
    NodeInfo getNodeInfo(int64_t node) const {
        NodeInfo info;
        int64_t depth = 0;
        int64_t currentNode = 0;
        while (currentNode != node) {
            if (currentNode != 0) {
                info.ancestors.push_back(currentNode);
            }
            int64_t rightChild = getRightChildOf(currentNode, depth);
            if (rightChild == -1) {
                break;
            } else if (node >= rightChild) {
                info.representation += '1';
                currentNode = rightChild;
            } else {
                info.representation += '0';
                currentNode = getLeftChildOf(currentNode, depth);
            }
            depth++;
        }
        info.depth = depth;
        info.subTreeSize = getSubtreeSizeOf(node, depth);
        return info;
    }
};

// random selection amongst remaining allowed nodes
int64_t selectNode(const std::deque<Range>& eliminationList, int64_t poolSize, std::mt19937_64& randomGenerator)
{
    std::uniform_int_distribution<> randomDistribution(1, poolSize);
    int64_t selection = randomDistribution(randomGenerator);
    for (auto const& range : eliminationList) {
        if (selection >= range.getStart()) { selection += range.size(); }
        else { break; }
    }
    return selection;
}

// determin how many nodes have been elimintated
int64_t countEliminated(const std::deque<Range>& eliminationList)
{
    int64_t count = 0;
    for (auto const& range : eliminationList) {
        count += range.size();
    }
    return count;
}

// merge all the elimination ranges to listA, and return the number of new elimintations
int64_t mergeEliminations(std::deque<Range>& listA, std::deque<Range>& listB) {
    if(listB.empty()) { return 0; }
    if(listA.empty()) {
        listA.swap(listB);
        return countEliminated(listA);
    }

    int64_t newEliminations = 0;
    int64_t x = 0;
    auto listA_iter = listA.begin();
    auto listB_iter = listB.begin();
    while (listB_iter != listB.end()) {
        if (listA_iter == listA.end()) {
            listA_iter = listA.insert(listA_iter, *listB_iter);
            x = listB_iter->size();
            assert(x >= 0);
            newEliminations += x;
            ++listB_iter;
        } else if (listA_iter->canMerge(*listB_iter)) {
            x = listA_iter->merge(*listB_iter);
            assert(x >= 0);
            newEliminations += x;
            ++listB_iter;
        } else if (*listB_iter < *listA_iter) {
            listA_iter = listA.insert(listA_iter, *listB_iter) + 1;
            x = listB_iter->size();
            assert(x >= 0);
            newEliminations += x;
            ++listB_iter;
        } else if ((listA_iter+1) != listA.end() && listA_iter->canMerge(*(listA_iter+1))) {
            listA_iter->merge(*(listA_iter+1));
            listA_iter = listA.erase(listA_iter+1);
        } else {
            ++listA_iter;
        }
    }
    while (listA_iter != listA.end()) {
        if ((listA_iter+1) != listA.end() && listA_iter->canMerge(*(listA_iter+1))) {
            listA_iter->merge(*(listA_iter+1));
            listA_iter = listA.erase(listA_iter+1);
        } else {
            ++listA_iter;
        }
    }
    return newEliminations;
}

int main (int argc, char** argv)
{
    std::random_device rd;
    std::mt19937_64 randomGenerator(rd());

    int64_t max_len = std::stoll(argv[1]);
    int64_t num_samples = std::stoll(argv[2]);

    int64_t samplesRemaining = num_samples;
    Tree tree(max_len);
    int64_t poolSize = tree.size() - 1;
    std::deque<Range> eliminationList;
    std::deque<Range> eliminated;
    std::list<std::string> foundList;

    while (samplesRemaining > 0 && poolSize > 0) {
        // find a valid node
        int64_t selectedNode = selectNode(eliminationList, poolSize, randomGenerator);
        NodeInfo info = tree.getNodeInfo(selectedNode);
        foundList.push_back(info.representation);
        samplesRemaining--;

        // determine which nodes this choice eliminates
        eliminated.clear();
        for( auto const& ancestor : info.ancestors) {
            Range r(ancestor, ancestor);
            if(eliminated.empty() || !eliminated.back().canMerge(r)) {
                eliminated.push_back(r);
            } else {
                eliminated.back().merge(r);
            }
        }
        Range r(selectedNode, selectedNode + info.subTreeSize - 1);
        if(eliminated.empty() || !eliminated.back().canMerge(r)) {
            eliminated.push_back(r);
        } else {
            eliminated.back().merge(r);
        }

        // add the eliminated nodes to the existing list
        poolSize -= mergeEliminations(eliminationList, eliminated);
    }

    // Print some stats
    // std::cout << "tree: " << tree.size() << " samplesRemaining: "
    //                       << samplesRemaining << " poolSize: "
    //                       << poolSize << " samples: " << foundList.size()
    //                       << " eliminated: "
    //                       << countEliminated(eliminationList) << std::endl;

    // Print list of binary strings
    // std::cout << "list:";
    // for (auto const& s : foundList) {
    //  std::cout << " " << s;
    // }
    // std::cout << std::endl;
}

其他想法

此算法对max_len的扩展性非常好。使用n进行缩放并不是很好,但根据我自己的分析,它似乎比其他解决方案做得更好。

可以轻松修改此算法,以允许包含超过仅仅&#39; 0&#39;的字符串。和&#39; 1&#39;。 中更多可能的字母会增加树的扇出,并且会在每次选择时消除更宽的范围 - 仍然会使每个子树中的所有节点保持连续。 / p>