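Random selection based on a given weight vector

Date: 2012-12-06 22:43:00

Tags: java algorithm random weka sample

I am reading the Weka implementation that resamples an array according to a given weight vector. After reading the code, I am not sure what algorithm the implementation is based on. I am also confused by these two lines: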

  Utils.normalize(probabilities, sumProbs / sumOfWeights);

// Make sure that rounding errors don't mess things up
probabilities[numInstances() - 1] = sumOfWeights;

I don't know what they are for. Here is the code, copied from Weka:

Instances weka::core::Instances::resampleWithWeights(Random random, double[] weights)
{
  if (weights.length != numInstances()) {
    throw new IllegalArgumentException("weights.length != numInstances.");
  }
  Instances newData = new Instances(this, numInstances());
  if (numInstances() == 0) {
    return newData;
  }
  double[] probabilities = new double[numInstances()];
  double sumProbs = 0, sumOfWeights = Utils.sum(weights);
  for (int i = 0; i < numInstances(); i++) {
    sumProbs += random.nextDouble();
    probabilities[i] = sumProbs;
  }
  Utils.normalize(probabilities, sumProbs / sumOfWeights);

  // Make sure that rounding errors don't mess things up
  probabilities[numInstances() - 1] = sumOfWeights;
  int k = 0; int l = 0;
  sumProbs = 0;
  while ((k < numInstances() && (l < numInstances()))) {
    if (weights[l] < 0) {
      throw new IllegalArgumentException("Weights have to be positive.");
    }
    sumProbs += weights[l];
    while ((k < numInstances()) &&
           (probabilities[k] <= sumProbs)) {
      newData.add(instance(l));
      newData.instance(k).setWeight(1);
      k++;
    }
    l++;
  }
  return newData;
}

1 answer:

Answer 0 (score: 0)

The first code snippet:

Utils.normalize(probabilities, sumProbs / sumOfWeights);

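divides each element of probabilities by the second argument. This converts probabilities from an array whose largest element is sumProbs into an array whose largest element is sumOfWeights. The second snippet: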

probabilities[numInstances() - 1] = sumOfWeights;

just makes sure that the last (largest) element is actually sumOfWeights and has not been thrown off by some rounding error.
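To see what those two lines do to the array, here is a minimal standalone sketch. It does not use Weka: the class NormalizeSketch and the method randomCutPoints are names made up for illustration, and normalize is a stand-in that assumes Weka's Utils.normalize(double[], double) simply divides every element by its second argument, as described above.

```java
import java.util.Arrays;
import java.util.Random;

public class NormalizeSketch {

    // Stand-in for Utils.normalize(double[], double) (assumed behavior):
    // divide every element of the array by the given divisor.
    static void normalize(double[] values, double divisor) {
        if (Double.isNaN(divisor) || divisor == 0) {
            throw new IllegalArgumentException("Cannot normalize by NaN or zero.");
        }
        for (int i = 0; i < values.length; i++) {
            values[i] /= divisor;
        }
    }

    // Hypothetical helper: builds n increasing random values that end at sumOfWeights.
    // Assumes n > 0 and sumOfWeights > 0.
    static double[] randomCutPoints(Random random, int n, double sumOfWeights) {
        double[] probabilities = new double[n];
        double sumProbs = 0;
        for (int i = 0; i < n; i++) {
            sumProbs += random.nextDouble();   // running sum, so values only grow
            probabilities[i] = sumProbs;       // largest element is sumProbs
        }
        // Rescale so the largest element becomes (approximately) sumOfWeights.
        normalize(probabilities, sumProbs / sumOfWeights);
        // Pin the last element to exactly sumOfWeights to cancel rounding error.
        probabilities[n - 1] = sumOfWeights;
        return probabilities;
    }

    public static void main(String[] args) {
        double[] cuts = randomCutPoints(new Random(42), 5, 10.0);
        System.out.println(Arrays.toString(cuts));
    }
}
```

Whatever the seed, the printed values should be increasing and the last one should be exactly 10.0.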

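EDIT: Here is a theory of how the whole method works. The first half (up to the declarations of k and l) generates probabilities as a vector of random numbers (not independent of one another) that are increasing, with the last one equal to the sum of the weights. This is a random partition of the interval [0, sumOfWeights]. The weights themselves also define a partition of the same interval. Implicitly, each existing instance corresponds to one element (one subinterval) of the weights partition.

The second half of the method simply walks along the weights partition (using the index l). It samples the l-th instance as many times as random partition boundaries fall inside the corresponding weight interval. I realize this explanation is a bit clumsy. Perhaps a picture of what is going on will help: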

0                                                   sumOfWeights
↓                                                       ↓

|     *   *         *       *               * *     *   * ← Random partition
|    ^      ^           ^      ^     ^     ^         ^  ^ ← Weights partition

   0     2        1        1       0     0       3     1  ← # of samples

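The second half of the method just counts how many random partition boundaries (marked with *) fall in each weight interval (delimited by ^). A little thought should convince you that this is a valid way to sample randomly according to the given weights.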
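The counting step can be sketched without Weka as follows. This is only an illustration of the idea described above, not the library's code: the class WeightedResampleSketch and the method sampleCounts are hypothetical names, the method returns how many times each index would be drawn instead of building an Instances object, and the check that the weights sum to a positive number is an addition of this sketch.

```java
import java.util.Arrays;
import java.util.Random;

public class WeightedResampleSketch {

    // Hypothetical helper: returns, for each index l, how many of the
    // weights.length draws land on l.
    static int[] sampleCounts(double[] weights, Random random) {
        int n = weights.length;
        int[] counts = new int[n];
        if (n == 0) {
            return counts;
        }
        double sumOfWeights = 0;
        for (double w : weights) {
            if (w < 0) {
                throw new IllegalArgumentException("Weights must not be negative.");
            }
            sumOfWeights += w;
        }
        if (!(sumOfWeights > 0)) {
            // Guard added in this sketch only.
            throw new IllegalArgumentException("Weights must sum to a positive number.");
        }

        // First half: random partition of [0, sumOfWeights].
        double[] cuts = new double[n];
        double sumProbs = 0;
        for (int i = 0; i < n; i++) {
            sumProbs += random.nextDouble();
            cuts[i] = sumProbs;
        }
        double scale = sumProbs / sumOfWeights;
        for (int i = 0; i < n; i++) {
            cuts[i] /= scale;
        }
        cuts[n - 1] = sumOfWeights;  // cancel rounding error on the last boundary

        // Second half: walk the weights partition and count the random
        // boundaries that fall inside each weight interval.
        int k = 0;
        double cumulative = 0;
        for (int l = 0; l < n && k < n; l++) {
            cumulative += weights[l];  // right end of the l-th weight interval
            while (k < n && cuts[k] <= cumulative) {
                counts[l]++;
                k++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] weights = {0.0, 1.0, 0.0, 3.0};
        System.out.println(Arrays.toString(sampleCounts(weights, new Random(7))));
    }
}
```

In this example the indices with weight 0 should never be picked, and the counts should add up to the number of weights, because every random boundary lands in exactly one weight interval.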