Question

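I have a series of text files of various (and often changing) lengths. The data in each file is arranged in a specific order, from most common (top) to least common (bottom). I want to randomly select a line from a file, weighted toward the top entries. For example, if a file has 322 entries, line 1 should be 322 times more likely to be chosen than line 322.

I have been appending the contents of the file to a list to get its length with the len function and then approaching it as a math problem, but I am wondering (hoping) whether Python has a smarter way to accomplish this?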

Answer 1

Assuming you store the values in a dictionary like this:

cities = {
    'City1': 3298181,
    'City2': 3013491,
    'City3': 900129,
    ...
}

You can do this with the random library:

from random import choice

choice([k for k in cities for x in xrange(cities[k])])

Explanation:

The list comprehension inside choice builds a list in which each city name is repeated as many times as there are people living there.

Example:

 >>> cities = {'1': 3, '2': 1}
 >>> [k for k in cities for x in xrange(cities[k])]
 ['1', '1', '1', '2']

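Be careful with this approach: if there are many cities, each with many people, the list becomes very large.

Also, in Python 2, don't use range() in place of xrange(). range() is not lazy there: it builds the whole list in memory, and with this much data it can freeze your computer. (In Python 3, xrange() no longer exists and range() is lazy.)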
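If Python 3.6 or later is available, the memory problem described above can be sidestepped by passing the weights directly to random.choices instead of expanding the list. A minimal sketch, assuming the same kind of dictionary as above (pick_city and the rng parameter are names made up for this illustration):

```python
import random

# Hypothetical example data: city name -> population (used as the weight).
cities = {"City1": 3298181, "City2": 3013491, "City3": 900129}

def pick_city(population_by_city, rng=random):
    """Pick one key with probability proportional to its value."""
    names = list(population_by_city)
    weights = [population_by_city[name] for name in names]
    # random.choices returns a list of k items; k defaults to 1.
    return rng.choices(names, weights=weights)[0]

print(pick_city(cities))
```

This keeps only one entry per city in memory, however large the populations are.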

Answer 2

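The accepted answer doesn't seem to match the OP's requirements as written (although it may in practice), so here is another answer that addresses the general problem of randomly selecting a line from a file with weighted probabilities. It is based on the random module examples in the Python 3 documentation.

In this case, line 1 of the file should be chosen with greater probability than the last line, with decreasing probability for the lines in between, so our weights are range(n, 0, -1), where n is the number of lines in the file. For example, if the file has 5 lines, the weights are [5, 4, 3, 2, 1], which corresponds to the following probabilities: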

>>> weights = range(5, 0, -1)
>>> total_weights = float(sum(weights))
>>> probabilities = [w/total_weights for w in weights]
>>> [round(p, 5) for p in probabilities]    # rounded for readability
[0.33333, 0.26667, 0.2, 0.13333, 0.06667]

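So the first line is 5 times as likely to be chosen as the last line, and the probability decreases with each successive line.

Next, we need to build a cumulative distribution from the weights, pick a random value within that distribution, locate that value in the distribution, and use its position to retrieve a line from the file. Here is some code.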
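More generally, for a file with n lines and weights n, n-1, ..., 1, the weights sum to n(n+1)/2, so line i is chosen with probability (n - i + 1) / (n(n+1)/2). A small sketch to check this with exact fractions rather than rounded floats (line_probabilities is a name made up for this illustration):

```python
from fractions import Fraction

def line_probabilities(n):
    """Exact selection probability for lines 1..n with weights n, n-1, ..., 1."""
    total = n * (n + 1) // 2          # sum of the weights 1..n
    return [Fraction(n - i + 1, total) for i in range(1, n + 1)]

probs = line_probabilities(5)
print([str(p) for p in probs])        # exact fractions for the 5-line example
print(probs[0] / probs[-1])           # ratio of the first line to the last line
```

The ratio between the first and last line always equals n, which is the 322-to-1 behaviour asked for in the question when n is 322.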

import bisect
import random
try:
    from itertools import accumulate     # Python >= 3.2
except ImportError:
    def accumulate(weights):
        accumulator = 0
        for w in weights:
            accumulator += w
            yield accumulator

def count(iterable):
    return sum(1 for elem in iterable)

def get_nth(iterable, n):
    assert isinstance(n, int), "n must be an integer, got %r" % type(n)
    assert n > 0, "n must be greater than 0, got %r" % n
    for i, elem in enumerate(iterable, 1):
        if i == n:
            return elem

def weighted_select(filename):
    with open(filename) as f:
        n = count(f)
        if n == 0:
            return None

        # set up cumulative distribution
        weights = range(n, 0, -1)
        cumulative_dist = list(accumulate(weights))

        # select line number
        x = random.random() * cumulative_dist[-1]
        selected_line = bisect.bisect(cumulative_dist, x)

        # retrieve line from file
        f.seek(0)
        return get_nth(f, selected_line + 1)    # N.B. +1 for nth line

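This uses weights according to my interpretation of the question. It is easy to adapt to other weights: for example, if you want a weighted selection with city populations as the weights, just change weights = range(n, 0, -1) to a list of populations corresponding to each line in the file.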
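As a rough sketch of that adaptation, the same cumulative-distribution idea can take an explicit list of weights. The version below works on lines already read into memory and assumes a hypothetical "name,population" line format. weighted_pick and rng are names invented for this example, and the weights are assumed to be positive:

```python
import bisect
import random
from itertools import accumulate

def weighted_pick(lines, weights, rng=random):
    """Return one element of lines, chosen with probability proportional to weights."""
    if len(lines) != len(weights):
        raise ValueError("need exactly one weight per line")
    if not lines:
        return None
    cumulative = list(accumulate(weights))        # running totals of the weights
    x = rng.random() * cumulative[-1]             # random point in [0, total)
    return lines[bisect.bisect(cumulative, x)]    # first running total above x

# Hypothetical file contents: "name,population" on each line.
lines = ["City1,3298181", "City2,3013491", "City3,900129"]
populations = [int(line.split(",")[1]) for line in lines]
print(weighted_pick(lines, populations))
```

Passing list(range(len(lines), 0, -1)) as the weights reproduces the top-weighted behaviour of weighted_select.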

Answer 3

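This is a very well-known scenario, and I believe it goes by the name of random sampling (more specifically, reservoir sampling). The following code will run in Python.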

    from random import random

    # assumes filename is already defined
    with open(filename, 'r') as f:
        out = None
        n = 0
        for line in f:
            n = n + 1
            # replace the current pick with probability 1/n
            if random() < 1.0 / n:
                out = line
    print(out)
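For reference, here is a self-contained sketch of single-pass (reservoir) selection, in which the n-th line replaces the current pick with probability 1/n. Note that this variant gives every line the same chance, so it is a uniform pick and not one weighted toward the top of the file. pick_uniform_line and rng are names made up for this illustration:

```python
import random

def pick_uniform_line(lines, rng=random):
    """Single-pass uniform pick from an iterable of unknown length."""
    chosen = None
    for n, line in enumerate(lines, 1):
        # Keep the n-th line with probability 1/n; after the loop every
        # line has had the same overall chance of being the survivor.
        if rng.random() < 1.0 / n:
            chosen = line
    return chosen

print(pick_uniform_line(["a\n", "b\n", "c\n"]))
```

Because it never needs the line count up front, it works on a file object directly, for example pick_uniform_line(open(filename)).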

Weighted random selection from a variable-length text file
