Question

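Suppose I have 10,000 profiles of booking customers. Each profile has the following variables:

Duration (days of holiday)
Destination (e.g. Brazil)
People_amount (how many people)
Departure (the date they want to leave)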

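I want to pass 1,000 of them (10%) through the booking engine for pricing, but to get unbiased insights from the analysis, I have to distribute the profile characteristics as evenly as possible. For example, if the profiles have three values of People_amount (1, 2 and 3), I ultimately want the 10% sample to consist of 33.33% People_amount = 1, 33.33% People_amount = 2 and 33.33% People_amount = 3.

But...

Because the set of profiles is not evenly distributed (e.g. 70% of all profiles have People_amount = 1), I can't figure out how to create some kind of loop (or something else) that fills the SELECT by cycling through all values of that characteristic until one value is exhausted, and then continues with the remaining ones.

Maybe an example of how I would like to fill the 10% sample of the 10k profiles: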

Profile_id  People_amount                                           Profile_id  People_amount
1           1                                                       1           1
2           1                                                       5           2
3           1                                                       8           3
4           1       --> Filling the sample by even distribution     2           1
5           2           of available profile characteristics        6           2
6           2                                                       9           3
7           2                                                       3           1
8           3                                                       7           2
9           3                                                       4           1

Hope you can help!

Answer 1

You can use UNION with a LIMIT on each sub-select:

(SELECT * FROM profiles WHERE People_amount=1 LIMIT 333)
UNION
(SELECT * FROM profiles WHERE People_amount=2 LIMIT 333)
UNION
(SELECT * FROM profiles WHERE People_amount=3 LIMIT 333)

The parentheses are needed so that each LIMIT applies to its own sub-select rather than to the combined result.
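Note that a LIMIT without an ORDER BY returns arbitrary rows, not random ones. The following is a minimal sketch of a variant that draws random rows within each group, assuming the same `profiles` table and the three People_amount values from the question. The first limit is 334 rather than 333 only so that the total comes to 1,000, and UNION ALL is used because the three groups cannot overlap:

```sql
-- Sketch: random rows per group; the limits 334/333/333 are chosen to total 1000.
(SELECT * FROM profiles WHERE People_amount = 1 ORDER BY RAND() LIMIT 334)
UNION ALL
(SELECT * FROM profiles WHERE People_amount = 2 ORDER BY RAND() LIMIT 333)
UNION ALL
(SELECT * FROM profiles WHERE People_amount = 3 ORDER BY RAND() LIMIT 333);
```

If one of the groups has fewer rows than its limit, the sample simply comes out smaller than 1,000; this variant does not top it up from the other groups.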

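A more dynamic approach

If the possible values of people_amount are not known in advance, the approach above is not feasible. In that case I would suggest a query whose ORDER BY clause weights each row by the number of occurrences of its people_amount value. It will not give an exactly equal distribution, but the different values will have a similar presence in the result set: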

select     p.*
from       (
            select   people_amount,  
                     count(*) as occurrences
            from     profiles
            group by people_amount) as stats
inner join profiles p
        on p.people_amount = stats.people_amount         
order by   rand() * stats.occurrences
limit      1000

SQL fiddle (if it is not overloaded).

If you want to extend this to other columns, such as Destination, you can do so as follows:

select     p.*
from       (
            select   people_amount,  
                     destination,
                     count(*) as occurrences
            from     profiles
            group by people_amount,
                     destination) as stats
inner join profiles p
        on p.people_amount = stats.people_amount         
       and p.destination = stats.destination
order by   rand() * stats.occurrences
limit      1000

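The idea is that values with a low number of occurrences tend to get lower sort values, and will therefore show up more often at the start of the result set, which compensates for their low frequency. Each row's sort value is uniformly distributed between 0 and the occurrence count of its value, so below any cutoff that is smaller than the smallest group, every value is expected to contribute roughly the same number of rows.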
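If an exact round-robin fill like the one sketched in the question is wanted, window functions may be an alternative. The following is a sketch of a different technique from the weighted ORDER BY above, and it assumes MySQL 8.0 or later (earlier versions have no window functions). The alias `rn` and the derived-table name `ranked` are made up for this illustration:

```sql
-- Sketch, assuming MySQL 8.0+ and the profiles table from the question.
SELECT ranked.*
FROM (
    SELECT p.*,
           -- number the rows within each people_amount value, in random order
           ROW_NUMBER() OVER (PARTITION BY p.people_amount
                              ORDER BY RAND()) AS rn
    FROM profiles p
) AS ranked
-- take every value's 1st row, then every value's 2nd row, and so on;
-- a value that runs out of rows just stops contributing
ORDER BY ranked.rn, ranked.people_amount
LIMIT 1000;
```

To balance over several characteristics at once, the same sketch should work with more columns in the PARTITION BY list (for example `p.people_amount, p.destination`), at the cost of sorting the whole table once.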

MySQL: evenly distributed sample from unevenly distributed data
