How do I select a balanced random sample of records from two tables?

Time: 2012-10-02 16:37:44

Tags: sql postgresql

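To train a machine learning model, I need to retrieve a sample of users made up of equal numbers of current and former users. The database consists of the tables all_users and former_users.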

For an unbalanced sample of 100 records, the following query returns records with the required columns:

SELECT t1.user_property1, t2.user_property2, t3.valid_to
FROM additional_info t1
LEFT JOIN all_users t2 ON t1.user_ID = t2.user_ID
LEFT JOIN former_users t3 ON t1.user_ID = t3.user_ID
ORDER BY random()
LIMIT 100

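To get a balanced sample, half of the user records should be stored in table former_users, and the other half should be stored in table all_users while not appearing in table former_users (otherwise the sample would not be balanced).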

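Does anyone know the most convenient way to retrieve a balanced random sample from the tables all_users and former_users, together with the additional properties from the table additional_info?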

Thanks!

2 Answers:

Answer 0: (score: 1)

One thing you might consider doing is:

Query 1 - SELECTS random non-former users joined to additional_info with a LIMIT of 50
Query 2 - SELECTS random former users joined to additional_info with a LIMIT of 50

Then combine the results with UNION:

(Query 1) UNION (Query 2)

This will give you random results for both conditions, 100 users in total.
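A rough sketch of the two queries described above, assuming PostgreSQL and the table and column names from the question (additional_info, all_users, former_users, user_ID). Two choices here are assumptions rather than part of the answer: UNION ALL replaces UNION, because plain UNION removes duplicate rows and could shrink the sample below 100 when two users share the same property values; and NOT EXISTS is used for the "not a former user" test, because NOT IN returns no rows at all if the subquery yields a NULL.

```sql
-- Sketch only: names are taken from the question, not verified against a real schema.
(
  -- Query 1: 50 random current users (present in all_users, absent from former_users)
  SELECT t1.user_property1, t2.user_property2, NULL AS valid_to
  FROM additional_info t1
  JOIN all_users t2 ON t2.user_ID = t1.user_ID   -- assumption: current users must exist in all_users
  WHERE NOT EXISTS (
    SELECT 1 FROM former_users f WHERE f.user_ID = t1.user_ID
  )
  ORDER BY random()
  LIMIT 50
)
UNION ALL
(
  -- Query 2: 50 random former users (joins as in the question)
  SELECT t1.user_property1, t2.user_property2, t3.valid_to
  FROM additional_info t1
  LEFT JOIN all_users t2 ON t2.user_ID = t1.user_ID
  JOIN former_users t3 ON t3.user_ID = t1.user_ID
  ORDER BY random()
  LIMIT 50
);
```

The parentheses matter: without them, ORDER BY and LIMIT would apply to the combined result instead of to each branch. The sample is only balanced if each group contains at least 50 rows.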

Answer 1: (score: 1)

I did the following:

(SELECT t1.user_property1, t2.user_property2, t3.valid_to FROM additional_info t1 LEFT JOIN all_users t2 ON t1.user_ID = t2.user_ID INNER JOIN former_users t3 ON t1.user_ID = t3.user_ID ORDER BY random() LIMIT 50)
UNION
(SELECT t1.user_property1, t2.user_property2, NULL FROM additional_info t1 LEFT JOIN all_users t2 ON t1.user_ID = t2.user_ID WHERE t1.user_ID NOT IN (SELECT user_ID FROM former_users) ORDER BY random() LIMIT 50)

But I am looking for a better solution.
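One possible alternative, offered as a sketch and not as a definitive answer: number the rows randomly within each group using a window function, then keep the first 50 of each. It assumes PostgreSQL, the table and column names from the question, and at most one row per user_ID in former_users. The alias ranked and the column rn are made up for illustration.

```sql
-- Sketch only: a single query that ranks rows randomly within each group.
SELECT user_property1, user_property2, valid_to
FROM (
  SELECT t1.user_property1,
         t2.user_property2,
         t3.valid_to,
         row_number() OVER (
           PARTITION BY (t3.user_ID IS NOT NULL)  -- true = former user, false = current user
           ORDER BY random()
         ) AS rn
  FROM additional_info t1
  LEFT JOIN all_users t2 ON t2.user_ID = t1.user_ID
  LEFT JOIN former_users t3 ON t3.user_ID = t1.user_ID
) ranked
WHERE rn <= 50;
```

This avoids repeating the joins and makes the group size a single number to change, but it still sorts every row randomly, so it is not necessarily faster than the UNION version on large tables.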
