Question

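I have two tables: one contains all user_ids and their attributes, and the other contains only the interesting user_ids and their attributes. I want to query them to build a training set for a machine learning problem.

In pure SQL, I would do this: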

select label, user_id, feature 
from (
   select 1 as label, user_id, feature
   from interesting_table

   UNION ALL

   select 0 as label, a.user_id, a.feature
   from alldata_table a
   left join
   interesting_table b
   on a.user_id = b.user_id
   where b.user_id is null
)

In Spark, pulling from interesting_table is easy, but the left join between interesting_table and alldata_table is proving expensive. Should I:

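Do the above in SQL and then pull the result in as a DataFrame?
Create interesting_table and alldata_table as DataFrames and use the .join() operator?
Create interesting_table and alldata_table as DataFrames, get the unique members of interesting_df.user_id, and subset alldata_df.user_id by negating '.isin()'?
Something else?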

Answer 1

I'm not sure this is the best answer, but I ended up using the DataFrame API with a broadcast.

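from pyspark.sql.functions import broadcast

alldata_table = spark.table('alldata_table')
interesting_table = spark.table('interesting_table')
# withColumnRenamed returns a new DataFrame, so keep the result
interesting_table = interesting_table.withColumnRenamed('user_id', 'user_id_interesting')

# Broadcast the small table and left-join it onto the full table
new_table = alldata_table.join(broadcast(interesting_table),
  on=[alldata_table['user_id'] == interesting_table['user_id_interesting']],
  how='left_outer')
# Rows with no match in interesting_table are the negative examples
new_table = new_table.filter(new_table['user_id_interesting'].isNull())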

Of course, this assumes interesting_table is small enough to broadcast. Presumably it could be reduced to just the user_id field to make it smaller.

How to join efficiently in Spark?
