PySpark: how to select * from the left table during an RDD join

Time: 2016-06-26 00:13:32

Tags: python apache-spark pyspark

How do I select * from the left table in a PySpark join?

impression_rdd.join(
        click_rdd, 
        impression_rdd.session_id == click_rdd.session_id, 
        "left_outer"
    ).select(impression_rdd.*)  # <-- pseudocode; how do you do this?

Basically, I want the equivalent of this SQL:

SELECT impression.* FROM impression LEFT JOIN click on (impression.session_id = click.session_id)

2 answers:

Answer 0: (score: 2)

You can simply add an alias and a few quotes to your pseudocode:

(impressions.alias("impressions")
    .join(clicks, ["session_id"], "left_outer")  # join key taken from the question
    .select("impressions.*"))

Answer 1: (score: 1)

Two more constructs equivalent to zero323's answer:

(impressions.join(clicks, 'session_id', 'left_outer')
    .select(*impressions.columns))

If the right table contributes only one column that you want to drop, say 'count', this may be more readable:

(impressions.join(clicks, 'session_id', 'left_outer')
    .drop('count'))