IN operator in Apache Pig

Time: 2020-10-23 01:23:44

Tags: apache-pig

Is there an equivalent of the IN operator in Apache Pig? I am currently using Apache Pig 0.10.0.

I would like to do something like this:

select count(distinct(o.order_id)),count(od.prod_id),count(od.prod_id)/count(distinct(o.order_id)) 
    from orders o 
    inner join order_details od 
    on od.order_id=o.order_id 
    where o.order_id 
    in (
        select * 
        from (select o.order_id 
                from orders o 
                inner join order_details od 
                on od.order_id = o.order_id 
                where(o.order_date between '2013-05-01' and '2013-05-31') and (od.prod_id=1274348)
        ) as subq
    );

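1 answer:

Answer 0 (score: 0)

This may be an equivalent script in Pig. You can create as many intermediate relations as you need to get the required data before generating the counts. Note that I have treated the dates as timestamps; you could instead use the built-in ToDate UDF, which converts a UNIX timestamp, or a date given as a chararray, into the native Pig DateTime type.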

-- Load in all of your data
-- Replace with actual paths
-- You may need to supply a delimiter value

o = LOAD 'orders' USING PigStorage() AS (
    order_date:long,
    order_id:chararray
);

od = LOAD 'order_details' USING PigStorage() AS (
    order_id:chararray,
    prod_id:chararray
);

-- Filter like WHERE in SQL
-- Replace 1000 and 2000 with actual timestamps

o_filtered = FILTER o BY order_date <= 2000 AND order_date >= 1000;

od_filtered = FILTER od BY prod_id == '1274348';

-- Inner join - only needed once in Pig

subq = JOIN o_filtered BY order_id, od_filtered BY order_id;

-- Drop fields not needed for final counts

subq_renamed = FOREACH subq GENERATE
    o_filtered::order_id AS order_id,
    od_filtered::prod_id AS prod_id;

-- To do the counts, need to group the data

subq_counts = FOREACH (GROUP subq_renamed ALL) {
    dist_order_id = DISTINCT subq_renamed.order_id;
    GENERATE
        COUNT(dist_order_id) AS dist_order_id_count,
        COUNT(subq_renamed.prod_id) AS prod_id_count;
}

-- Calculate the ratio count(od.prod_id)/count(distinct(o.order_id))

final_counts = FOREACH subq_counts GENERATE *,
    (float)prod_id_count/dist_order_id_count AS prod_order_ratio;
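
One thing to watch: as far as I can tell, the script above counts only the joined rows for product 1274348, while the SQL in the question counts every order_details row belonging to an order that matches the subquery. A minimal sketch of that IN behaviour, written as a semi-join, might look like the following. The relation names `matching`, `matching_ids_dup`, `matching_ids`, `od_in`, `od_in_renamed`, `in_counts` and `in_final` are made up for this example, the paths and the 1000/2000 timestamps are the same placeholders as above, and it assumes order_id is unique in orders.

```pig
-- Sketch only: same placeholder paths and schemas as the script above
o = LOAD 'orders' USING PigStorage() AS (order_date:long, order_id:chararray);
od = LOAD 'order_details' USING PigStorage() AS (order_id:chararray, prod_id:chararray);

-- Subquery: ids of orders in the date range that contain the product
-- (1000 and 2000 are placeholders for real timestamps)
o_filtered = FILTER o BY order_date >= 1000 AND order_date <= 2000;
od_filtered = FILTER od BY prod_id == '1274348';
matching = JOIN o_filtered BY order_id, od_filtered BY order_id;
matching_ids_dup = FOREACH matching GENERATE o_filtered::order_id AS order_id;

-- DISTINCT so the join below does not duplicate detail rows
matching_ids = DISTINCT matching_ids_dup;

-- IN: keep every UNFILTERED order_details row whose order_id is in matching_ids
od_in = JOIN od BY order_id, matching_ids BY order_id;
od_in_renamed = FOREACH od_in GENERATE
    od::order_id AS order_id,
    od::prod_id AS prod_id;

-- Same counting step as above, applied to the semi-joined rows
in_counts = FOREACH (GROUP od_in_renamed ALL) {
    dist_order_id = DISTINCT od_in_renamed.order_id;
    GENERATE
        COUNT(dist_order_id) AS dist_order_id_count,
        COUNT(od_in_renamed.prod_id) AS prod_id_count;
}

in_final = FOREACH in_counts GENERATE *,
    (float)prod_id_count / dist_order_id_count AS prod_order_ratio;
```

Joining against the distinct list of ids is the usual way to emulate `IN (subquery)` when no such operator is available. Later Pig releases (0.12.0 onward, if I remember correctly) added an IN operator for use in FILTER, but it takes a list of literal values rather than a subquery, so the join would probably still be needed for this case.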