Redshift - Simplifying a query plan

Time: 2016-09-26 03:53:44

Tags: amazon-redshift

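I have two tables in Redshift that I'm trying to join on a user's normalized IP address in order to get zip code demographic information. By normalized, I mean the address is converted to a string of uniform length with the periods stripped, so that values can be compared directly with each other. For example, the following is applied to all IPs and stored in the tables before any join is performed: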

lpad(split_part(ip, '.', 1), 3, '0') ||
lpad(split_part(ip, '.', 2), 3, '0') ||
lpad(split_part(ip, '.', 3), 3, '0') ||
lpad(split_part(ip, '.', 4), 3, '0')

So 209.170.151.71 becomes 209170151071.
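
As a quick sanity check of the normalization above, here is a minimal Python sketch that mirrors the lpad/split_part expression. The function name normalize_ip is hypothetical and only for illustration; it is not part of the original setup.

```python
def normalize_ip(ip: str) -> str:
    """Zero-pad each dotted-quad octet to three digits and concatenate them.

    Hypothetical helper mirroring the SQL expression
    lpad(split_part(ip, '.', n), 3, '0') for n = 1..4.
    """
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError(f"expected four octets, got {ip!r}")
    return "".join(octet.zfill(3) for octet in octets)


print(normalize_ip("209.170.151.71"))  # 209170151071
```

Because every normalized value has exactly 12 digits, comparing the strings and comparing the same values as integers give the same ordering, so a BETWEEN test behaves the same on either representation.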

I have two tables. The first is visitor_details, which contains the following:

-----------------------------
| visitor_id |      ip      |
-----------------------------
|      1     | 209170151071 |
|      2     | 123170167071 |
      ...           ...
| 50000000   | 001213020341 |
-----------------------------

I also have a table called geo_ip, which is structured as follows:

---------------------------------------
|   start_ip   |    end_ip    |  zip  |
---------------------------------------
| 209170151071 | 209170151071 | 11101 |
| 309170151071 | 409170151071 | 11102 |
|     ...      |     ...      |  ...  |
| 509170151071 | 609170151071 | 11103 |
---------------------------------------

I'm trying to run the following query:

WITH vd AS (
  SELECT visitor_id,
         ip_address as c_ip
  FROM dev.visitor_details
)
SELECT
  visitor_id,
  c_ip,
  g.*
FROM
  vd
JOIN
  dev.geo_ip g
  ON vd.c_ip BETWEEN g.startip and g.endip
LIMIT 500;

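The sort key on geo_ip is an interleaved sort key on startip and endip. The table doesn't appear to be skewed either. However, running the query results in an extremely long execution time (it never finishes). Looking at the explain plan, I see the following: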

XN Limit  (cost=0.00..245.17 rows=500 width=238)
   ->  XN Nested Loop DS_BCAST_INNER  (cost=0.00..18442148764959.20 rows=37610983146614 width=238)
         Join Filter: ((("inner".startip)::text <= ("outer".ip_address)::text) AND (("inner".endip)::text >= ("outer".ip_address)::text))
         ->  XN Seq Scan on visitor_details  (cost=0.00..596971.20 rows=59697120 width=72)
         ->  XN Seq Scan on geo_ip g  (cost=0.00..56702.71 rows=5670271 width=166)
 ----- Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products -----

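What's strange is that if I hard-code the IP address in the join, the query plan looks normal.

Can anyone suggest how to optimize the query or the table setup so that it runs efficiently?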

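Update

I made the changes suggested in the first response, but I'm still seeing the nested loop. All IPs are now bigints, and the WITH statement has been removed.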

explain SELECT 
    vd.visitor_id,
    vd.ip_address,
    gi.zip
FROM
dev.visitor_details2 vd
JOIN dev.geo_ip3 gi ON vd.ip BETWEEN gi.startip and gi.endip
LIMIT 500;


                                               QUERY PLAN                                                
---------------------------------------------------------------------------------------------------------
 XN Limit  (cost=0.00..136.62 rows=500 width=51)
   ->  XN Nested Loop DS_BCAST_INNER  (cost=0.00..10276958524959.20 rows=37610983146614 width=51)
         Join Filter: (("inner".startip <= "outer".ip) AND ("inner".endip >= "outer".ip))
         ->  XN Seq Scan on visitor_details2 vd  (cost=0.00..596971.20 rows=59697120 width=52)
         ->  XN Seq Scan on geo_ip3 gi  (cost=0.00..56702.71 rows=5670271 width=23)
 ----- Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products -----
(6 rows)

Update 2: Here are the table definitions, to confirm that the IP columns are all bigint:

master=# \d dev.visitor_details2;
          Table "dev.visitor_details2"
   Column   |          Type          | Modifiers 
------------+------------------------+-----------
 id         | integer                | not null
 visitor_id | character varying(108) | 
 ip         | bigint                 | 
 ip_address | character varying(192) | 
 domain     | integer                | 
Indexes:
    "visitor_details2_pkey" PRIMARY KEY, btree (id)

master=# \d dev.geo_ip3;
                Table "dev.geo_ip3"
    Column    |          Type          | Modifiers 
--------------+------------------------+-----------
 startip      | bigint                 | 
 endip        | bigint                 | 
 country      | character varying(16)  | 
 region       | character varying(32)  | 
 city         | character varying(32)  | 
 zip          | character varying(16)  | 
 latitude     | double precision       | 
 longitude    | double precision       | 
 areacode     | integer                | 
 metrocode    | integer                | 
 timezone     | character varying(32)  | 
 isp          | character varying(128) | 
 organization | character varying(128) | 
 netspeed     | character varying(32)  | 
 domain       | character varying(128) | 

1 Answer:

Answer 0: (score: 0)

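I'm not sure why you are using a WITH statement here. Have you had a look at the documentation? I think what's happening is that it executes the query in the WITH block for every entry in the access_detail table and then has to broadcast the result to the other nodes (XN Nested Loop DS_BCAST_INNER). You can also see that it is joining on text: Join Filter: ((("inner".startip)::text. You should consider changing the data type of ip_address to BIGINT.

I would write the query this way: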

SELECT 
    vd.visitor_id,
    vd.ip_address,
    gi.zip
FROM
dev.visitor_details vd
JOIN geo_ip gi ON vd.ip_address BETWEEN gi.start_ip and gi.end_ip
LIMIT 500;

Update

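It looks like the way Redshift handles a join on BETWEEN is quite expensive. Have you thought about explicitly adding all the IP addresses in those ranges and using ip_address as the sort key? I know the row count of that table would get quite large, but if you use proper compression (DELTA32K for the ip and RUNLENGTH for the zip) and distribute the table to all nodes, this might be a solution.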
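
To make the expansion idea concrete, here is a small Python sketch of what it does logically: each inclusive range is unrolled into one entry per value, so the range test becomes an exact-match lookup (the analogue of an equality join). The function name expand_ranges and all sample rows are made up for illustration; this is not Redshift code and says nothing about how large the real expanded table would be.

```python
def expand_ranges(ranges):
    """Unroll inclusive (start, end, zip) ranges into a value -> zip mapping.

    Hypothetical helper: stands in for a pre-expanded geo table that has
    one row per IP value instead of one row per range.
    """
    lookup = {}
    for start, end, zip_code in ranges:
        for ip in range(start, end + 1):
            lookup[ip] = zip_code
    return lookup


# Made-up sample data; real ranges would come from the geo table.
geo_ranges = [
    (209170151071, 209170151073, "11101"),
    (209170151080, 209170151081, "11102"),
]
visitors = [(1, 209170151072), (2, 209170151081), (3, 123170167071)]

lookup = expand_ranges(geo_ranges)

# Equality lookup replaces the BETWEEN test; visitor 3 matches no range.
matches = [(vid, ip, lookup[ip]) for vid, ip in visitors if ip in lookup]
print(matches)
```

One caveat with the zero-padded encoding from the question: the integers between a start and end value include numbers that are not valid addresses (any three-digit group from 256 to 999), so a naive expansion produces more rows than there are real IPs in the range.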