Optimizing date queries on large child tables: GiST or GIN?

Date: 2010-05-20 04:43:09

Tags: sql postgresql date query-optimization full-table-scan

Problem

There are 72 child tables, each with a year index and a station index, defined as follows:

CREATE TABLE climate.measurement_12_013
(
-- Inherited from table climate.measurement_12_013:  id bigint NOT NULL DEFAULT nextval('climate.measurement_id_seq'::regclass),
-- Inherited from table climate.measurement_12_013:  station_id integer NOT NULL,
-- Inherited from table climate.measurement_12_013:  taken date NOT NULL,
-- Inherited from table climate.measurement_12_013:  amount numeric(8,2) NOT NULL,
-- Inherited from table climate.measurement_12_013:  category_id smallint NOT NULL,
-- Inherited from table climate.measurement_12_013:  flag character varying(1) NOT NULL DEFAULT ' '::character varying,
  CONSTRAINT measurement_12_013_category_id_check CHECK (category_id = 7),
  CONSTRAINT measurement_12_013_taken_check CHECK (date_part('month'::text, taken)::integer = 12)
)
INHERITS (climate.measurement);

CREATE INDEX measurement_12_013_s_idx
  ON climate.measurement_12_013
  USING btree
  (station_id);
CREATE INDEX measurement_12_013_y_idx
  ON climate.measurement_12_013
  USING btree
  (date_part('year'::text, taken));

(Foreign key constraints will be added later.)

The following query runs very slowly because of a full table scan:

SELECT
  count(1) AS measurements,
  avg(m.amount) AS amount
FROM
  climate.measurement m
WHERE
  m.station_id IN (
    SELECT
      s.id
    FROM
      climate.station s,
      climate.city c
    WHERE
        /* For one city... */
        c.id = 5182 AND

        /* Where stations are within an elevation range... */
        s.elevation BETWEEN 0 AND 3000 AND

        /* and within a specific radius... */
        6371.009 * SQRT( 
          POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
            (COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
              POW(RADIANS(c.longitude_decimal - s.longitude_decimal), 2))
        ) <= 50
    ) AND

  /* Data before 1900 is shaky; insufficient after 2009. */
  extract( YEAR FROM m.taken ) BETWEEN 1900 AND 2009 AND

  /* Whittled down by category... */
  m.category_id = 1 AND

  /* Between the selected days and years... */
  m.taken BETWEEN
   /* Start date. */
   (extract( YEAR FROM m.taken )||'-01-01')::date AND
    /* End date. Calculated by checking to see if the end date wraps
       into the next year. If it does, then add 1 to the current year.
    */
    (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
      sign(
        (extract( YEAR FROM m.taken )||'-12-31')::date -
        (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
    ) AS text)||'-12-31')::date
GROUP BY
  extract( YEAR FROM m.taken )

The slowness comes from this part of the query:

  m.taken BETWEEN
    /* Start date. */
  (extract( YEAR FROM m.taken )||'-01-01')::date AND
    /* End date. Calculated by checking to see if the end date wraps
      into the next year. If it does, then add 1 to the current year.
    */
    (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
      sign(
        (extract( YEAR FROM m.taken )||'-12-31')::date -
        (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
    ) AS text)||'-12-31')::date

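This part of the query matches a selection of days. For example, if the user wants to look at data between June 1 and July 1 over all years for which there is data, the clause above matches only those days. If the user wants to look at data between December 22 and March 22, again for all years for which there is data, the clause calculates that March 22 falls in the year after December 22 and matches the dates accordingly.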

Currently the dates are fixed at January 1 to December 31, but they will be parameterized, as shown above.

The HashAggregate in the plan shows a cost of 10006220141.11, which I suspect is on the astronomically huge side.

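A full table scan is being performed on the measurement table, which itself has neither data nor indexes. That table aggregates 273 million rows from its child tables.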

Question

What is the proper way to index the dates to avoid full table scans?

Options I have considered:

  • GIN
  • GiST
  • Rewrite the WHERE clause
  • Split out year_taken, month_taken, and day_taken columns in the tables

What are your thoughts?

Thank you!

3 answers:

Answer 0: (score: 2)

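Your problem is that the WHERE clause depends on a calculation on the date. The database cannot use an index if it has to fetch every row and run a calculation on it before it knows whether the date matches.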

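Unless you rewrite the clause so that the database has a fixed range to check, one that does not depend on the data being retrieved, you will always have to scan the table.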
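A minimal sketch of what such a fixed range could look like, covering only the year filter from the question. It assumes a plain b-tree index on the raw taken column; the index name measurement_12_013_taken_idx is hypothetical and does not appear in the question's schema.

```sql
-- Hypothetical b-tree index on the raw date column of one child table.
CREATE INDEX measurement_12_013_taken_idx
  ON climate.measurement_12_013
  USING btree
  (taken);

-- Both bounds are constants, so they do not depend on the row being tested.
SELECT
  count(1) AS measurements,
  avg(m.amount) AS amount
FROM
  climate.measurement m
WHERE
  m.taken BETWEEN DATE '1900-01-01' AND DATE '2009-12-31'
GROUP BY
  extract( YEAR FROM m.taken );
```

Whether the planner actually chooses the index depends on how selective the range is; a range that covers most of the table may still be scanned sequentially.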

Answer 1: (score: 1)

Try something like this:

create temporary table test (d date);

insert into test select '1970-01-01'::date+generate_series(1,50*365);

analyze test;

create function month_day(d date) returns int as $$
  select extract(month from $1)::int*100+extract(day from $1)::int $$
language sql immutable strict;

create index test_d_month_day_idx on test (month_day(d));

explain analyze select * from test
  where month_day(d)>=month_day('2000-04-01')
  and month_day(d)<=month_day('2000-04-05');
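
A range that wraps past the end of the year, such as the December 22 to March 22 case from the question, could be sketched with the same function by splitting it into two halves. This reuses the test table and the month_day function defined above; the year 2000 in the literals is arbitrary, since only the month and day are used.

```sql
-- Days from December 22 through March 22, in every year.
-- month_day gives 1222 for December 22 and 322 for March 22, so the
-- wrapped range is "on or after 1222" OR "on or before 322".
explain analyze select * from test
  where month_day(d) >= month_day('2000-12-22')
     or month_day(d) <= month_day('2000-03-22');
```

Whether the planner uses test_d_month_day_idx for both halves of the OR is something to confirm from the explain analyze output rather than assume.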

Answer 2: (score: 0)

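To run this efficiently across those partitions, I would make your application a lot smarter about the date ranges. Have it generate an actual list of dates to check for each partition, and then have it generate a single UNION query across the partitions. It sounds like your data set is fairly static, so a CLUSTER on your date index could also improve performance considerably.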
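
A rough sketch of the kind of query the application might generate, under several assumptions: climate.measurement_01_013 is a guessed name for the January counterpart of the December table shown in the question, the date range is made up for illustration, and measurement_12_013_taken_idx is a hypothetical b-tree index on taken. UNION ALL is used here instead of a plain UNION, which is my substitution: the partitions hold disjoint rows, and a plain UNION would also collapse rows that happen to share the same values.

```sql
-- Illustrative range: December 22, 1999 through January 22, 2000,
-- split into one constant range per monthly partition.
SELECT taken, amount
FROM climate.measurement_12_013
WHERE taken BETWEEN DATE '1999-12-22' AND DATE '1999-12-31'
UNION ALL
SELECT taken, amount
FROM climate.measurement_01_013  -- assumed partition name
WHERE taken BETWEEN DATE '2000-01-01' AND DATE '2000-01-22';

-- Physically reorder one partition by date, using the hypothetical index.
CLUSTER climate.measurement_12_013 USING measurement_12_013_taken_idx;

-- Refresh planner statistics after the rewrite.
ANALYZE climate.measurement_12_013;
```

CLUSTER is a one-time reordering and takes an exclusive lock on the table while it runs, so it fits best when the data really is static.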