Summing different date intervals per year

Asked: 2018-02-09 19:34:31

Tags: mysql sql

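I have a lot of stores, and I want to compare this year's energy consumption to date with the same period last year. My challenge is that, in the current year, the stores have delivered data for different date intervals. For example, store A may have data between 01.01.2018 and 20.01.2018, while store B may have data between 01.01.2018 and 28.01.2018. For each store, I want to sum the current-year interval and the same interval in the previous year.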

The data looks like this:

Store   Date    Sum
A   01.01.2018  12
A   20.01.2018  11
B   01.01.2018  33
B   28.01.2018  32

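There are millions of rows like these, though, and I want to use each store's dates as the reference for getting the same sum for the previous year.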

This is my (erroneous) attempt:

SET @curryear = (SELECT YEAR(MAX(start_date)) FROM energy_data);
SET @maxdate_curryear = (SELECT MAX(start_date) FROM energy_data WHERE YEAR(start_date) = @curryear);
SET @mindate_curryear = (SELECT MIN(start_date) FROM energy_data WHERE YEAR(start_date) = @curryear);

-- the same date intervals last year

SET @maxdate_prevyear = (@maxdate_curryear - INTERVAL 1 YEAR); 
SET @mindate_prevyear = (@mindate_curryear - INTERVAL 1 YEAR); 

-- sums current year

CREATE TABLE t_sum_curr AS
SELECT
    name AS name_curr,
    SUM(kwh) AS sum_curr,
    MIN(start_date) AS min_date_curr,
    MAX(start_date) AS max_date_curr,
    COUNT(DISTINCT start_date) AS ant_timer
FROM energy_data
WHERE agg_type = 'timesnivå'
  AND start_date >= @mindate_curryear
  AND start_date <= @maxdate_curryear
GROUP BY name;

-- also seems fair, the same dates one year ago, figured I should find those first and in the next query use that to sum each stores between those date intervals

CREATE TABLE t_sum_prev AS
SELECT
    name_curr AS name_curr2,
    (min_date_curr - INTERVAL 1 YEAR) AS min_date_prev,
    (max_date_curr - INTERVAL 1 YEAR) AS max_date_prev
FROM t_sum_curr;

-- getting into trouble!

CREATE TABLE the_results AS
SELECT name, start_date, SUM(kwh) AS sum_prev
FROM energy_data
WHERE agg_type = 'timesnivå'
  AND start_date >= @mindate_prevyear
  AND start_date <= @maxdate_prevyear
GROUP BY name
HAVING start_date BETWEEN (SELECT min_date_prev FROM t_sum_prev)
                      AND (SELECT max_date_prev FROM t_sum_prev);

The last query fails with an error message saying that my subquery returns more than one row.

1 answer:

Answer 0 (score: 0)

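I assume that what you have is a list of energy consumption data in which the bills or readings were taken at irregular times, so that the consumption figures cover irregular periods.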

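The basic approach you need to take is to normalize the consumption periods: determine the number of days each period covers, then break each reading down across those days, attributing to each day the daily average consumption for that period.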
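As a rough sketch of the first half of that normalization, the per-day average for each reading could be derived as below. It assumes MySQL 8.0 or later (for window functions) and assumes that each row's consumption covers the span from its own start_date up to the same store's next start_date; the question does not say which way the periods run, so that is a guess. The columns name, start_date and kwh come from the question; period_end, days_covered and kwh_per_day are illustrative names.

```sql
-- Sketch only: assumes each reading covers [start_date, next start_date)
-- for the same store. The last reading per store has no successor, so its
-- period length is unknown and it is filtered out here.
-- Readings less than a day apart are also skipped; sub-daily data would
-- need TIMESTAMPDIFF(HOUR, ...) instead of DATEDIFF.
SELECT
    r.name
    ,r.start_date
    ,r.period_end
    ,DATEDIFF(r.period_end, r.start_date)         AS days_covered
    ,r.kwh / DATEDIFF(r.period_end, r.start_date) AS kwh_per_day
FROM
(
    SELECT
        ed.name
        ,ed.start_date
        ,ed.kwh
        ,LEAD(ed.start_date) OVER (PARTITION BY ed.name ORDER BY ed.start_date) AS period_end
    FROM
        energy_data AS ed
) AS r
WHERE
    r.period_end IS NOT NULL
    AND DATEDIFF(r.period_end, r.start_date) > 0
;
```

Expanding each period into one row per day would additionally need a calendar (dates) table to join against, which is left out of this sketch.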

I'm assuming the consumption periods are fully contiguous (as bills or readings usually are) and do not overlap.

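Because of the large number of rows involved (millions even in the current form), you probably won't want to keep the data in daily form either. Roll it back up into regular weekly, monthly, or quarterly periods, depending on the level of granularity the comparison needs.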
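For illustration, rolling the rows up to one row per store per month might look like the following. This sketch simply attributes each row to the month of its start_date, which is an assumption: a period that straddles a month boundary would first need to be split into days as described above. The names month_start and kwh_month are made up.

```sql
-- Sketch: one output row per store per calendar month.
SELECT
    ed.name
    ,DATE_FORMAT(ed.start_date, '%Y-%m-01') AS month_start
    ,SUM(ed.kwh)                            AS kwh_month
FROM
    energy_data AS ed
GROUP BY
    ed.name
    ,DATE_FORMAT(ed.start_date, '%Y-%m-01')
;
```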

Once you have regular periods, the comparison is a piece of cake.
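To make that concrete, suppose the regular periods live in a table with one row per store per month. The table energy_monthly and its columns name, month_start (a DATE) and kwh_month are hypothetical. A year-over-year comparison could then be sketched as a self-join:

```sql
-- Sketch: compare each month of 2018 with the same month one year earlier.
SELECT
    this_yr.name
    ,this_yr.month_start
    ,this_yr.kwh_month                     AS kwh_this_year
    ,last_yr.kwh_month                     AS kwh_last_year
    ,this_yr.kwh_month - last_yr.kwh_month AS kwh_change
FROM
    energy_monthly AS this_yr
LEFT JOIN
    energy_monthly AS last_yr
    ON (last_yr.name = this_yr.name)
    AND (last_yr.month_start = this_yr.month_start - INTERVAL 1 YEAR)
WHERE
    YEAR(this_yr.month_start) = 2018
;
```

Because of the LEFT JOIN, kwh_last_year and kwh_change come back as NULL for any store and month with no prior-year row.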

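If this is part of a report that will be run on an ongoing basis, you'll probably want to implement some logic that calculates the "regularized consumption" incrementally and on a schedule, and stores it in a summary table with appropriate columns and indexes, so that you don't have to process millions of historical rows every time the report runs.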
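One possible shape for such a summary table and its scheduled refresh is sketched below. The table name energy_daily_summary, its columns and types, and the three-day reload window are all assumptions; the right window depends on how late stores deliver their data.

```sql
CREATE TABLE IF NOT EXISTS energy_daily_summary
(
    store_name    VARCHAR(100) NOT NULL
    ,reading_day  DATE         NOT NULL
    ,kwh_day      DECIMAL(18,3)
    ,PRIMARY KEY (store_name, reading_day)
);

-- Re-summarize only the most recent days on each scheduled run.
-- Rows already present for those days are overwritten.
INSERT INTO energy_daily_summary (store_name, reading_day, kwh_day)
SELECT
    ed.name
    ,DATE(ed.start_date)
    ,SUM(ed.kwh)
FROM
    energy_data AS ed
WHERE
    ed.start_date >= CURRENT_DATE - INTERVAL 3 DAY
GROUP BY
    ed.name
    ,DATE(ed.start_date)
ON DUPLICATE KEY UPDATE
    kwh_day = VALUES(kwh_day)
;
```

The primary key on (store_name, reading_day) is what lets ON DUPLICATE KEY UPDATE replace existing days, and it also serves the per-store date-range lookups a report would make.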

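Trying to work around the irregular periods with fancy joins and on-the-fly averages (if it can be done at all), rather than resolving them directly, is likely to lead to very difficult logic and, especially on a data set of this size, terrible performance.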

EDIT: following on from the comments below.

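@Alexander, I've knocked together an example query. I haven't tested it, and I wrote it all in a text editor, so please excuse any small syntax errors. What I've come up with looks somewhat complicated (a good deal more than I imagined when I started), but I'm also a bit tired, so I'm not sure whether it could be simplified further.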

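The only point I'd make concerns the performance of this query (or any query like it): because of what it has to do when traversing the date ranges, it could be appalling on a table with millions of rows. I stand by my earlier comment that proper indexing of the source data will be crucial, and that summarizing the source data to a coarser granularity will help performance considerably (at the cost of a one-off hit to summarize it). Even daily granularity would reduce the number of rows by a factor of 24!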
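On the indexing point, one example of what proper indexing could mean here is a composite index on the store and date columns, which supports per-store date-range lookups. The index name is arbitrary, and whether agg_type belongs in the index depends on how selective that column is, which the question does not reveal.

```sql
-- Sketch: composite index for "one store, a range of dates" lookups.
CREATE INDEX idx_energy_data_name_start_date
    ON energy_data (name, start_date);
```

A filter written as YEAR(start_date) = ... generally cannot use such an index for a range scan, whereas comparing start_date against explicit lower and upper bounds can.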

-- Requires MySQL 8.0 or later (common table expressions).
WITH energy_data_ext AS
(
    SELECT
        ed.name                 AS store_name
        ,YEAR(ed.start_date)    AS reading_year
        ,ed.start_date          AS reading_date
        ,ed.kwh                 AS reading_kwh
    FROM
        energy_data AS ed
)

,available_stores AS
(
    SELECT ede.store_name
    FROM energy_data_ext AS ede
    GROUP BY ede.store_name
)

,current_reading_yr_per_store AS
(
    SELECT
        ede.store_name
        ,MAX(ede.reading_year)  AS current_reading_year
    FROM
        energy_data_ext AS ede
    GROUP BY 
        ede.store_name
)

,latest_reading_ranges_per_year AS
(
    SELECT
        ede.store_name
        ,ede.reading_year
        ,MAX(ede.reading_date) AS latest_reading_date_of_yr
    FROM
        energy_data_ext AS ede
    GROUP BY
        ede.store_name
        ,ede.reading_year
)

,store_reading_ranges AS
(
    SELECT
        avs.store_name
        ,lryps.current_reading_year
        ,lyrr.latest_reading_date_of_yr AS current_year_latest_reading_date

        ,(lryps.current_reading_year - 1)                   AS prev_reading_year
        ,(lyrr.latest_reading_date_of_yr - INTERVAL 1 YEAR) AS prev_year_latest_reading_date

    FROM
        available_stores AS avs

    LEFT JOIN
        current_reading_yr_per_store AS lryps
        ON (lryps.store_name = avs.store_name)

    LEFT JOIN
        latest_reading_ranges_per_year AS lyrr
        ON (lyrr.store_name = avs.store_name)
        AND (lyrr.reading_year = lryps.current_reading_year)
)

-- at this stage, we should have all the calculations we need to
-- establish the range for the latest year, and the range for the year prior to that

,current_year_consumption AS
(
    SELECT
        avs.store_name
        ,SUM(cyed.reading_kwh) AS latest_year_kwh

    FROM
        available_stores AS avs

    LEFT JOIN
        store_reading_ranges AS srs
        ON (srs.store_name = avs.store_name)

    LEFT JOIN
        energy_data_ext AS cyed
        ON (cyed.store_name = avs.store_name)
        AND (cyed.reading_year = srs.current_reading_year)
        AND (cyed.reading_date <= srs.current_year_latest_reading_date)

    GROUP BY
        avs.store_name
)

,prev_year_consumption AS
(
    SELECT
        avs.store_name
        ,SUM(pyed.reading_kwh) AS prev_year_kwh

    FROM
        available_stores AS avs

    LEFT JOIN
        store_reading_ranges AS srs
        ON (srs.store_name = avs.store_name)

    LEFT JOIN
        energy_data_ext AS pyed
        ON (pyed.store_name = avs.store_name)
        AND (pyed.reading_year = srs.prev_reading_year)
        AND (pyed.reading_date <= srs.prev_year_latest_reading_date)

    GROUP BY
        avs.store_name
)

SELECT
    avs.store_name

    ,srs.current_reading_year
    ,srs.current_year_latest_reading_date
    ,lyc.latest_year_kwh

    ,srs.prev_reading_year
    ,srs.prev_year_latest_reading_date
    ,pyc.prev_year_kwh

FROM
    available_stores AS avs

LEFT JOIN
    store_reading_ranges AS srs
    ON (srs.store_name = avs.store_name)

LEFT JOIN
    current_year_consumption AS lyc
    ON (lyc.store_name = avs.store_name)

LEFT JOIN
    prev_year_consumption AS pyc
    ON (pyc.store_name = avs.store_name)
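
For comparison, the same idea can be written more compactly without CTEs, which also lets it run on MySQL versions before 8.0. This is an untested sketch under the same assumptions as the query above, not a line-for-line rewrite of it: for each store it takes the latest start_date, sums that calendar year up to that date, and sums the previous year up to the same date one year earlier. The aliases rng, latest_date and curr_year_start are made up, and the agg_type filter from the question is left out.

```sql
-- Sketch: per-store year-to-date sums for the latest year and the year before.
SELECT
    ed.name
    ,rng.latest_date
    ,SUM(CASE WHEN ed.start_date >= rng.curr_year_start
               AND ed.start_date <= rng.latest_date
              THEN ed.kwh END) AS latest_year_kwh
    ,SUM(CASE WHEN ed.start_date >= rng.curr_year_start - INTERVAL 1 YEAR
               AND ed.start_date <= rng.latest_date - INTERVAL 1 YEAR
              THEN ed.kwh END) AS prev_year_kwh
FROM
    energy_data AS ed
INNER JOIN
(
    -- One row per store: its latest reading and January 1 of that year.
    SELECT
        name
        ,MAX(start_date)                    AS latest_date
        ,MAKEDATE(YEAR(MAX(start_date)), 1) AS curr_year_start
    FROM
        energy_data
    GROUP BY
        name
) AS rng
    ON (rng.name = ed.name)
WHERE
    ed.start_date >= rng.curr_year_start - INTERVAL 1 YEAR
GROUP BY
    ed.name
    ,rng.latest_date
;
```

Rows that fall in the previous year but after the cut-off date match neither CASE branch and so contribute nothing to either sum.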