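Code
The following code calculates the slope and intercept of a linear regression over a set of data. It then applies the equation y = mx + b
to the same result set to calculate the value of the regression line for each row.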
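As a cross-check of the arithmetic, here is a minimal Python sketch of the same sums that the SLOPE, INTERCEPT and CORRELATION expressions in the query use. The function name `linear_regression` and the sample points are hypothetical, and the population standard deviation is assumed in order to mirror MySQL's STDDEV().

```python
from math import sqrt


def linear_regression(points):
    """Return (slope, intercept, correlation) for a list of (x, y) pairs."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)

    # Same sign convention as the SQL: (sum x)^2 - n * sum(x^2)
    denom = sum_x ** 2 - n * sum_xx
    slope = (sum_x * sum_y - n * sum_xy) / denom
    intercept = (sum_x * sum_xy - sum_y * sum_xx) / denom

    # Population standard deviations (assumed to match MySQL STDDEV()).
    mean_x, mean_y = sum_x / n, sum_y / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x, _ in points) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for _, y in points) / n)
    correlation = (sum_xy / n - mean_x * mean_y) / (sd_x * sd_y)
    return slope, intercept, correlation


# Hypothetical data lying exactly on y = 2x + 1.
slope, intercept, correlation = linear_regression([(1, 3), (2, 5), (3, 7)])
regression_line = [slope * x + intercept for x in (1, 2, 3)]
```

The last line is the y = mx + b step: every row needs the one-row slope/intercept result, which is why the query joins the per-year data with the aggregate.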
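How can the two queries be joined so that the data and its slope/intercept are calculated without executing the WHERE clause twice?
The general form of the problem is: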
SELECT a.group, func(a.group, avg_avg)
FROM a,
(SELECT AVG(field1_avg) as avg_avg
FROM (SELECT a.group, AVG(field1) as field1_avg
FROM a
WHERE (SOME_CONDITION)
GROUP BY a.group) as several_lines -- potentially
) as one_line -- always
WHERE (SOME_CONDITION)
GROUP BY a.group -- again, potentially several lines
SOME_CONDITION is executed twice. For example (updated with the STRAIGHT_JOIN optimization):
SELECT STRAIGHT_JOIN
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
Y.YEAR as YEAR,
MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
ymxb.SLOPE,
ymxb.INTERCEPT,
ymxb.CORRELATION,
ymxb.MEASUREMENTS
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y,
MONTH_REF M,
DAILY D,
(SELECT
SUM(MEASUREMENTS) as MEASUREMENTS,
((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,
((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
(sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,
((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
(stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION
FROM (
SELECT STRAIGHT_JOIN
COUNT(1) as MEASUREMENTS,
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR as YEAR
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y,
MONTH_REF M,
DAILY D
WHERE
-- For a specific city ...
--
$X{ IN, C.ID, CityCode } AND
-- Find all the stations within a specific unit radius ...
--
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
SD.ID = S.STATION_DISTRICT_ID AND
-- Gather all known years for that station ...
--
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = $P{CategoryCode} AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
) t
) ymxb
WHERE
-- For a specific city ...
--
$X{ IN, C.ID, CityCode } AND
-- Find all the stations within a specific unit radius ...
--
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
SD.ID = S.STATION_DISTRICT_ID AND
-- Gather all known years for that station ...
--
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = $P{CategoryCode} AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
Question
How can the duplicated part be executed only once per query instead of twice? The duplicated code:
$X{ IN, C.ID, CityCode } AND
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
SD.ID = S.STATION_DISTRICT_ID AND
Y.STATION_DISTRICT_ID = SD.ID AND
Y.YEAR BETWEEN 1900 AND 2009 AND
M.YEAR_REF_ID = Y.ID AND
M.CATEGORY_ID = $P{CategoryCode} AND
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
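Update 1
Using variables and splitting the query seems to let the cache kick in: it now runs in 3.5 seconds, where it used to take 7. Still, if there is any way to remove the duplicated code, I would appreciate any help.
Update 2 (retracted)
The code above does not run in JasperReports, and a VIEW, while possibly a fix, would probably be very inefficient (because the WHERE clause is parameterized).
Update 3
Using Unreason's suggestion of the Pythagorean formula with converging meridians to check the distance: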
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) )
(This is unrelated to the question, but in case anyone else was wondering...)
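For anyone who wants to check the distance expression outside the database, here is a small Python sketch of the Pythagorean formula with converging meridians. The function name `distance_km` is hypothetical; the radius 6371.009 km is the constant used in the query. Note that in the usual statement of this formula the cosine of the mean latitude multiplies the longitude difference before squaring, whereas the SQL as posted applies the cosine to the already-squared difference, so the two will not agree exactly away from the equator.

```python
from math import cos, radians, sqrt

EARTH_RADIUS_KM = 6371.009  # same constant as in the query


def distance_km(lat1, lon1, lat2, lon2):
    """Flat-earth approximation with converging meridians (degrees in, km out)."""
    d_lat = radians(lat1 - lat2)
    d_lon = radians(lon1 - lon2)
    mean_lat = radians(lat1 + lat2) / 2
    # The cosine scales the longitude difference, and the product is squared.
    return EARTH_RADIUS_KM * sqrt(d_lat ** 2 + (cos(mean_lat) * d_lon) ** 2)


# One degree of longitude at the equator and at 60 degrees north.
at_equator = distance_km(0.0, 0.0, 0.0, 1.0)
at_60_north = distance_km(60.0, 0.0, 60.0, 1.0)
```

At 60 degrees north the meridians have converged enough that a degree of longitude covers about half the distance it does at the equator.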
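Update 4
The code, as shown, runs in JasperReports against a MySQL database. JasperReports does not allow variables or multiple queries.
Update 5
I am looking for a solution that works cleanly. ;-) I have written a number of partially working solutions but, sadly, MySQL does not understand "partially correct". See the discussion with Unreason for an answer that almost works.
Update 6
I might be able to reuse the variables from the first WHERE clause and compare them against the second (which would eliminate some duplication, namely the checks against the $P{} values), but I would really like to see the duplication removed altogether.
Update 7
Comparing against the YEAR clause, as hypothesized in the previous update, to eliminate the duplicated BETWEEN does not work.
Related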
How to eliminate duplicate calculation in SQL?
Thank you!
Answer 0 (score: 5)
You should be able to get everything you need in one pass:
SELECT
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR as YEAR,
MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
((avg(AVG(D.AMOUNT) * Y.YEAR)) - avg(AVG(D.AMOUNT)) * avg(Y.YEAR)) /
(stddev( AVG(D.AMOUNT) ) * stddev( Y.YEAR )) as CORRELATION,
((sum(Y.YEAR) * sum(AVG(D.AMOUNT))) - (count(1) * sum(Y.YEAR * AVG(D.AMOUNT)))) /
(power(sum(Y.YEAR), 2) - count(1) * sum(power(Y.YEAR, 2))) as SLOPE,
((sum( Y.YEAR ) * sum( Y.YEAR * AVG(D.AMOUNT) )) -
(sum( AVG(D.AMOUNT) ) * sum(power(Y.YEAR, 2)))) /
(power(sum(Y.YEAR), 2) - count(1) * sum(power(Y.YEAR, 2))) as INTERCEPT
FROM
CITY C,
STATION S,
YEAR_REF Y,
MONTH_REF M,
DAILY D
WHERE
$X{ IN, C.ID, CityCode } AND
SQRT(
POW( C.LATITUDE - S.LATITUDE, 2 ) +
POW( C.LONGITUDE - S.LONGITUDE, 2 ) ) < $P{Radius} AND
S.STATION_DISTRICT_ID = Y.STATION_DISTRICT_ID AND
Y.YEAR BETWEEN 1900 AND 2009 AND
M.YEAR_REF_ID = Y.ID AND
M.CATEGORY_ID = $P{CategoryCode} AND
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
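The query above will not run as written (it has nonsensical nested aggregates and other errors); this is a good time to check the formulas.
If you decide to go with subqueries, simplify the formulas, and then: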
Answer 1 (score: 1)
The problem is somewhat harder than your generalization suggests. I would put it as follows:
SELECT a.group, func(a.group, avg_avg)
FROM a,
(SELECT AVG(field1_avg) as avg_avg
FROM (SELECT a.group, AVG(field1) as field1_avg
FROM a
WHERE (YOUR_CONDITION)
GROUP BY a.group) as several_lines -- potentially
) as one_line -- always
WHERE (YOUR_CONDITION)
GROUP BY a.group -- again, potentially several lines
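You have a subset of the data (restricted by your condition) that is grouped and aggregated per group. You then merge the aggregates into a single value and want to apply a function of that value to each group again. Obviously, you cannot reuse the condition until the result of the grouped subquery can be referenced as an entity.
In MSSQL and Oracle you would use the WITH operator. In MySQL the only option is a temporary table. I assume there is more than one year in your report (otherwise the query would be much simpler).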
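To illustrate what the WITH approach looks like, here is a minimal sketch run through Python's sqlite3 module, which also supports WITH (MySQL itself gained it in version 8.0, after this answer was written). The table `a`, the columns `grp`, `field1` and `keep`, and the subtraction standing in for func() are all hypothetical; `keep = 1` plays the role of YOUR_CONDITION.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (grp TEXT, field1 REAL, keep INTEGER);
INSERT INTO a VALUES ('x', 1, 1), ('x', 3, 1), ('y', 10, 1), ('y', 99, 0);
""")

query = """
WITH several_lines AS (
    SELECT grp, AVG(field1) AS field1_avg
    FROM a
    WHERE keep = 1            -- stand-in for YOUR_CONDITION, written once
    GROUP BY grp
),
one_line AS (
    SELECT AVG(field1_avg) AS avg_avg FROM several_lines
)
SELECT s.grp, s.field1_avg - o.avg_avg AS deviation   -- stand-in for func()
FROM several_lines AS s
CROSS JOIN one_line AS o
ORDER BY s.grp
"""
rows = conn.execute(query).fetchall()
```

Because `several_lines` is named once and referenced twice, the condition only has to appear in one place.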
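UPD: Sorry, I cannot post ready-made code right now (maybe tomorrow), but I have an idea:
You can concatenate the data you need to output from the subquery with GROUP_CONCAT and split it back apart in the outer query with the FIND_IN_SET and SUBSTRING_INDEX functions. The outer query then joins only YEAR_REF and the aggregated result.
The condition in the outer query becomes just WHERE FIND_IN_SET(year, concatenated_years).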
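The mechanics of that round trip can be sketched in Python. The helpers `find_in_set` and `substring_index` below are rough, hypothetical stand-ins for the MySQL functions of the same names (they ignore NULLs and other edge cases), and the sample strings are made up.

```python
def find_in_set(needle: str, csv: str) -> int:
    """Rough stand-in for MySQL FIND_IN_SET: 1-based position, 0 if absent."""
    items = csv.split(",")
    return items.index(needle) + 1 if needle in items else 0


def substring_index(text: str, delim: str, count: int) -> str:
    """Rough stand-in for MySQL SUBSTRING_INDEX with a non-empty delimiter."""
    parts = text.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything left of the count-th delimiter
    if count < 0:
        return delim.join(parts[count:])   # everything right of the count-th from the end
    return ""


# What the subquery would hand over: two lists in the same order.
grouped_years = "1900,1901,1902"
grouped_amounts = "1.5|2.5|3.5"

# Outer query: find the year's position, then cut out the matching amount.
pos = find_in_set("1901", grouped_years)
amount = substring_index(substring_index(grouped_amounts, "|", pos), "|", -1)
```

The trick relies on both lists being built in the same row order. In MySQL it is also worth keeping in mind that GROUP_CONCAT output is limited by group_concat_max_len (1024 by default), so long lists may be truncated.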
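UPD:
Here is a version that uses GROUP_CONCAT to pass the required data out to the outer JOIN.
My comments start with -- newtover:. By the way, 1) I do not think STRAIGHT_JOIN adds any benefit, and 2) COUNT(*) has a special meaning in MySQL and should be used when you want to count rows.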
SELECT STRAIGHT_JOIN
-- newtover: extract the corresponding amount back
SUBSTRING_INDEX(SUBSTRING_INDEX(GROUPED_AMOUNTS, '|', @pos),'|', -1) as AMOUNT,
Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
Y.YEAR as YEAR,
MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
ymxb.SLOPE,
ymxb.INTERCEPT,
ymxb.CORRELATION,
ymxb.MEASUREMENTS
FROM
-- newtover: list of tables now contains only the subquery, YEAR_REF for grouping and init_vars to define the variable
YEAR_REF Y,
(SELECT
SUM(MEASUREMENTS) as MEASUREMENTS,
((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,
((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
(sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,
((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
(stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION,
-- newtover: grouped fields for matching years and the corresponding amounts
GROUP_CONCAT(t.YEAR) as GROUPED_YEARS,
GROUP_CONCAT(AMOUNT SEPARATOR '|') as GROUPED_AMOUNTS
FROM (
SELECT STRAIGHT_JOIN
COUNT(1) as MEASUREMENTS,
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR as YEAR
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y,
MONTH_REF M,
DAILY D
WHERE
-- For a specific city ...
$X{ IN, C.ID, CityCode } AND
-- Find all the stations within a specific unit radius ...
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
SD.ID = S.STATION_DISTRICT_ID AND
-- Gather all known years for that station ...
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
M.CATEGORY_ID = $P{CategoryCode} AND
-- Into the valid daily climate data.
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
) t
) ymxb,
(SELECT @pos:=NULL) as init_vars
WHERE
-- newtover: check if the year is in the list and store the index into the variable
@pos:=CAST(FIND_IN_SET(Y.YEAR, GROUPED_YEARS) as UNSIGNED)
GROUP BY
Y.YEAR
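Answer 2 (score: 0)
Since the SQL in the question has been substantially reworked (it now shows only the relevant parts), here is my new answer.
Assumption: the conditions really are identical, and there is no tricky column aliasing between the subquery and the outer query.
Answer: You can remove the WHERE from the outer query.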
SELECT
/* aggregate data */
ymxb.*
FROM (
SELECT
/* similar aggregate data */
WHERE
/* some condition */
GROUP BY
YEAR
) ymxb
GROUP BY
YEAR
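This should give you the same results.
(Also note that you could instead remove the inner WHERE and keep the outer one; the results should be the same, but the performance may differ.)
Finally, repeating the WHERE clause probably does not affect performance much: evaluating extra conditions (even expressions such as sqrt) is very cheap compared to any I/O, and these conditions do not operate on any new columns, so all the I/O has already been done.
In addition, your inner and outer queries use the same GROUP BY, and the outer query takes all of its data from the subquery.
That makes any aggregate function in the outer query redundant (the rows coming from the subquery, which are the source for the outer query, are already grouped by year).
Which in turn makes the whole subselect redundant.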
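The redundancy claim is easy to check on a toy table. The sketch below uses Python's sqlite3 module; the table `daily` and its columns are hypothetical and only mimic the shape of the pseudocode above, not the full query from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily (year INTEGER, amount REAL);
INSERT INTO daily VALUES (1900, 1.0), (1900, 3.0), (1901, 5.0), (1901, 7.0);
""")

# Inner query: one row per year.
inner_sql = "SELECT year, AVG(amount) AS amount FROM daily GROUP BY year"

# Outer query: groups the already-grouped rows by the same key again.
outer_sql = (
    f"SELECT year, AVG(amount) AS amount FROM ({inner_sql}) AS t GROUP BY year"
)

inner_rows = conn.execute(inner_sql + " ORDER BY year").fetchall()
outer_rows = conn.execute(outer_sql + " ORDER BY year").fetchall()
```

Each outer group contains exactly one inner row, so the second aggregation changes nothing. This only holds when both levels group by the same key and the outer query reads nothing but the subquery.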
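Answer 3 (score: 0)
Could you use a temporary table in your case? Although it still requires you to write the WHERE clause twice, it should improve your performance considerably.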
DROP TEMPORARY TABLE IF EXISTS TEMP_DATA;
-- TEMP_DATA holds the single avg_avg row
CREATE TEMPORARY TABLE TEMP_DATA
(SELECT AVG(field1_avg) as avg_avg
FROM (SELECT a.group, AVG(field1) as field1_avg
FROM a
WHERE (SOME_CONDITION)
GROUP BY a.group) as several_lines
);
SELECT a.group, func(a.group, t.avg_avg)
FROM a, TEMP_DATA AS t
WHERE (SOME_CONDITION)
GROUP BY a.group;
Hope this helps! --Dubs
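A variation on the temporary-table idea, sketched here with Python's sqlite3 module rather than MySQL, stores the filtered and grouped rows themselves so that the condition is written only once. This is an assumption-laden illustration, not the answerer's exact approach: the table `a`, the columns `grp`, `field1` and `keep`, and the subtraction standing in for func() are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (grp TEXT, field1 REAL, keep INTEGER);
INSERT INTO a VALUES ('x', 1, 1), ('x', 3, 1), ('y', 10, 1), ('y', 99, 0);

-- The filter (keep = 1 stands in for SOME_CONDITION) appears only here.
CREATE TEMPORARY TABLE temp_data AS
    SELECT grp, AVG(field1) AS field1_avg
    FROM a
    WHERE keep = 1
    GROUP BY grp;
""")

# Both the per-group rows and the overall average come from temp_data.
rows = conn.execute("""
    SELECT t.grp, t.field1_avg - o.avg_avg AS deviation   -- stand-in for func()
    FROM temp_data AS t
    CROSS JOIN (SELECT AVG(field1_avg) AS avg_avg FROM temp_data) AS o
    ORDER BY t.grp
""").fetchall()
```

Two caveats apply to the original setting: MySQL documents restrictions on referring to a TEMPORARY table more than once in the same query, so there the overall average might need a second temporary table, and the question states that JasperReports does not allow multiple queries.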