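MySQL: Rewriting an MSSQL query that uses a correlated subquery in the FROM clause?

Time: 2009-11-20 15:49:21

Tags: mysql sql database distinct correlated-subquery

We have a table that records website page views, for example: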

time      | page_id
----------|-----------------------------
1256645862| pageA
1256645889| pageB
1256647199| pageA
1256647198| pageA
1256647300| pageB
1257863235| pageA
1257863236| pageC

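Our production table currently has about 40K rows. For each day, we want to generate the number of unique pages viewed in the previous 30, 60, and 90 days. In the result set we could then look up a day and see how many unique pages were accessed in the 60 days leading up to it.

We were able to do this in MSSQL with the following query: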

SELECT DISTINCT
 CONVERT(VARCHAR,P.NDATE,101) AS 'DATE', 
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE NDATE BETWEEN DATEADD(D,-29,P.NDATE) AND P.NDATE) AS SUB) AS '30D',
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE NDATE BETWEEN DATEADD(D,-59,P.NDATE) AND P.NDATE) AS SUB) AS '60D',
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE NDATE BETWEEN DATEADD(D,-89,P.NDATE) AND P.NDATE) AS SUB) AS '90D'
FROM PERFLOG P
ORDER BY 'DATE'

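Note: Because MSSQL has no FROM_UNIXTIME function, we added an NDATE column for testing, which is just the converted time. NDATE does not exist in the production table.

Converting this query to MySQL gives us an "Unknown column P.time" error: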

SELECT DISTINCT
 FROM_UNIXTIME(P.time,'%Y-%m-%d') AS 'DATE', 
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 30 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS SUB) AS '30D',
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 60 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS SUB) AS '60D',
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 90 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS SUB) AS '90D'
FROM PERFLOG P
ORDER BY 'DATE'

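I understand this is because a subquery in the FROM clause cannot be correlated, that is, it cannot reference a table from the outer FROM clause. Unfortunately, we are at a loss as to how to convert this query so that it works in MySQL. For now, we simply return all DISTINCT rows from the table and post-process them in PHP. That takes 2-3 seconds for 40K rows. I am worried about performance once we have hundreds of thousands of rows.

Can this be done in MySQL? If so, can we expect it to perform better than our PHP post-processing solution?

Update: Here is the query that creates the table: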

CREATE TABLE  `perflog` (
    `user_id` VARBINARY( 40 ) NOT NULL ,
    `elapsed` float UNSIGNED NOT NULL ,
    `page_id` VARCHAR( 255 ) NOT NULL ,
    `time` INT( 10 ) UNSIGNED NOT NULL ,
    `ip` VARBINARY( 40 ) NOT NULL ,
    `agent` VARCHAR( 255 ) NOT NULL ,
    PRIMARY KEY (  `user_id` ,  `page_id` ,  `time` ,  `ip`,  `agent` )
) ENGINE MyISAM

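So far, our production table has about 40K rows!

5 Answers:

Answer 0: (score: 1)

Note: I am writing this after reading the solutions from @astander, @Donnie, and @longneck.

I understand that performance is important, but why not store the aggregates? One row per day for ten years is only about 3650 rows, each with just a few columns.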

TABLE dimDate (DateKey int (PK), Year int, Day int, DayOfWeek varchar(10), DayInEpoch....)
TABLE AggVisits (DateKey int (PK,FK), Today int, Last30 int, Last60 int, Last90 int)

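This way you run the query only once, at the end of the day, and only for that one day. Pre-calculated aggregates are at the root of any high-performance analytics solution (cubes).

UPDATE:
You can speed up those queries by introducing another column, DayInEpoch int (the day number since 1990-01-01). You can then drop all of those date/time conversion functions.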

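Answer 1: (score: 0)

This is the PHP I am currently using to solve the problem. Ideally, I would like all of this to be done by MySQL (if it can be done faster). I am posting it only to clarify the task further: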

function getUniqueUsage($field = 'page_id', $since = 90){
    //we need to add 90 days onto our date range for the 90-day sum
    $sinceSeconds = mktime(0, 0, 0) - (($since + 90) * (60 * 60 * 24)); //midnight today minus the requested range
    //==> omitting mySQL connection details<==
    $sql = "SELECT DISTINCT From_unixtime(time,'%Y-%m-%d') AS date, $field FROM perflog WHERE time > $sinceSeconds ORDER BY date" ;
    $sql_results = mysql_query($sql);
    $results = array();
    //all page ids per date (ending-up with only unique date keys)
    while ($row = mysql_fetch_assoc($sql_results))
    {
        $results[$row['date']][] = $row[$field];
    }
    $sums = array();
    //initialize sum array, with only unique dates (days)
    foreach (array_keys($results) as $date){
        $sums[$date] = array(0,0,0);
    }
    //calculate the 30/60/90 day unique pages for each day
    foreach (array_keys($sums) as $ref_date){
        $merges30 = array();
        $merges60 = array();
        $merges90 = array();
        $ref_time = strtotime($ref_date);
        $ref_minus_30 = strtotime("-30 Days",$ref_time);
        $ref_minus_60 = strtotime("-60 Days",$ref_time);
        $ref_minus_90 = strtotime("-90 Days",$ref_time);
        foreach ($results as $result_date => $pages){
            $compare_time = strtotime($result_date);
            if ($compare_time >= $ref_minus_30 && $compare_time <= $ref_time){
                $merges30 = array_merge($merges30, $pages);
            }
            if ($compare_time >= $ref_minus_60 && $compare_time <= $ref_time){
                $merges60 = array_merge($merges60, $pages);
            }
            if ($compare_time >= $ref_minus_90 && $compare_time <= $ref_time){
                $merges90 = array_merge($merges90, $pages);
            }
        }
        $sums[$ref_date] = array(count(array_unique($merges30)),count(array_unique($merges60)),count(array_unique($merges90)));
    }
    //truncate to only specified number of days
    return array_slice($sums,-$since, $since, true);
}

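As you can see, that is a lot of unfortunate array merging and de-duplicating.

Answer 2: (score: 0)

Why do you have the subquery buried on a second level like that? Try this instead: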

SELECT DISTINCT
 FROM_UNIXTIME(P.time,'%Y-%m-%d') AS 'DATE', 
 -- each count is a scalar subquery in the select list, so it may reference P directly
 (SELECT COUNT(DISTINCT page_id) FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 30 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS '30D',
 (SELECT COUNT(DISTINCT page_id) FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 60 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS '60D',
 (SELECT COUNT(DISTINCT page_id) FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 90 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS '90D'
FROM perflog P
-- backticks here: a string-quoted 'DATE' in ORDER BY is a constant, not the alias
ORDER BY `DATE`

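Answer 3: (score: 0)

You could try using a single select.

Select only the rows between the date and the 90 days before it.

Then, in each field, use a CASE expression to check whether the date falls within the 30, 60, or 90 day period. Each field yields 1 if its case is true and 0 otherwise, and you sum them.

Something like this (pseudocode):
SELECT  SUM(CASE WHEN p.Date IN 30 PERIOD THEN 1 ELSE 0 END) Cnt30,
        SUM(CASE WHEN p.Date IN 60 PERIOD THEN 1 ELSE 0 END) Cnt60,
        SUM(CASE WHEN p.Date IN 90 PERIOD THEN 1 ELSE 0 END) Cnt90
FROM    Table p
WHERE p.Date IN 90 PERIOD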

Answer 4: (score: 0)

Change the subselects to joins, like so:

select
  FROM_UNIXTIME(p.time,'%Y-%m-%d') AS 'DATE',
  count(distinct p30.page_id) AS '30D',
  count(distinct p60.page_id) AS '60D',
  count(distinct p90.page_id) AS '90D'
from
  perflog p
  join perflog p30 on FROM_UNIXTIME(p30.time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(p.time,'%Y-%m-%d'), INTERVAL 30 DAY) AND FROM_UNIXTIME(p.time,'%Y-%m-%d')
  join perflog p60 on FROM_UNIXTIME(p60.time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(p.time,'%Y-%m-%d'), INTERVAL 60 DAY) AND FROM_UNIXTIME(p.time,'%Y-%m-%d')
  join perflog p90 on FROM_UNIXTIME(p90.time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(p.time,'%Y-%m-%d'), INTERVAL 90 DAY) AND FROM_UNIXTIME(p.time,'%Y-%m-%d')
-- one output row per day
group by
  FROM_UNIXTIME(p.time,'%Y-%m-%d')

However, this may run slowly, because all of those function calls defeat any index on the date column. A better solution might be:

-- a regular working table: MySQL cannot refer to a TEMPORARY table more than once in the same query
create table perf_tmp as
select
  FROM_UNIXTIME(time,'%Y-%m-%d') AS VIEWDATE,
  page_id
from
  perflog;

create index perf_dt on perf_tmp (VIEWDATE);

select
  p.VIEWDATE, 
  count(distinct p30.page_id) AS '30D',
  count(distinct p60.page_id) AS '60D',
  count(distinct p90.page_id) AS '90D'
from
  perf_tmp p
  join perf_tmp p30 on p30.VIEWDATE BETWEEN DATE_SUB(p.VIEWDATE, INTERVAL 30 DAY) AND p.VIEWDATE
  join perf_tmp p60 on p60.VIEWDATE BETWEEN DATE_SUB(p.VIEWDATE, INTERVAL 60 DAY) AND p.VIEWDATE
  join perf_tmp p90 on p90.VIEWDATE BETWEEN DATE_SUB(p.VIEWDATE, INTERVAL 90 DAY) AND p.VIEWDATE
group by
  p.VIEWDATE;