Advanced MySQL: Finding correlations between poll responses

Asked: 2011-03-15 13:08:49

Tags: mysql sql correlation

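I have four MySQL tables:

users(id, name)
polls(id, text)
options(id, poll_id, text)
responses(id, poll_id, option_id, user_id)

Given a particular poll and a particular option, I'd like to generate a table that shows which options from other polls are most strongly correlated with it.

Suppose this is our data set: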

TABLE users:
+------+-------+
| id   | name  |
+------+-------+
|    1 | Abe   |
|    2 | Bob   |
|    3 | Che   |
|    4 | Den   |
+------+-------+

TABLE polls:
+------+-----------------------+
| id   | text                  |
+------+-----------------------+
|    1 | Do you like apples?   |
|    2 | What is your gender?  |
|    3 | What is your height?  |
|    4 | Do you like polls?    |
+------+-----------------------+

TABLE options:

+------+----------+---------+
| id   | poll_id  | text    |
+------+----------+---------+
|    1 | 1        | Yes     |
|    2 | 1        | No      |
|    3 | 2        | Male    |
|    4 | 2        | Female  |
|    5 | 3        | Short   |
|    6 | 3        | Tall    |
|    7 | 4        | Yes     |
|    8 | 4        | No      |
+------+----------+---------+

TABLE responses:

+------+----------+------------+----------+
| id   | poll_id  | option_id  | user_id  |
+------+----------+------------+----------+
|    1 | 1        | 1          | 1        |
|    2 | 1        | 2          | 2        |
|    3 | 1        | 2          | 3        |
|    4 | 1        | 2          | 4        |
|    5 | 2        | 3          | 1        |
|    6 | 2        | 3          | 2        |
|    7 | 2        | 3          | 3        |
|    8 | 2        | 4          | 4        |
|    9 | 3        | 5          | 1        |
|   10 | 3        | 6          | 2        |
|   11 | 3        | 5          | 3        |
|   12 | 3        | 6          | 4        |
|   13 | 4        | 7          | 1        |
|   14 | 4        | 7          | 2        |
|   15 | 4        | 7          | 3        |
|   16 | 4        | 7          | 4        |
+------+----------+------------+----------+

Given poll ID 1 and option ID 2, the generated table should look like this:

+----------+------------+-----------------------+
| poll_id  | option_id  | percent_correlated    |
+----------+------------+-----------------------+
| 4        | 7          | 100                   |
| 2        | 3          | 66.66                 |
| 3        | 6          | 66.66                 |
| 2        | 4          | 33.33                 |
| 3        | 5          | 33.33                 |
| 4        | 8          | 0                     |
+----------+------------+-----------------------+

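Basically, we identify all the users who responded to poll ID 1 and selected option ID 2, then look across all the other polls to see what percentage of those users also selected each of the other options.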
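The arithmetic behind the expected table can be checked by hand with a small script. This is only a sketch in Python rather than SQL: the names `OPTION_POLL`, `RESPONSES`, and `correlated_options` are made up for illustration, and the denominator is assumed to be the number of matching users who answered each poll.

```python
from collections import Counter

# Sample data copied from the tables above.
# option_id -> poll_id
OPTION_POLL = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4}
# (poll_id, option_id, user_id)
RESPONSES = [
    (1, 1, 1), (1, 2, 2), (1, 2, 3), (1, 2, 4),
    (2, 3, 1), (2, 3, 2), (2, 3, 3), (2, 4, 4),
    (3, 5, 1), (3, 6, 2), (3, 5, 3), (3, 6, 4),
    (4, 7, 1), (4, 7, 2), (4, 7, 3), (4, 7, 4),
]


def correlated_options(poll_id, option_id):
    """Return (poll_id, option_id, percent) rows for every other poll's options."""
    # Users who picked the given option in the given poll.
    selecting = {u for p, o, u in RESPONSES if p == poll_id and o == option_id}

    per_option = Counter()  # how many selecting users picked each option
    per_poll = Counter()    # how many selecting users answered each poll
    for p, o, u in RESPONSES:
        if u in selecting and p != poll_id:
            per_option[o] += 1
            per_poll[p] += 1

    rows = []
    for o, p in OPTION_POLL.items():
        if p == poll_id or per_poll[p] == 0:
            continue  # skip the source poll and polls nobody in the group answered
        rows.append((p, o, round(100.0 * per_option[o] / per_poll[p], 2)))

    # Highest percentage first; ties broken by poll_id, then option_id.
    rows.sort(key=lambda r: (-r[2], r[0], r[1]))
    return rows


for row in correlated_options(1, 2):
    print(row)
```

For poll 1, option 2 the matching users are 2, 3, and 4, so every denominator is 3. The rows come out in the same order as the expected table; the only difference is that `round` gives 66.67 where the table shows the truncated 66.66.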

3 Answers:

Answer 0: (score: 1)

I don't have an instance handy to test on, so can you check whether this gives the correct result:

select
        poll_id,
        option_id,
        ((psum - (sum1 * sum2 / n)) / sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n))) AS r,
        n
from
(
    select 
        poll_id,
        option_id,
        SUM(score) AS sum1,
        SUM(score_rev) AS sum2,
        SUM(score * score) AS sum1sq,
        SUM(score_rev * score_rev) AS sum2sq,
        SUM(score * score_rev) AS psum,
        COUNT(*) AS n
    from
    (
            select 
                responses.poll_id, 
                responses.option_id,
                -- score = 1 when the user also chose poll 1 / option 2
                CASE 
                    WHEN user_resp.user_id IS NULL THEN 0
                    ELSE 1
                END AS score,
                -- score_rev is the complement of score
                CASE 
                    WHEN user_resp.user_id IS NULL THEN 1
                    ELSE 0
                END AS score_rev
            from responses left outer join 
                    (
                        select 
                            user_id
                        from 
                            responses 
                        where
                            poll_id = 1 and 
                            option_id = 2
                    )user_resp  
                        ON (user_resp.user_id = responses.user_id)
    ) temp1 
    group by
        poll_id,
        option_id
)components 

Answer 1: (score: 1)

After a few hours of trial and error, I managed to put together a working query:

SELECT poll_id AS p_id, 
       option_id AS o_id, 
       COUNT(*) AS optCount, 

       (SELECT COUNT(*) FROM response WHERE option_id = o_id AND user_id IN 
          (SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2')) /
       (SELECT COUNT(*) FROM response WHERE poll_id = p_id  AND user_id IN 
          (SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2')) 
       AS percentage 

FROM response 
INNER JOIN 
   (SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2') AS user_ids
ON response.user_id = user_ids.user_id
WHERE poll_id != '1' 

GROUP BY option_id DESC 
ORDER BY percentage DESC, optCount DESC

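Based on tests with a small data set, this query seems reasonably fast, but I'd like to modify it so that the "IN" subquery isn't repeated three times. Any suggestions?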
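One possible way to name the repeated subquery only once is a common table expression. The sketch below is an assumption-laden illustration, not a tested MySQL answer: it runs the SQL through Python's built-in `sqlite3` module purely so it can be executed without a server, it reuses the `response` table name from the query above, and the CTE names `selecting` and `picked` are invented. `WITH` needs MySQL 8.0 or later, so it would not have worked on the MySQL versions available when the question was asked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE response (poll_id INTEGER, option_id INTEGER, user_id INTEGER);
    INSERT INTO response (poll_id, option_id, user_id) VALUES
        (1,1,1),(1,2,2),(1,2,3),(1,2,4),
        (2,3,1),(2,3,2),(2,3,3),(2,4,4),
        (3,5,1),(3,6,2),(3,5,3),(3,6,4),
        (4,7,1),(4,7,2),(4,7,3),(4,7,4);
""")

QUERY = """
WITH selecting AS (
    -- the user list that was previously repeated in every IN (...) clause
    SELECT user_id FROM response WHERE poll_id = 1 AND option_id = 2
),
picked AS (
    -- everything those users answered in the other polls
    SELECT r.poll_id, r.option_id
    FROM response r
    JOIN selecting s ON s.user_id = r.user_id
    WHERE r.poll_id <> 1
)
SELECT p.poll_id,
       p.option_id,
       COUNT(*) AS optCount,
       100.0 * COUNT(*) / t.poll_total AS percentage
FROM picked p
JOIN (SELECT poll_id, COUNT(*) AS poll_total
      FROM picked
      GROUP BY poll_id) t ON t.poll_id = p.poll_id
GROUP BY p.poll_id, p.option_id, t.poll_total
ORDER BY percentage DESC, optCount DESC, p.poll_id, p.option_id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Like the query above, this only returns options that at least one of the matching users picked, so an option with no votes from that group (option 8 in the sample data) does not appear. Here `percentage` is scaled to 0-100 rather than left as a fraction.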

Answer 2: (score: 1)

This seems to give me the correct results:

select poll_stats.poll_id,
       option_stats.option_id,
       (100 * option_responses / poll_responses) as percent_correlated
from (select response.poll_id,
             count(*) as poll_responses
      from response selecting_response
           join response on response.user_id = selecting_response.user_id
      where selecting_response.poll_id = 1 and selecting_response.option_id = 2
      group by response.poll_id) poll_stats
      join (select options.poll_id,
                   options.id as option_id,
                   count(response.id) as option_responses
            from options
                 left join response on response.poll_id = options.poll_id
                           and response.option_id = options.id
                           and exists (
                            select 1 from response selecting_response
                            where selecting_response.user_id = response.user_id
                                  and selecting_response.poll_id = 1
                                  and selecting_response.option_id = 2)
            group by options.poll_id, options.id
           ) as option_stats
       on option_stats.poll_id = poll_stats.poll_id
where poll_stats.poll_id <> 1
order by 3 desc, option_responses desc