Question

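I have four MySQL tables:

users (id, name)
polls (id, text)
options (id, poll_id, text)
responses (id, poll_id, option_id, user_id)

Given a particular poll and a particular option, I'd like to generate a table showing which options from other polls are most strongly correlated with it.

Suppose this is our data set: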

TABLE users:
+------+-------+
| id   | name  |
+------+-------+
|    1 | Abe   |
|    2 | Bob   |
|    3 | Che   |
|    4 | Den   |
+------+-------+

TABLE polls:
+------+-----------------------+
| id   | text                  |
+------+-----------------------+
|    1 | Do you like apples?   |
|    2 | What is your gender?  |
|    3 | What is your height?  |
|    4 | Do you like polls?    |
+------+-----------------------+

TABLE options:

+------+----------+---------+
| id   | poll_id  | text    |
+------+----------+---------+
|    1 | 1        | Yes     |
|    2 | 1        | No      |
|    3 | 2        | Male    |
|    4 | 2        | Female  |
|    5 | 3        | Short   |
|    6 | 3        | Tall    |
|    7 | 4        | Yes     |
|    8 | 4        | No      |
+------+----------+---------+

TABLE responses:

+------+----------+------------+----------+
| id   | poll_id  | option_id  | user_id  |
+------+----------+------------+----------+
|    1 | 1        | 1          | 1        |
|    2 | 1        | 2          | 2        |
|    3 | 1        | 2          | 3        |
|    4 | 1        | 2          | 4        |
|    5 | 2        | 3          | 1        |
|    6 | 2        | 3          | 2        |
|    7 | 2        | 3          | 3        |
|    8 | 2        | 4          | 4        |
|    9 | 3        | 5          | 1        |
|   10 | 3        | 6          | 2        |
|   11 | 3        | 5          | 3        |
|   12 | 3        | 6          | 4        |
|   13 | 4        | 7          | 1        |
|   14 | 4        | 7          | 2        |
|   15 | 4        | 7          | 3        |
|   16 | 4        | 7          | 4        |
+------+----------+------------+----------+

Given poll ID 1 and option ID 2, the generated table should look like this:

+----------+------------+-----------------------+
| poll_id  | option_id  | percent_correlated    |
+----------+------------+-----------------------+
| 4        | 7          | 100                   |
| 2        | 3          | 66.66                 |
| 3        | 6          | 66.66                 |
| 2        | 4          | 33.33                 |
| 3        | 5          | 33.33                 |
| 4        | 8          | 0                     |
+----------+------------+-----------------------+

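Basically, we identify all the users who responded to poll ID 1 and selected option ID 2, and then look through all the other polls to see what percentage of those users also selected each of the other options.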
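To make the intended calculation concrete, here is a minimal Python sketch that reproduces the expected table from the sample rows. It is only an illustration of the arithmetic, not a replacement for the SQL; the function name `percent_correlated` and the in-memory lists are assumptions made for this example, and the tie-breaking order (by poll ID, then option ID) is likewise assumed.

```python
# Rows copied from the sample data: (poll_id, option_id, user_id).
RESPONSES = [
    (1, 1, 1), (1, 2, 2), (1, 2, 3), (1, 2, 4),
    (2, 3, 1), (2, 3, 2), (2, 3, 3), (2, 4, 4),
    (3, 5, 1), (3, 6, 2), (3, 5, 3), (3, 6, 4),
    (4, 7, 1), (4, 7, 2), (4, 7, 3), (4, 7, 4),
]
# (option_id, poll_id) pairs from the options table.
OPTIONS = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 3), (6, 3), (7, 4), (8, 4)]


def percent_correlated(poll_id, option_id):
    """Hypothetical helper: rank options of other polls by shared respondents."""
    # Users who chose the given option in the given poll.
    picked = {u for p, o, u in RESPONSES if p == poll_id and o == option_id}

    option_hits = {}  # (poll_id, option_id) -> responses from picked users
    poll_totals = {}  # poll_id -> responses from picked users
    for p, o, u in RESPONSES:
        if p != poll_id and u in picked:
            option_hits[(p, o)] = option_hits.get((p, o), 0) + 1
            poll_totals[p] = poll_totals.get(p, 0) + 1

    rows = []
    for o, p in OPTIONS:
        if p == poll_id:
            continue
        total = poll_totals.get(p, 0)
        pct = 100.0 * option_hits.get((p, o), 0) / total if total else 0.0
        rows.append((p, o, pct))

    # Highest percentage first; ties broken by poll ID, then option ID.
    rows.sort(key=lambda r: (-r[2], r[0], r[1]))
    return rows


for row in percent_correlated(1, 2):
    print(row)
```

Options that none of the selected users chose (option 8 here) are kept with a percentage of 0, which is why the loop walks the options list rather than only the responses.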

Answer 1

I don't have an instance handy to test this on; can you check whether it gives the right results?

select
        poll_id,
        option_id,
        ((psum - (sum1 * sum2 / n)) / sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n))) AS r,
        n
from
(
    select 
        poll_id,
        option_id,
        SUM(score) AS sum1,
        SUM(score_rev) AS sum2,
        SUM(score * score) AS sum1sq,
        SUM(score_rev * score_rev) AS sum2sq,
        SUM(score * score_rev) AS psum,
        COUNT(*) AS n
    from
    (
            select 
                responses.poll_id, 
                responses.option_id,
                CASE 
                    WHEN user_resp.user_id IS NULL THEN 0
                    ELSE 1
                END AS score,
                CASE 
                    WHEN user_resp.user_id IS NULL THEN 1
                    ELSE 0
                END AS score_rev
            from responses left outer join 
                    (
                        select 
                            user_id
                        from 
                            responses 
                        where
                            poll_id = 1 and 
                            option_id = 2
                    )user_resp  
                        ON (user_resp.user_id = responses.user_id)
    ) temp1 
    group by
        poll_id,
        option_id
)components
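The outer query applies the sum-based form of Pearson's correlation coefficient. As a rough way to sanity-check it without a database, the same formula can be sketched in Python; the function name `pearson_from_sums` is made up for this example, and returning `None` for a zero denominator is meant to mirror MySQL returning NULL on division by zero.

```python
import math


def pearson_from_sums(xs, ys):
    """Pearson's r from running sums, mirroring the columns of the SQL query."""
    n = len(xs)
    sum1, sum2 = sum(xs), sum(ys)
    sum1sq = sum(x * x for x in xs)
    sum2sq = sum(y * y for y in ys)
    psum = sum(x * y for x, y in zip(xs, ys))

    spread = (sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n)
    if spread <= 0:
        return None  # one of the series is constant, so r is undefined
    return (psum - sum1 * sum2 / n) / math.sqrt(spread)


# score / score_rev for a group where one of three users is not in user_resp
print(pearson_from_sums([0, 1, 1], [1, 0, 0]))
# a group where every user is in user_resp
print(pearson_from_sums([1, 1, 1], [0, 0, 0]))
```

One thing this makes visible: because `score_rev` is always the complement of `score`, the two series are perfectly anti-correlated whenever a group contains both values, so `r` comes out as -1 or NULL rather than something comparable to the `percent_correlated` column in the question. It may be worth comparing the query's output against the expected table before relying on it.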

Answer 2

After a few hours of trial and error, I managed to put together a query that works:

SELECT poll_id AS p_id,
       option_id AS o_id,
       COUNT(*) AS optCount,

       (SELECT COUNT(*) FROM responses WHERE option_id = o_id AND user_id IN
          (SELECT user_id FROM responses WHERE poll_id = 1 AND option_id = 2)) /
       (SELECT COUNT(*) FROM responses WHERE poll_id = p_id AND user_id IN
          (SELECT user_id FROM responses WHERE poll_id = 1 AND option_id = 2))
       AS percentage

FROM responses
INNER JOIN
   (SELECT user_id FROM responses WHERE poll_id = 1 AND option_id = 2) AS user_ids
ON responses.user_id = user_ids.user_id
WHERE poll_id != 1

GROUP BY poll_id, option_id
ORDER BY percentage DESC, optCount DESC

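Based on tests with a small data set, this query seems reasonably fast, but I'd like to modify it so that the "IN" subquery isn't repeated three times. Any suggestions?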
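One possible way to state the user filter only once is a common table expression. The sketch below is an assumption-laden illustration rather than a tested MySQL answer: it needs MySQL 8.0 or later for `WITH`, it uses the `responses` and `options` table names from the question, the CTE names `picked`, `matched` and `totals` are invented here, and it is run against an in-memory SQLite database from Python purely so the result can be checked against the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE options (id INTEGER PRIMARY KEY, poll_id INTEGER, text TEXT);
CREATE TABLE responses (id INTEGER PRIMARY KEY, poll_id INTEGER,
                        option_id INTEGER, user_id INTEGER);
INSERT INTO options VALUES
  (1,1,'Yes'),(2,1,'No'),(3,2,'Male'),(4,2,'Female'),
  (5,3,'Short'),(6,3,'Tall'),(7,4,'Yes'),(8,4,'No');
INSERT INTO responses (poll_id, option_id, user_id) VALUES
  (1,1,1),(1,2,2),(1,2,3),(1,2,4),
  (2,3,1),(2,3,2),(2,3,3),(2,4,4),
  (3,5,1),(3,6,2),(3,5,3),(3,6,4),
  (4,7,1),(4,7,2),(4,7,3),(4,7,4);
""")

QUERY = """
WITH picked AS (            -- users who chose the given option (stated once)
    SELECT user_id FROM responses
    WHERE poll_id = :poll AND option_id = :opt
),
matched AS (                -- their responses to every other poll
    SELECT r.poll_id, r.option_id
    FROM responses r
    JOIN picked p ON p.user_id = r.user_id
    WHERE r.poll_id <> :poll
),
totals AS (                 -- per-poll denominator
    SELECT poll_id, COUNT(*) AS n FROM matched GROUP BY poll_id
)
SELECT o.poll_id,
       o.id AS option_id,
       100.0 * COUNT(m.option_id) / t.n AS percent_correlated
FROM options o
JOIN totals t ON t.poll_id = o.poll_id
LEFT JOIN matched m ON m.poll_id = o.poll_id AND m.option_id = o.id
GROUP BY o.poll_id, o.id, t.n
ORDER BY percent_correlated DESC, o.poll_id, o.id
"""

rows = conn.execute(QUERY, {"poll": 1, "opt": 2}).fetchall()
for row in rows:
    print(row)
```

Starting from `options` and left-joining the matched responses keeps options nobody picked (such as option 8) in the output with a value of 0. On MySQL versions without CTE support, the `picked` subquery would have to be written as a derived table instead, which brings some of the repetition back.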

Answer 3

This seems to give me the correct results:

select poll_stats.poll_id,
       option_stats.option_id,
       (100 * option_responses / poll_responses) as percent_correlated
from (select responses.poll_id,
             count(*) as poll_responses
      from responses selecting_response
           join responses on responses.user_id = selecting_response.user_id
      where selecting_response.poll_id = 1 and selecting_response.option_id = 2
      group by responses.poll_id) poll_stats
      join (select options.poll_id,
                   options.id as option_id,
                   count(responses.id) as option_responses
            from options
                 left join responses on responses.poll_id = options.poll_id
                           and responses.option_id = options.id
                           and exists (
                            select 1 from responses selecting_response
                            where selecting_response.user_id = responses.user_id
                                  and selecting_response.poll_id = 1
                                  and selecting_response.option_id = 2)
            group by options.poll_id, options.id
           ) as option_stats
       on option_stats.poll_id = poll_stats.poll_id
where poll_stats.poll_id <> 1
order by 3 desc, option_responses desc

Advanced MySQL: Finding correlations between poll responses
