Question

I have the following data frame, which stores each student's attempts at each question, where 1 represents a correct attempt and 0 an incorrect one:

structure(list(X1 = c(1, 1), X2 = c(0, 0), X3 = c(1, 1), X4 = c(1, 
0), X5 = c(1, 1), X6 = c(1, 1), X7 = c(1, 1), X8 = c(0, 0), X9 = c(0, 
0), X10 = c(1, 1), X11 = c(1, 1), X12 = c(0, 0), X13 = c(0, 1
), X14 = c(0, 0), X15 = c(0, 0), X16 = c(1, 1), X17 = c(1, 1), 
X18 = c(0, 0), X19 = c(1, 1), X20 = c(0, 0), X21 = c(1, 1
), X22 = c(1, 1), X23 = c(1, 1), X24 = c(1, 1), X25 = c(1, 
1), X26 = c(1, 1), X27 = c(1, 1), X28 = c(0, 0), X29 = c(1, 
1), X30 = c(1, 1), X31 = c(1, 1), X32 = c(0, 0), X33 = c(1, 
1)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))

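I am interested in this question: "Given that a student answered question 1 incorrectly, what is the probability that he also answered question 2 incorrectly?" Or, more generally, what is the probability that he also answered question i incorrectly?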

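Ideally these conditional probabilities would be laid out in a matrix whose (i, j) entry is the probability that he answered question j incorrectly given that he answered question i incorrectly.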

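My basic idea for implementing this (for the i-th question) is:
1. Subset all rows whose i-th entry is 0.
2. Compute the proportion of 0s for each question j in that submatrix.
3. Return the result as a vector.
4. Repeat steps 1-3 for every i and rbind the vectors into a matrix.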
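The four steps above can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: `cond_wrong_loop` and the `toy` data are hypothetical names made up for the example, and the input is assumed to be a 0/1 data frame or matrix with one row per student.

```r
# Hypothetical helper (not from the original post).
# `scores`: 0/1 data frame or matrix, one row per student, 1 = correct.
cond_wrong_loop <- function(scores) {
  wrong <- as.matrix(scores) == 0              # TRUE where the answer was incorrect
  rows <- lapply(seq_len(ncol(wrong)), function(i) {
    sub <- wrong[wrong[, i], , drop = FALSE]   # step 1: students who missed question i
    colMeans(sub)                              # steps 2-3: share of misses per question j
  })
  out <- do.call(rbind, rows)                  # step 4: stack the vectors row by row
  dimnames(out) <- list(given = colnames(wrong), also = colnames(wrong))
  out
}

# Made-up toy data: 4 students, 3 questions.
toy <- data.frame(Q1 = c(0, 0, 1, 0), Q2 = c(0, 1, 1, 0), Q3 = c(1, 1, 0, 0))
res <- cond_wrong_loop(toy)
print(res)
```

Row i of the result holds the probabilities conditional on question i being wrong, so the diagonal is always 1. A question that nobody missed yields a row of NaN, because the subset in step 1 is empty.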

But is there a faster way to get what I want?

Answer 1

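Your algorithm makes sense; I don't see a better approach. Here is an implementation using the dplyr package, which simplifies the checkit function.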

library(dplyr)  # provides %>% and filter()

set.seed(34342)
# simulate some data: 100 students across 33 questions
x <- data.frame(matrix(sample(c(0, 1), 3300, replace = TRUE), nrow = 100))
# invert x so that 1 marks an incorrect answer; simple sums then count misses
x <- 1 - x
checkit <- function(x, n) {
    # keep students who got question n wrong, then compute the share of
    # those students who got each question wrong
    x %>% filter(.[[n]] == 1) %>% colMeans()
}
# set up destination matrix (one row and one column per question)
probs <- matrix(0, nrow = ncol(x), ncol = ncol(x))
# fill it row by row: row i is conditional on question i being wrong
for (i in seq_len(ncol(x))) {
    probs[i, ] <- checkit(x, i)
}

With 10,000 simulated students, this takes an average of 157 ms on a MacBookAir6,2 (mid-2013).
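For comparison, the same matrix can also be written as a single cross-product, which avoids the explicit loop. This is a sketch only and has not been timed against the code above; `cond_wrong_matrix` and the `toy` data are hypothetical names, and the input is assumed to use the original coding (1 = correct, 0 = incorrect).

```r
# Hypothetical helper (not part of the answer above).
# `scores`: 0/1 data frame or matrix, one row per student, 1 = correct.
cond_wrong_matrix <- function(scores) {
  wrong <- 1 - as.matrix(scores)   # 1 = incorrect, 0 = correct
  both <- crossprod(wrong)         # both[i, j]: students who missed both i and j
  # diag(both)[i] is the number who missed i; recycling down the columns
  # divides every entry in row i by that count
  both / diag(both)
}

# Made-up toy data: 4 students, 3 questions.
toy <- data.frame(Q1 = c(0, 0, 1, 0), Q2 = c(0, 1, 1, 0), Q3 = c(1, 1, 0, 0))
p <- cond_wrong_matrix(toy)
print(p)
```

Entry (i, j) is the count of students who missed both questions divided by the count who missed question i, which is the same ratio the loop computes one row at a time.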

Analyzing students' test results using conditional probability in R