Question

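I'm looking for a faster way to compute pairwise accuracy: compare every pair of elements in the same array (here, a pandas DataFrame column), compute the difference between them, and then compare the two resulting differences. I have a DataFrame df with three columns: id (the document id), Judgement (the human assessment, an int), and PR_Score (the document's PageRank, a float). I want to check whether the two scores agree on which document of each pair is ranked better or worse.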

For example:

id: id1, id2, id3

Judgement: 1, 0, 0

PR_Score: 0.18, 0.5, 0.12

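In this case the two scores agree that id1 ranks better than id3, they disagree on id1 versus id2, and the human judgement is tied between id2 and id3 (so that pair is skipped). My pairwise accuracy is therefore:

agreement = 1

disagreement = 1

pairwise accuracy = agree / (agree + disagree) = 1/2 = 0.5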
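To make the arithmetic concrete, here is a minimal sketch that reproduces the example by brute force. The array names judgement and pr_score are placeholders for the two DataFrame columns, and skipping ties in either score is an assumption taken from the description above.

```python
import numpy as np

# Toy data from the example above (placeholder names for the two columns)
judgement = np.array([1, 0, 0])
pr_score = np.array([0.18, 0.5, 0.12])

agree = 0
disagree = 0
n = len(judgement)
for i in range(n - 1):
    for j in range(i + 1, n):
        h = np.sign(judgement[i] - judgement[j])  # sign of the human difference
        p = np.sign(pr_score[i] - pr_score[j])    # sign of the PageRank difference
        if h == 0 or p == 0:
            continue  # a tie in either score is skipped
        if h == p:
            agree += 1
        else:
            disagree += 1

pairwise_accuracy = agree / (agree + disagree)
print(agree, disagree, pairwise_accuracy)
```

The pair (id1, id3) agrees, (id1, id2) disagrees, and (id2, id3) is dropped because the human judgement is tied.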

Here is the code for my first solution, in which I use the df columns as arrays (which helps reduce computation time):

import numpy as np

def pairwise(agree, disagree):
    return(agree/(agree+disagree))

def pairwise_computing_array(df):

    humanScores = np.array(df['Judgement'])  
    pagerankScores =  np.array(df['PR_Score']) 

    total = 0 
    agree = 0
    disagree = 0

    for i in range(len(df)-1):  
        for j in range(i+1, len(df)):
            total += 1
            human = humanScores[i] -  humanScores[j] #difference human judg
            if human != 0:
                pr = pagerankScores[i] -  pagerankScores[j]#difference pagerank score
                if pr != 0:
                    if np.sign(human) == np.sign(pr):  
                        agree += 1 #they agree in which of the two is better
                    else:
                        disagree +=1 #they do not agree in which of the two is better
                else:
                    continue
            else:
                continue

    pairwise_accuracy = pairwise(agree, disagree)

    return(agree, disagree, total,  pairwise_accuracy)

I tried a list comprehension to speed things up, but it is actually slower than the first solution:

def pairwise_computing_list_comprehension(df):

    humanScores = np.array(df['Judgement'])  
    pagerankScores =  np.array(df['PR_Score']) 

    sign = [np.sign(pagerankScores[i] - pagerankScores[j]) == np.sign(humanScores[i] - humanScores[j] ) 
            for i in range(len(df)) for j in range(i+1, len(df)) 
                if (np.sign(pagerankScores[i] - pagerankScores[j]) != 0 
                    and np.sign(humanScores[i] - humanScores[j])!=0)]

    agreement = sum(sign)
    disagreement = len(sign) -  agreement                             
    pairwise_accuracy = pairwise(agreement, disagreement)

    return(agreement, disagreement, pairwise_accuracy)

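I can't run this on the whole dataset because it takes too long, so I'm hoping for something that computes in under 1 minute.

On my machine, a small subset of 1000 rows gives the following timings:

code1: 1.57 s ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

code2: 3.51 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)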

Answer 1

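You already have numpy arrays, so why not just use them? That lets you offload the work from Python to compiled C code (usually, though not always):

First, reshape the vectors into 1xN matrices: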

humanScores = np.array(df['Judgement']).reshape((1,-1))
pagerankScores = np.array(df['PR_Score']).reshape((1,-1))

Then compute the differences; we are only interested in the sign:

humanDiff = (humanScores - humanScores.T).clip(-1,1)
pagerankDiff = (pagerankScores - pagerankScores.T).clip(-1,1)

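Here I assume the data are integers, so clip only produces -1, 0, or 1; for float data such as PR_Score, np.sign would be needed instead. You can then count: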

agree = ((humanDiff != 0) & (pagerankDiff != 0) & (humanDiff == pagerankDiff)).sum()
disagree = ((humanDiff != 0) & (pagerankDiff != 0) & (humanDiff != pagerankDiff)).sum()

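But the counts above are double-counted, because entries (i, j) and (j, i) have exactly opposite signs in both humanDiff and pagerankDiff. You can instead sum over only the upper triangular part of the square matrix: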

agree = ((np.triu(humanDiff) != 0) &
         (np.triu(pagerankDiff) != 0) &
         (np.triu(humanDiff) == np.triu(pagerankDiff))
        ).sum()
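Putting these steps together, a self-contained sketch might look like the following. The function name pairwise_accuracy_vectorized is hypothetical, np.sign is used in place of clip so that float PageRank scores are handled, and returning NaN when no pair is comparable is an assumption rather than something stated in the question.

```python
import numpy as np

def pairwise_accuracy_vectorized(human_scores, pagerank_scores):
    """Count agreeing/disagreeing pairs using N x N sign matrices."""
    h = np.asarray(human_scores).reshape(1, -1)
    p = np.asarray(pagerank_scores).reshape(1, -1)

    # Entry [i, j] holds sign(x[j] - x[i]); np.sign works for ints and floats
    h_sign = np.sign(h - h.T)
    p_sign = np.sign(p - p.T)

    # Strict upper triangle (i < j) so each pair is counted exactly once
    upper = np.triu(np.ones(h_sign.shape, dtype=bool), k=1)
    valid = upper & (h_sign != 0) & (p_sign != 0)

    agree = int((valid & (h_sign == p_sign)).sum())
    disagree = int((valid & (h_sign != p_sign)).sum())
    total = agree + disagree
    accuracy = agree / total if total else float("nan")
    return agree, disagree, accuracy

print(pairwise_accuracy_vectorized([1, 0, 0], [0.18, 0.5, 0.12]))
```

Note that this builds full N x N arrays. For the 58,000 rows mentioned in the question that is roughly 3.4 billion entries per matrix, so it will probably not fit in memory without processing the rows in chunks.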

Answer 2

Here is code that runs in a reasonable time, thanks to the suggestion from @juanpa.arrivillaga:

import numpy as np
from numba import jit

@jit(nopython = True)
def pairwise_computing(humanScores, pagerankScores):

    total = 0 
    agree = 0
    disagree = 0

    for i in range(len(humanScores)-1):  
        for j in range(i+1, len(humanScores)):
            total += 1
            human = humanScores[i] -  humanScores[j] #difference human judg
            if human != 0:
                pr = pagerankScores[i] -  pagerankScores[j]#difference pagerank score
                if pr != 0:
                    if np.sign(human) == np.sign(pr):  
                        agree += 1 #they agree in which of the two is better
                    else:
                        disagree +=1 #they do not agree in which of the two is better
                else:
                    continue   
            else:
                continue
    pairwise_accuracy = agree/(agree+disagree)
    return(agree, disagree, total,  pairwise_accuracy)

This is the performance on my whole dataset (58,000 rows):

7.98 s ± 2.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
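For reference, the jitted function takes plain numpy arrays rather than the DataFrame itself. The sketch below is a trimmed variant under a few assumptions: count_pairs is a hypothetical name, the accuracy is computed outside the compiled function so that an empty denominator can be guarded, and a no-op fallback decorator is defined so the example still runs (slowly) if numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without numba, just without the speed-up
    def njit(func):
        return func

@njit
def count_pairs(human, pagerank):
    agree = 0
    disagree = 0
    n = len(human)
    for i in range(n - 1):
        for j in range(i + 1, n):
            h = human[i] - human[j]
            p = pagerank[i] - pagerank[j]
            if h != 0 and p != 0:
                # Same sign on both differences means the two scores agree
                if (h > 0) == (p > 0):
                    agree += 1
                else:
                    disagree += 1
    return agree, disagree

# With a DataFrame, these would be df['Judgement'].to_numpy() and df['PR_Score'].to_numpy()
human = np.array([1, 0, 0])
pagerank = np.array([0.18, 0.5, 0.12])

agree, disagree = count_pairs(human, pagerank)
accuracy = agree / (agree + disagree) if (agree + disagree) else float("nan")
print(agree, disagree, accuracy)
```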

Answer 3

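By taking advantage of broadcasting, you can get rid of the inner for loop, since index j always runs ahead of index i (i.e. we never look back). However, there is a small issue with computing the agreement/disagreement in the following line: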

if np.sign(human) == np.sign(pr):

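I don't know how to resolve it. Since you understand the problem better, I'm only providing skeleton code here for you to adjust further and get working. Here it is: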

def pairwise_computing_array(df):

    humanScores = df['Judgement'].values
    pagerankScores = df['PR_Score'].values 

    total = 0 
    agree = 0
    disagree = 0

    for i in range(len(df)-1):
        j = i+1
        human = humanScores[i] -  humanScores[j:]   #difference human judg
        human_mask = human != 0
        if np.sum(human_mask) > 0:  # check for at least one positive case
            pr = pagerankScores[i] -  pagerankScores[j:][human_mask]  #difference pagerank score
            pr_mask = pr !=0
            if np.sum(pr_mask) > 0:  # check for at least one positive case
                # TODO: issue arises here; how to resolve when (human.shape != pr.shape) ?
                # once this `if ... else` block is fixed, it's done
                if np.sign(human) == np.sign(pr):
                    agree += 1   #they agree in which of the two is better
                else:
                    disagree +=1   #they do not agree in which of the two is better
            else:
                continue
        else:
            continue
    pairwise_accuracy = pairwise(agree, disagree)

    return(agree, disagree, total,  pairwise_accuracy)
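One possible way to complete the skeleton, offered as a sketch rather than the answerer's intended fix: apply a single shared mask to both sign arrays so that their shapes always match, then count the matches with vectorized sums instead of a scalar if. The function name pairwise_rowwise and the NaN result for an empty denominator are assumptions.

```python
import numpy as np

def pairwise_rowwise(human_scores, pagerank_scores):
    """One Python loop over i; the inner loop over j is replaced by broadcasting."""
    human_scores = np.asarray(human_scores)
    pagerank_scores = np.asarray(pagerank_scores)
    agree = 0
    disagree = 0
    for i in range(len(human_scores) - 1):
        # Signs of the differences between element i and every later element
        h_sign = np.sign(human_scores[i] - human_scores[i + 1:])
        p_sign = np.sign(pagerank_scores[i] - pagerank_scores[i + 1:])
        # A single shared mask keeps both arrays the same shape
        valid = (h_sign != 0) & (p_sign != 0)
        same = h_sign[valid] == p_sign[valid]
        agree += int(same.sum())
        disagree += int(valid.sum() - same.sum())
    total = agree + disagree
    accuracy = agree / total if total else float("nan")
    return agree, disagree, accuracy

print(pairwise_rowwise([1, 0, 0], [0.18, 0.5, 0.12]))
```

This keeps memory use at O(N) per iteration, unlike a full N x N difference matrix, at the cost of N Python-level iterations.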

Faster double iteration over a single array in Python
