Faster double iteration over a single array in Python

Asked: 2019-04-25 16:06:04

Tags: python python-3.x pandas performance numpy

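I want to find a faster way to compute pairwise accuracy: compare every pair of elements within the same array (here, a pandas df column), compute the difference between them, and then compare the two differences obtained. I have a dataframe df with 3 columns: the id of the document, Judgement, which holds the human evaluation and is an int, and PR_Score, which holds the document's PageRank and is a float. I want to check whether the two scores agree on which document of each pair is ranked better or worse.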


For example:

id: id1, id2, id3

Judgement: 1, 0, 0

PR_Score: 0.18, 0.5, 0.12

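In this case, the two scores agree that id1 ranks better than id3, they disagree on the order of id1 and id2, and the human judgement is tied between id2 and id3 (so that pair is not counted). My pairwise accuracy is therefore:

agreement = 1

disagreement = 1

pairwise accuracy = agreement/(agreement+disagreement) = 1/2 = 0.5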
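As a sanity check on the arithmetic above, here is a minimal pure-Python sketch that recomputes the toy example. The names `sign` and `toy_pairwise_accuracy` are hypothetical helpers introduced only for this illustration, and returning NaN when no pair is comparable is an assumption, not something stated in the question.

```python
from itertools import combinations


def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def toy_pairwise_accuracy(judgements, pr_scores):
    """Count agreeing/disagreeing pairs; pairs tied in either score are skipped."""
    agree = disagree = 0
    for i, j in combinations(range(len(judgements)), 2):
        human = sign(judgements[i] - judgements[j])
        pr = sign(pr_scores[i] - pr_scores[j])
        if human == 0 or pr == 0:
            continue  # a tie in either score is not counted
        if human == pr:
            agree += 1
        else:
            disagree += 1
    counted = agree + disagree
    # Assumption: report NaN when no pair is comparable
    accuracy = agree / counted if counted else float("nan")
    return agree, disagree, accuracy


result = toy_pairwise_accuracy([1, 0, 0], [0.18, 0.5, 0.12])
print(result)  # (1, 1, 0.5)
```

This is the same quadratic loop as in the question, so it only serves to check small cases by hand.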


Here is the code of my first solution, in which I use the df columns as arrays (this helps reduce the computation time):

import numpy as np

def pairwise(agree, disagree):
    return(agree/(agree+disagree))

def pairwise_computing_array(df):

    humanScores = np.array(df['Judgement'])  
    pagerankScores =  np.array(df['PR_Score']) 

    total = 0 
    agree = 0
    disagree = 0

    for i in range(len(df)-1):  
        for j in range(i+1, len(df)):
            total += 1
            human = humanScores[i] -  humanScores[j] #difference human judg
            if human != 0:
                pr = pagerankScores[i] -  pagerankScores[j]#difference pagerank score
                if pr != 0:
                    if np.sign(human) == np.sign(pr):  
                        agree += 1 #they agree in which of the two is better
                    else:
                        disagree += 1 #they do not agree in which of the two is better

    pairwise_accuracy = pairwise(agree, disagree)

    return(agree, disagree, total,  pairwise_accuracy)


I tried a list comprehension to speed up the computation, but it is actually slower than the first solution:

def pairwise_computing_list_comprehension(df):

    humanScores = np.array(df['Judgement'])  
    pagerankScores =  np.array(df['PR_Score']) 

    sign = [np.sign(pagerankScores[i] - pagerankScores[j]) == np.sign(humanScores[i] - humanScores[j] ) 
            for i in range(len(df)) for j in range(i+1, len(df)) 
                if (np.sign(pagerankScores[i] - pagerankScores[j]) != 0 
                    and np.sign(humanScores[i] - humanScores[j])!=0)]

    agreement = sum(sign)
    disagreement = len(sign) -  agreement                             
    pairwise_accuracy = pairwise(agreement, disagreement)

    return(agreement, disagreement, pairwise_accuracy)

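I cannot run this on the whole dataset because it takes too long, so I am hoping for something that can be computed in under 1 minute.

On a small subset of 1000 rows, the computation on my machine gave the following timings: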

code1: 1.57 s ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

code2: 3.51 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

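3 Answers:

Answer 0: (score: 1)

You already have numpy arrays, so why not just use them? That lets you offload the work from Python to compiled C code (often, though not always):

First, reshape the vectors into 1xN matrices: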

# reshape returns a new 1xN view (ndarray.resize works in place and returns None)
humanScores = np.array(df['Judgement']).reshape((1,-1))
pagerankScores = np.array(df['PR_Score']).reshape((1,-1))

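Then compute the pairwise differences; we are only interested in their sign:

humanDiff = np.sign(humanScores - humanScores.T)
pagerankDiff = np.sign(pagerankScores - pagerankScores.T)

np.sign returns only -1, 0, or 1 for both the integer judgements and the float PageRank scores, so the results can then be counted: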

agree = ((humanDiff != 0) & (pagerankDiff != 0) & (humanDiff == pagerankDiff)).sum()
disagree = ((humanDiff != 0) & (pagerankDiff != 0) & (humanDiff != pagerankDiff)).sum()

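However, the counts above count every pair twice, because entries (i,j) and (j,i) have exactly opposite signs in both humanDiff and pagerankDiff. You can instead sum only over the upper triangular part of the square matrix: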

# apply triu to the combined mask with k=1 so that only pairs with i < j are counted
agree = np.triu((humanDiff != 0) &
                (pagerankDiff != 0) &
                (humanDiff == pagerankDiff),
                k=1).sum()
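Putting these steps together, a self-contained sketch might look like the following. The function name `pairwise_accuracy_vectorized` is hypothetical, it uses np.sign so that float PageRank differences are handled, and it returns NaN when no pair is comparable (an assumption). Note that it builds several N×N arrays: for the 58,000 rows mentioned elsewhere on this page that is over three billion entries per array, so this approach is probably only practical for smaller inputs or would need to be processed in chunks.

```python
import numpy as np


def pairwise_accuracy_vectorized(human_scores, pagerank_scores):
    """Vectorized pairwise accuracy; memory use grows as N squared."""
    h = np.asarray(human_scores, dtype=float)
    p = np.asarray(pagerank_scores, dtype=float)

    # Entry [i, j] holds the sign of x[i] - x[j]
    h_sign = np.sign(h[:, None] - h[None, :])
    p_sign = np.sign(p[:, None] - p[None, :])

    # Keep each unordered pair once (i < j) and drop ties in either score
    upper = np.triu(np.ones((h.size, h.size), dtype=bool), k=1)
    comparable = upper & (h_sign != 0) & (p_sign != 0)

    agree = int(np.count_nonzero(comparable & (h_sign == p_sign)))
    disagree = int(np.count_nonzero(comparable)) - agree
    counted = agree + disagree
    # Assumption: report NaN when no pair is comparable
    accuracy = agree / counted if counted else float("nan")
    return agree, disagree, accuracy


vec_result = pairwise_accuracy_vectorized([1, 0, 0], [0.18, 0.5, 0.12])
print(vec_result)  # (1, 1, 0.5)
```

With a DataFrame, the inputs would be passed as `df['Judgement'].values` and `df['PR_Score'].values`.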

Answer 1: (score: 1)

Here is code that runs in a reasonable time, thanks to @juanpa.arrivillaga's suggestion:

import numpy as np  # np.sign is used inside the jitted function
from numba import jit

@jit(nopython = True)
def pairwise_computing(humanScores, pagerankScores):

    total = 0 
    agree = 0
    disagree = 0

    for i in range(len(humanScores)-1):  
        for j in range(i+1, len(humanScores)):
            total += 1
            human = humanScores[i] -  humanScores[j] #difference human judg
            if human != 0:
                pr = pagerankScores[i] -  pagerankScores[j]#difference pagerank score
                if pr != 0:
                    if np.sign(human) == np.sign(pr):  
                        agree += 1 #they agree in which of the two is better
                    else:
                        disagree +=1 #they do not agree in which of the two is better
                else:
                    continue   
            else:
                continue
    pairwise_accuracy = agree/(agree+disagree)
    return(agree, disagree, total,  pairwise_accuracy)

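This is the timing on my whole dataset (58,000 rows):

7.98 s ± 2.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Answer 2: (score: 1)

By taking advantage of broadcasting, you can get rid of the inner for loop, since index j always starts 1 ahead of index i (i.e. we never look back). However, there is a small problem with computing agreement/disagreement in the following line: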

if np.sign(human) == np.sign(pr):

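I don't know how to resolve it. Since you understand the problem better, I am only providing skeleton code here that needs some more tweaking to work. Here it is: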

def pairwise_computing_array(df):

    humanScores = df['Judgement'].values
    pagerankScores = df['PR_Score'].values 

    total = 0 
    agree = 0
    disagree = 0

    for i in range(len(df)-1):
        j = i+1
        human = humanScores[i] -  humanScores[j:]   #difference human judg
        human_mask = human != 0
        if np.sum(human_mask) > 0:  # check for at least one positive case
            pr = pagerankScores[i] -  pagerankScores[j:][human_mask]  #difference pagerank score
            pr_mask = pr !=0
            if np.sum(pr_mask) > 0:  # check for at least one positive case
                # TODO: issue arises here; how to resolve when (human.shape != pr.shape) ?
                # once this `if ... else` block is fixed, it's done
                if np.sign(human) == np.sign(pr):
                    agree += 1   #they agree in which of the two is better
                else:
                    disagree +=1   #they do not agree in which of the two is better
            else:
                continue
        else:
            continue
    pairwise_accuracy = pairwise(agree, disagree)

    return(agree, disagree, total,  pairwise_accuracy)
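One possible way to finish the skeleton above, offered as a sketch rather than as the answerer's intended fix: take the sign of both difference vectors first and apply a single combined mask to both, so their shapes always match, then count matches per row instead of testing a whole array in an `if`. The function name `pairwise_computing_rowwise` is hypothetical, and returning NaN when no pair is comparable is an assumption.

```python
import numpy as np


def pairwise_computing_rowwise(human_scores, pagerank_scores):
    """Pairwise accuracy with the inner loop replaced by array operations."""
    human_scores = np.asarray(human_scores)
    pagerank_scores = np.asarray(pagerank_scores)
    n = len(human_scores)

    total = n * (n - 1) // 2  # number of pairs with i < j
    agree = 0
    disagree = 0

    for i in range(n - 1):
        # Signs of the differences between element i and everything after it
        human = np.sign(human_scores[i] - human_scores[i + 1:])
        pr = np.sign(pagerank_scores[i] - pagerank_scores[i + 1:])

        # One mask for both arrays keeps their shapes identical
        valid = (human != 0) & (pr != 0)
        same = human[valid] == pr[valid]

        n_agree = int(np.count_nonzero(same))
        agree += n_agree
        disagree += int(same.size) - n_agree

    counted = agree + disagree
    # Assumption: report NaN when no pair is comparable
    accuracy = agree / counted if counted else float("nan")
    return agree, disagree, total, accuracy


row_result = pairwise_computing_rowwise([1, 0, 0], [0.18, 0.5, 0.12])
print(row_result)  # (1, 1, 3, 0.5)
```

This still loops over the rows in Python, but each iteration only allocates vectors of length N rather than a full N×N matrix.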