Select rows from the highest-scoring group based on two columns

Time: 2019-01-02 14:20:26

Tags: python pandas

Data

     Sentence  Score_Unigram  Score_Bigram  versionId
0    As of   Dat              5             1  269004158
1     Date Docum              4             3  269004158
2    As of   Dat              4             1  269004158
3     Date Docum              5             3  345973060
4    x Indicate               4             1  372529352
5     Date Docum              5             3  372529352
6   1 Financial               9             1  372529352
7   020 per shar              2             0  372529352
8     Date $ in               8             1  372529352
9     Date $ in               9             4  372529352
10   4 ---------              4             1  372529352
11    Date Begin              1             0  372529352

Required output

       Sentence  Score_Unigram  Score_Bigram  versionId
0   As of   Dat              5             1  269004158
3    Date Docum              5             3  345973060
9    Date $ in               9             4  372529352
  

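Objective

Group by versionId and take the rows with the maximum Score_Unigram. If more than one row results, check the Score_Bigram column and take the row with the highest value (if there are several such rows, return all of them).

What I tried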

maximum = 0
index_to_pick = []

for index,row_data in a.iterrows():
    if row_data['Score_Unigram'] > maximum:
        maximum = row_data['Score_Unigram']
        score_bigram = row_data['Score_Bigram']
        index_to_pick.append(index)

    elif row_data['Score_Unigram'] == maximum:
        if row_data['Score_Bigram'] > score_bigram:

            maximum = row_data['Score_Unigram']
            score_bigram = row_data['Score_Bigram']
            index_to_pick = []
            index_to_pick.append(index)

        elif row_data['Score_Bigram'] == score_bigram:
            index_to_pick.append(index)

a.loc[[index_to_pick[0]]]

Output

       Sentence  Score_Unigram  Score_Bigram  versionId
5    Date $ in               9             4  372529352

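I suspect this approach is not a good one because the data is large, so I am looking for an efficient way. I tried idxmax, but it only returns the first match. This may be a duplicate, but I could not find one. Thanks for your help!

3 answers:

Answer 0: (score: 2)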

Use double filtering with boolean indexing: first filter on the max of the first column, Score_Unigram, then filter a second time on Score_Bigram.
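The double filter described in this answer might look like the sketch below. It assumes the maxima are taken per versionId group with groupby(...).transform('max'), as the question's objective asks; the names step1 and result are made up for illustration, and the sample frame is rebuilt from the question's data.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Sentence": ["As of   Dat", "Date Docum", "As of   Dat", "Date Docum",
                 "x Indicate", "Date Docum", "1 Financial", "020 per shar",
                 "Date $ in", "Date $ in", "4 ---------", "Date Begin"],
    "Score_Unigram": [5, 4, 4, 5, 4, 5, 9, 2, 8, 9, 4, 1],
    "Score_Bigram": [1, 3, 1, 3, 1, 3, 1, 0, 1, 4, 1, 0],
    "versionId": [269004158] * 3 + [345973060] + [372529352] * 8,
})

# Step 1: keep rows whose Score_Unigram equals the maximum of their versionId group
unigram_max = df.groupby("versionId")["Score_Unigram"].transform("max")
step1 = df[df["Score_Unigram"] == unigram_max]

# Step 2: among the surviving rows only, keep those whose Score_Bigram
# equals the maximum of their versionId group
bigram_max = step1.groupby("versionId")["Score_Bigram"].transform("max")
result = step1[step1["Score_Bigram"] == bigram_max]

print(result)
```

Because both steps compare with == instead of picking a single index, rows that tie on both columns are all kept. On the sample data this selects the rows with index 0, 3 and 9.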

Answer 1: (score: 1)

Try this on your df:

df.sort_values(['Score_Unigram','Score_Bigram'],ascending=False).head(1)

Output:

    Sentence     Score_Unigram  Score_Bigram  versionId
5   Date $ in               9             4  372529352
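The one-liner above returns a single row for the whole frame it is applied to. A possible per-group variant, sketched here under the assumption that one row per versionId is enough, sorts the same way and then takes the first row of each group. The name top_per_group is made up for illustration, and the Sentence column is left out for brevity.

```python
import pandas as pd

# Score columns and versionId copied from the question (Sentence omitted)
df = pd.DataFrame({
    "Score_Unigram": [5, 4, 4, 5, 4, 5, 9, 2, 8, 9, 4, 1],
    "Score_Bigram": [1, 3, 1, 3, 1, 3, 1, 0, 1, 4, 1, 0],
    "versionId": [269004158] * 3 + [345973060] + [372529352] * 8,
})

top_per_group = (
    # Highest Score_Unigram first, ties broken by highest Score_Bigram
    df.sort_values(["Score_Unigram", "Score_Bigram"], ascending=False)
      # head(1) keeps the first row of each group in the sorted order
      .groupby("versionId")
      .head(1)
      # Restore the original row order
      .sort_index()
)

print(top_per_group)
```

Note that head(1) keeps exactly one row per group, so rows that tie on both score columns would be dropped, which differs from the question's request to return all of them.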

Answer 2: (score: 1)

I believe you don't need to sort the data; just compare it against the max values of these two columns:

df[ (df['Score_Unigram'] == df['Score_Unigram'].max()) & 
    (df['Score_Bigram'] == df['Score_Bigram'].max()) ]
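This comparison matches a row only when the same row holds the maximum of both columns. The sketch below uses a small hypothetical frame (not the question's data) where that is not the case, next to a two-step variant that takes the Score_Bigram maximum only among the rows with the top Score_Unigram. The names both_max, top_unigram and two_step are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: the highest Score_Bigram (5) sits on a row
# with a low Score_Unigram
df = pd.DataFrame({
    "Score_Unigram": [9, 9, 2],
    "Score_Bigram": [1, 4, 5],
})

# Comparing against both column-wide maxima at once matches no row here
both_max = df[(df["Score_Unigram"] == df["Score_Unigram"].max())
              & (df["Score_Bigram"] == df["Score_Bigram"].max())]

# Two-step variant: restrict to the top Score_Unigram rows first,
# then take the Score_Bigram maximum among those rows only
top_unigram = df[df["Score_Unigram"] == df["Score_Unigram"].max()]
two_step = top_unigram[top_unigram["Score_Bigram"] == top_unigram["Score_Bigram"].max()]

print(both_max)
print(two_step)
```

Like the answer's snippet, this works on a single frame with no grouping; it would need to be applied per versionId to cover the full question.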