Select rows of a DataFrame based on column values

Time: 2019-06-07 08:59:19

Tags: pandas dataframe

Question

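I am working on a machine learning project and want to inspect the raw data (text) on which the classifiers tend to make mistakes, as well as the data on which they do not reach a consensus.

I currently have a DataFrame with the labels, the prediction results of 2 classifiers, and the text data. I would like to know whether there is a simple way to select rows based on set operations over the prediction and label columns.

The data might look like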

   score                                             review     svm_pred  dnn_pred
0      0  I went and saw this movie last night after bei...            0         1
1      1  Actor turned director Bill Paxton follows up h...            1         1
2      1  As a recreational golfer with some knowledge o...            0         1
3      1  I saw this film in a sneak preview, and it is ...            1         1
4      1  Bill Paxton has taken the true story of the 19...            1         1
5      1  I saw this film on September 1st, 2005 in Indi...            1         1
6      1  Maybe I'm reading into this too much, but I wo...            0         1
7      1  I felt this film did have many good qualities....            1         1
8      1  This movie is amazing because the fact that th...            1         1
9      0  "Quitting" may be as much about exiting a pre-...            1         1


For example, if I want to select the rows where both classifiers are wrong, index 9 would be returned.

A made-up MWE dataset is provided here:

import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 2, 30).reshape(10, 3), columns=["score", "svm_pred", "dnn_pred"])

which gives

   score  svm_pred  dnn_pred
0      0         1         0
1      0         0         1
2      0         0         0
3      1         0         0
4      0         0         1
5      0         1         1
6      1         0         1
7      0         1         1
8      1         1         1
9      1         1         1

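What I have tried

I know I could enumerate all possible combinations (000, 001, etc.). However,

  • this is not feasible when I want to compare more classifiers.
  • this does not work for multi-class classification problems.

Can anyone help me? Thanks.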

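Why this question is not a duplicate

The existing answers only consider cases with a limited number of columns. In my application, however, the number of predictions given by classifiers (i.e. columns) may be large, which makes the existing answers not quite applicable.

Also, this is the first time I have seen the pd.Series.ne function used in this kind of application, which may shed some light for people with similar confusion.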

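2 Answers:

Answer 0 (score: 1)

You can combine element-wise boolean conditions with & and | (the set-style operations you describe) when selecting rows: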

# index of rows where both the SVM and the DNN prediction equal the score
df[(df['score'] == df['svm_pred']) & (df['score'] == df['dnn_pred'])].index

# index of rows where both predictions are wrong
df[(df['score'] != df['svm_pred']) & (df['score'] != df['dnn_pred'])].index

# index of rows where at least one prediction is wrong
df[(df['score'] != df['svm_pred']) | (df['score'] != df['dnn_pred'])].index

If you want the full rows rather than just the index, drop the trailing .index:

# rows where at least one prediction is wrong
df[(df['score'] != df['svm_pred']) | (df['score'] != df['dnn_pred'])]
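
The pairwise conditions above grow with every classifier that is added. One possible generalization, which is a sketch and not part of the original answer, compares a list of prediction columns against the label in one step and reduces the result with all/any. The name pred_cols is a hypothetical helper, and the DataFrame simply writes out the MWE table from the question.

```python
import pandas as pd

# Same values as the MWE table in the question, written out explicitly.
df = pd.DataFrame({
    "score":    [0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
    "svm_pred": [1, 0, 0, 0, 0, 1, 0, 1, 1, 1],
    "dnn_pred": [0, 1, 0, 0, 1, 1, 1, 1, 1, 1],
})

# Hypothetical list of prediction columns; extend it as classifiers are added.
pred_cols = ["svm_pred", "dnn_pred"]

# Boolean frame: True where a prediction differs from the label.
# axis=0 aligns the label Series with the rows of the prediction frame.
wrong = df[pred_cols].ne(df["score"], axis=0)

all_wrong = df[wrong.all(axis=1)]    # every classifier is wrong
any_wrong = df[wrong.any(axis=1)]    # at least one classifier is wrong
all_right = df[~wrong.any(axis=1)]   # every classifier is right

print(all_wrong.index.tolist())  # [3, 5, 7]
```

Because the comparison is element-wise rather than an enumeration of 0/1 patterns, the same code should also work when the labels have more than two classes.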

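Answer 1 (score: 1)

Create a helper Series holding the number of classifiers that got each row wrong, and then apply logical operations to that Series. This assumes the true score is in the first column and the predictions are in the second column onward; you may need to adjust the slice indices accordingly.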

s = df.iloc[:, 1:].ne(df.iloc[:, 0], axis=0).sum(1)
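
To make the one-liner easier to follow, here is a step-by-step sketch of the same computation. It selects columns by name instead of by position, assuming (as in the example data) that every prediction column ends in _pred; the variable names labels, preds and mismatch are illustrative only.

```python
import pandas as pd

# Same values as the MWE table in the question, written out explicitly.
df = pd.DataFrame({
    "score":    [0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
    "svm_pred": [1, 0, 0, 0, 0, 1, 0, 1, 1, 1],
    "dnn_pred": [0, 1, 0, 0, 1, 1, 1, 1, 1, 1],
})

labels = df["score"]
# Assumption: prediction columns share the "_pred" suffix, so other columns
# (such as a text column) are left out automatically.
preds = df.filter(like="_pred")

# Step 1: element-wise "not equal". axis=0 matches `labels` to the rows of
# `preds` by index instead of trying to match it against column names.
mismatch = preds.ne(labels, axis=0)

# Step 2: count the True values per row, i.e. the number of wrong classifiers.
s = mismatch.sum(axis=1)

print(s.tolist())  # [1, 1, 0, 2, 1, 2, 1, 2, 0, 0]
```

Selecting by name avoids having to update slice indices when the column order changes, at the cost of relying on a naming convention.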

Example usage:

import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 2, 30).reshape(10, 3),
                  columns=["score", "svm_pred", "dnn_pred"])

s = df.iloc[:, 1:].ne(df.iloc[:, 0], axis=0).sum(1)

# Return rows where all classifiers got it right
df[s.eq(0)]

   score  svm_pred  dnn_pred
2      0         0         0
8      1         1         1
9      1         1         1

# Return rows where exactly 1 classifier got it wrong
df[s.eq(1)]

   score  svm_pred  dnn_pred
0      0         1         0
1      0         0         1
4      0         0         1
6      1         0         1

# Return rows where all classifiers got it wrong (here, both of them)
df[s.eq(2)]

   score  svm_pred  dnn_pred
3      1         0         0
5      0         1         1
7      0         1         1
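
The question also asks about rows on which the classifiers do not agree with each other, and about multi-class labels. The following is a minimal sketch of how the helper Series could be extended to both, not something taken from the answer itself. The three-class data and the rf_pred column are made up for illustration.

```python
import pandas as pd

# Made-up three-class labels with three hypothetical classifiers.
df = pd.DataFrame({
    "score":    [0, 1, 2, 2, 1],
    "svm_pred": [0, 1, 0, 1, 1],
    "dnn_pred": [0, 2, 0, 2, 1],
    "rf_pred":  [0, 1, 0, 0, 2],
})
preds = df.filter(like="_pred")

# Number of classifiers that disagree with the label, per row.
s = preds.ne(df["score"], axis=0).sum(axis=1)

# "All wrong" without hard-coding the classifier count.
all_wrong = df[s.eq(preds.shape[1])]

# Rows where the classifiers disagree among themselves, regardless of the
# label: more than one distinct predicted value in the row.
disagree = df[preds.nunique(axis=1).gt(1)]

print(all_wrong.index.tolist())  # [2]
print(disagree.index.tolist())   # [1, 3, 4]
```

Note that row 2 is wrong for every classifier even though they all agree with each other, so "all wrong" and "no consensus" are separate filters.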