How to fix an error when using the np.where function?

Time: 2019-03-30 15:16:02

Tags: python pandas numpy dataframe

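I am trying to recode column values in pandas using a combination of the "where" and "sample" functions. The desired result is to select 200 random rows from those labelled 'Low_Valence' in the 'valence_median_split' column, and 200 random rows from those labelled 'High_Valence'. However, this does not seem to work.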

Here is the df:

df.head()

Out[34]: 
              ID Category  Num Vert_Horizon Description  Fem_Valence_Mean  \
0  Animals_001_h  Animals    1            h  Dead Stork              2.40   
1  Animals_002_v  Animals    2            v        Lion              6.31   
2  Animals_003_h  Animals    3            h       Snake              5.14   
3  Animals_004_v  Animals    4            v        Wolf              4.55   
4  Animals_005_h  Animals    5            h         Bat              5.29   

   Fem_Valence_SD  Fem_Av_Ap_Mean  Fem_Av/Ap_SD  Arousal_Mean  \
0            1.30            3.03          1.47          6.72   
1            2.19            5.96          2.24          6.69   
2            1.19            5.14          1.75          5.34   
3            1.87            4.82          2.27          6.84   
4            1.56            4.61          1.81          5.50   

           ...           Luminance  Contrast  JPEG_size80   LABL   LABA  \
0          ...              126.05     68.45       263028  51.75  -0.39   
1          ...              123.41     32.34       250208  52.39  10.63   
2          ...              135.28     59.92       190887  55.45   0.25   
3          ...              122.15     75.10       282350  49.84   3.82   
4          ...              131.81     59.77       329325  54.26  -0.34   

    LABB  Entropy  Classification  temp_selection  valence_median_split  
0  16.93     7.86                            High           Low_Valence  
1  30.30     6.71                             NaN          High_Valence  
2   4.41     7.83                            High           Low_Valence  
3   1.36     7.69                            High           Low_Valence  
4  -0.95     7.82                            High           Low_Valence  

[5 rows x 35 columns]

Here is what I tried:

df['temp_selection'] = ''
df['temp_selection'] = np.where(df['valence_median_split'] == 'Low_Valence', df['valence_median_split'].sample(n=200).reindex(df.index), 'Low')
df['temp_selection'] = np.where(df['valence_median_split'] == 'High_Valence', df['valence_median_split'].sample(n=200).reindex(df.index), 'High')
df.temp_selection.unique()

However, the result shows that this does not work:

array(['High', nan, 'High_Valence'], dtype=object)

I am wondering whether there is a mistake in how I combined these functions.

Here is a reproducible example:

import numpy as np
import pandas as pd

d = {'col1': [1, 2, 3, 4, 3, 3, 2, 2], 'col2': [1, 2, 3, 4, 3, 3, 2, 2]}
df = pd.DataFrame(data=d)
df['valence_median_split'] = ''
#Get median of valence
valence_median = df['col1'].median()
df['valence_median_split'] = np.where(df['col2'] < valence_median, 'Low_Valence', 'High_Valence')
df['temp_selection'] = ''
df['temp_selection'] = np.where(df['valence_median_split'] == 'Low_Valence', df['valence_median_split'].sample(n=2).reindex(df.index), 'Low')
df['temp_selection'] = np.where(df['valence_median_split'] == 'High_Valence', df['valence_median_split'].sample(n=2).reindex(df.index), 'High')
df
   col1  col2 valence_median_split temp_selection
0     1     1          Low_Valence           High
1     2     2          Low_Valence           High
2     3     3         High_Valence   High_Valence
3     4     4         High_Valence            NaN
4     3     3         High_Valence            NaN
5     3     3         High_Valence   High_Valence
6     2     2          Low_Valence           High
7     2     2          Low_Valence           High

As the df above shows, 'temp_selection' contains a 'High_Valence' value that should not be there, and there is no 'Low' value at all.

1 answer:

Answer 0 (score: 1)

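In your code the second np.where call overwrites the result of the first, and sample is drawn from the whole column rather than only from the matching rows. The idea is to get the index of a sample of the filtered data and then use numpy.select instead of calling np.where twice: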

low = df.loc[df['valence_median_split'] == 'Low_Valence', 
                'valence_median_split'].sample(n=2).index
high = df.loc[df['valence_median_split'] == 'High_Valence',
                 'valence_median_split'].sample(n=2).index
df['temp_selection'] = np.select([df.index.isin(low), df.index.isin(high)],
                                 ['Low', 'High'], default=np.nan)

Or:

df['temp_selection'] = np.where(df.index.isin(low), 'Low', 
                       np.where(df.index.isin(high), 'High', np.nan))

print (df)
   col1  col2 valence_median_split temp_selection
0     1     1          Low_Valence            nan
1     2     2          Low_Valence            Low
2     3     3         High_Valence            nan
3     4     4         High_Valence            nan
4     3     3         High_Valence           High
5     3     3         High_Valence           High
6     2     2          Low_Valence            nan
7     2     2          Low_Valence            Low
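
Note that the unselected rows in this output print as lowercase nan. Mixing the string choices with np.nan appears to have produced a string array here, so those entries are the text 'nan' and not real missing values. A minimal sketch of the difference, using two small Series (as_text and as_missing are names made up for illustration, not part of the question's data):

```python
import numpy as np
import pandas as pd

# Hypothetical data: the middle entry is the text 'nan' in one Series
# and a real missing value in the other.
as_text = pd.Series(['Low', 'nan', 'High'])
as_missing = pd.Series(['Low', np.nan, 'High'], dtype=object)

# isna() only detects real missing values, not the text 'nan'.
print(as_text.isna().sum())
print(as_missing.isna().sum())

# One way to turn the text back into a missing value:
# mask() replaces entries where the condition is True with NaN.
cleaned = as_text.mask(as_text == 'nan')
print(cleaned.isna().sum())
```

If later steps rely on isna(), dropna() or fillna(), the variant that leaves real NaN values is probably the safer choice.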

Or:

df.loc[low, 'temp_selection'] = 'Low'
df.loc[high, 'temp_selection'] = 'High'
print (df)
   col1  col2 valence_median_split temp_selection
0     1     1          Low_Valence            NaN
1     2     2          Low_Valence            Low
2     3     3         High_Valence            NaN
3     4     4         High_Valence            NaN
4     3     3         High_Valence           High
5     3     3         High_Valence           High
6     2     2          Low_Valence            NaN
7     2     2          Low_Valence            Low
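
Both groups can also be sampled in one step. The sketch below is an alternative and not the method shown above: it assumes pandas 1.1 or later (for sample on a groupby object), and random_state=0 and the label_map dictionary are illustrative choices that do not come from the question.

```python
import numpy as np
import pandas as pd

# Rebuild the reproducible example from the question.
df = pd.DataFrame({'col1': [1, 2, 3, 4, 3, 3, 2, 2],
                   'col2': [1, 2, 3, 4, 3, 3, 2, 2]})
valence_median = df['col1'].median()
df['valence_median_split'] = np.where(df['col2'] < valence_median,
                                      'Low_Valence', 'High_Valence')

# Draw 2 rows from each group without replacement.
# random_state is an assumption, used only to make the run repeatable.
picked = df.groupby('valence_median_split').sample(n=2, random_state=0)

# Hypothetical mapping from group name to the short label.
label_map = {'Low_Valence': 'Low', 'High_Valence': 'High'}

# Assignment aligns on the index, so rows that were not picked get NaN.
df['temp_selection'] = picked['valence_median_split'].map(label_map)
print(df)
```

With the real data, n=2 would become n=200. Sampling without replacement raises a ValueError when a group has fewer rows than n, so it is worth checking the group sizes first.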

Another idea is to use numpy.random.choice:

# replace=False so that the same row cannot be drawn twice
low = np.random.choice(df.index[df['valence_median_split'] == 'Low_Valence'], size=2, replace=False)
high = np.random.choice(df.index[df['valence_median_split'] == 'High_Valence'], size=2, replace=False)

df.loc[low, 'temp_selection'] = 'Low'
df.loc[high, 'temp_selection'] = 'High'
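
One thing to watch with numpy.random.choice is that it samples with replacement unless replace=False is passed, so the same index can come back twice and fewer rows than requested end up labelled. A minimal sketch under some assumptions: it uses NumPy's Generator API (np.random.default_rng) with a fixed seed for repeatability, and pick is a hypothetical helper written for this example.

```python
import numpy as np
import pandas as pd

# Same group layout as the reproducible example in the question.
df = pd.DataFrame({'valence_median_split': [
    'Low_Valence', 'Low_Valence', 'High_Valence', 'High_Valence',
    'High_Valence', 'High_Valence', 'Low_Valence', 'Low_Valence']})

rng = np.random.default_rng(0)  # the seed is an assumption, for repeatability


def pick(label, n):
    """Hypothetical helper: n distinct index values of rows with this label."""
    candidates = df.index[df['valence_median_split'] == label].to_numpy()
    # replace=False guarantees that no index is returned twice.
    return rng.choice(candidates, size=n, replace=False)


low = pick('Low_Valence', 2)
high = pick('High_Valence', 2)

# Start from an object column of missing values, then label the picked rows.
df['temp_selection'] = pd.Series(np.nan, index=df.index, dtype=object)
df.loc[low, 'temp_selection'] = 'Low'
df.loc[high, 'temp_selection'] = 'High'
print(df)
```

For the real data the call would be pick('Low_Valence', 200), which requires at least 200 rows with that label.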