Drop columns whose value counts meet a condition (Pandas)

Time: 2017-07-21 14:26:21

Tags: python pandas dataframe

I have a df in the format below, with ~70,000 columns and 540 rows. All values are 0.0, 0.5 or 1.0.

 VAR         1_139632_G  1_158006_T  1_172595_A  1_564650_A  1_564652_G  \
 SRR4216489         0.5         0.5         0.5         0.5         0.5   
 SRR4216786         0.5         0.5         0.5         0.5         0.5   
 SRR4216628         0.5         0.0         1.0         0.0         0.0   
 SRR4216456         0.5         0.5         0.5         0.5         0.5   
 SRR4216393         0.5         0.5         0.5         0.5         0.5   

I want to delete all columns in which the count of '0.5' values is only 1 less than the number of rows. So far I have tried this:

total_samples = len(df.index)  # number of rows
df_col_05 = df[df == 0.5].count()  # Series of column-wise counts of 0.5
df_col_05 = df_col_05.where(df_col_05 < (total_samples-1))  # NaN where the condition isn't met

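What I want is my original df with all columns removed where the value in df_col_05 is >= (total_samples-1), so basically with the columns removed where 'df_col_05' has a NaN, but I don't know how to do that.

I'm sure this should be easy for someone with more pandas experience than me (I started a few days ago).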

1 answer:

Answer 0: (score: 4)

You can use boolean indexing with loc to filter columns. It is better to use sum to get the number of True values in the boolean DataFrame:

#if first column is not index set it
df = df.set_index('VAR')
df1 = df.loc[:, (df == 0.5).sum() >= len(df.index)-1]

#changed values in last 2 columns
print (df)
          VAR  1_139632_G  1_158006_T  1_172595_A  1_564650_A  1_564652_G
0  SRR4216489         0.5         0.5         0.5         0.0         0.0
1  SRR4216786         0.5         0.5         0.5         0.0         0.5
2  SRR4216628         0.5         0.0         1.0         0.0         0.0
3  SRR4216456         0.5         0.5         0.5         0.5         0.5
4  SRR4216393         0.5         0.5         0.5         0.5         0.5

print (df[df == 0.5].count())
VAR           0
1_139632_G    5
1_158006_T    4
1_172595_A    4
1_564650_A    2
1_564652_G    3
dtype: int64

print ((df == 0.5).sum())
VAR           0
1_139632_G    5
1_158006_T    4
1_172595_A    4
1_564650_A    2
1_564652_G    3
dtype: int64
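
The two outputs above match because summing a boolean frame counts its True values, while masking and then calling count() counts the non-NaN cells that survive the mask. A minimal sketch of that equivalence, assuming a small made-up frame (the column names `a`, `b`, `c` are illustrative, not from the question):

```python
import pandas as pd

# Toy data; values are illustrative only.
df = pd.DataFrame(
    {
        "a": [0.5, 0.5, 0.5, 0.5, 0.5],
        "b": [0.5, 0.5, 0.0, 0.5, 0.5],
        "c": [0.0, 0.5, 0.0, 0.5, 0.5],
    }
)

# Masking keeps 0.5 and turns every other cell into NaN; count() skips NaN.
via_count = df[df == 0.5].count()

# Comparing yields booleans; sum() treats True as 1 and False as 0.
via_sum = (df == 0.5).sum()

# Element-wise comparison of the two per-column counts.
print((via_count == via_sum).all())
```

The sum version avoids building an intermediate frame full of NaN, which is likely to matter with ~70,000 columns.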

Sample:

#if first column is not index set it
df = df.set_index('VAR')

print ((df == 0.5).sum() >= len(df.index)-1)
1_139632_G     True
1_158006_T     True
1_172595_A     True
1_564650_A    False
1_564652_G    False
dtype: bool

print (df.loc[:, (df == 0.5).sum() >= len(df.index)-1])
            1_139632_G  1_158006_T  1_172595_A
VAR                                           
SRR4216489         0.5         0.5         0.5
SRR4216786         0.5         0.5         0.5
SRR4216628         0.5         0.0         1.0
SRR4216456         0.5         0.5         0.5
SRR4216393         0.5         0.5         0.5
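
Note that the selection above keeps the columns where the condition is True. The question asks to remove those columns, and if that is what is wanted the mask can be negated with `~`. A minimal sketch, rebuilding the sample frame by hand (the variable names `mostly_half` and `dropped` are made up for illustration):

```python
import pandas as pd

# Sample frame from this answer, with VAR as the index.
df = pd.DataFrame(
    {
        "VAR": ["SRR4216489", "SRR4216786", "SRR4216628", "SRR4216456", "SRR4216393"],
        "1_139632_G": [0.5, 0.5, 0.5, 0.5, 0.5],
        "1_158006_T": [0.5, 0.5, 0.0, 0.5, 0.5],
        "1_172595_A": [0.5, 0.5, 1.0, 0.5, 0.5],
        "1_564650_A": [0.0, 0.0, 0.0, 0.5, 0.5],
        "1_564652_G": [0.0, 0.5, 0.0, 0.5, 0.5],
    }
).set_index("VAR")

# True for columns where the 0.5 count is >= number of rows - 1.
mostly_half = (df == 0.5).sum() >= len(df.index) - 1

# ~ flips the mask, so those columns are dropped instead of kept.
dropped = df.loc[:, ~mostly_half]

print(dropped.columns.tolist())
```

With this sample only `1_564650_A` and `1_564652_G` remain, since their 0.5 counts (2 and 3) are below the threshold of 4.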

Another solution without set_index; you only need to define the columns that are always needed in the output:

m = (df == 0.5).sum() >= len(df.index)-1
print (m)
VAR           False
1_139632_G     True
1_158006_T     True
1_172595_A     True
1_564650_A    False
1_564652_G    False
dtype: bool

need_cols = ['VAR']
m.loc[need_cols] = True
print (m)
VAR            True
1_139632_G     True
1_158006_T     True
1_172595_A     True
1_564650_A    False
1_564652_G    False
dtype: bool

print (df.loc[:, m])
          VAR  1_139632_G  1_158006_T  1_172595_A
0  SRR4216489         0.5         0.5         0.5
1  SRR4216786         0.5         0.5         0.5
2  SRR4216628         0.5         0.0         1.0
3  SRR4216456         0.5         0.5         0.5
4  SRR4216393         0.5         0.5         0.5

A similar solution is to filter the column names separately and then select:

print (df[df.columns[m]])
          VAR  1_139632_G  1_158006_T  1_172595_A
0  SRR4216489         0.5         0.5         0.5
1  SRR4216786         0.5         0.5         0.5
2  SRR4216628         0.5         0.0         1.0
3  SRR4216456         0.5         0.5         0.5
4  SRR4216393         0.5         0.5         0.5
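
Returning to the attempt in the question: the Series produced by where() already marks the unwanted columns with NaN, so it can be turned into a column mask with notna(). A minimal sketch under that reading of the question, using a made-up frame (the column names `x`, `y`, `z` and the result names `kept` and `kept_via_drop` are illustrative):

```python
import pandas as pd

# Toy data; values are illustrative only.
df = pd.DataFrame(
    {
        "x": [0.5, 0.5, 0.5, 0.5],  # 4 of 4 values are 0.5
        "y": [0.5, 0.0, 0.5, 0.5],  # 3 of 4 values are 0.5
        "z": [0.0, 1.0, 0.5, 0.0],  # 1 of 4 values is 0.5
    }
)

total_samples = len(df.index)
df_col_05 = (df == 0.5).sum()
# NaN wherever the count is NOT below total_samples - 1.
df_col_05 = df_col_05.where(df_col_05 < (total_samples - 1))

# Keep only the columns whose count survived (is not NaN).
kept = df.loc[:, df_col_05.notna()]

# Equivalent: drop the columns whose count became NaN.
kept_via_drop = df.drop(columns=df_col_05.index[df_col_05.isna()])

print(kept.columns.tolist())
```

Both forms leave only `z` here. Building the boolean mask directly, as in the answer, skips the intermediate where() step.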