pandas drop function: Unalignable boolean Series

Date: 2013-08-09 14:38:47

Tags: pandas

I have two DataFrames. The first, df0:

Name       CHR  MAPINFO     PMG         APA 
cg13869341  1   15865   0.8954256   0.8409144
cg14008030  1   18827   0.5941512   0.712414
cg12045430  1   29407   0.1110794   0.1302404
cg20826792  1   29425   0.177532    0.1304049
cg00381604  1   29435   0.09003246  0.04180672
cg20253340  1   68849   0.4738799   0.444899

And the second, df1:

probe   Chromosome  Gstart  Gend
A_23_P11744     1   4363    39806
A_33_P3365932   1   4363    39806
A_32_P923011    1   24554   46081

I want to iterate over df0["MAPINFO"], drop the rows that do not match the condition, and append the means to another df. My code is as follows:

for pos in df0['MAPINFO']:
    cond = (( pos < df1['Gstart']) & ( pos > df1['Gend']))
    print df0.drop(df0[cond].index.values).mean(axis=0, skipna=True, level=None)

This gives the following error message:

/usr/lib64/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/frame.py:2021: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
"DataFrame index.", UserWarning)
Traceback (most recent call last):
 File "/home/ferreirafm/bin/cpg_means.py", line 239, in <module>
main()
File "/home/ferreirafm/bin/cpg_means.py", line 231, in main
import2df(infprobe, infchrom)
File "/home/ferreirafm/bin/cpg_means.py", line 20, in import2df
df0.drop(df0[cond].index.values)#.mean(axis=0, skipna=True, level=None)
File "/usr/lib64/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 1995, in __getitem__
return self._getitem_array(key)
File "/usr/lib64/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 2027, in _getitem_array
key = _check_bool_indexer(self.index, key)
File "/usr/lib64/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/indexing.py", line 1017, in _check_bool_indexer
raise IndexingError('Unalignable boolean Series key provided')
pandas.core.indexing.IndexingError: Unalignable boolean Series key provided

I am almost sure this code used to work with a previous version of pandas, but I cannot figure out what is wrong. Any help is appreciated.

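Expected result: note that the last row of df0 should be dropped, because its MAPINFO value (68849) lies outside the Gstart to Gend ranges of df1. The result would then be the means of the columns of the df0 rows that were not dropped (the means of PMG and APA). That is, the resulting df would be: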

Name       CHR  MAPINFO     PMG         APA 
cg13869341  1   15865   0.8954256   0.8409144
cg14008030  1   18827   0.5941512   0.712414
cg12045430  1   29407   0.1110794   0.1302404
cg20826792  1   29425   0.177532    0.1304049
cg00381604  1   29435   0.09003246  0.04180672

The last row of df0, "cg20253340 1 68849 0.4738799 0.444899", was dropped, and the means are taken row-wise.
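
For reference, the error in the traceback can be reproduced with a small sketch. It assumes both frames use the default integer index (the post does not say how they were loaded) and keeps only the columns involved in the comparison. The point it illustrates: a comparison against df1 columns yields a boolean Series labelled by df1's index (0 to 2), while df0 has labels 0 to 5, so the mask cannot be aligned to df0. The name `error_name` is only illustrative.

```python
import pandas as pd

# Stand-ins for the two frames in the question (values copied from the post).
df0 = pd.DataFrame({"MAPINFO": [15865, 18827, 29407, 29425, 29435, 68849]})
df1 = pd.DataFrame({"Gstart": [4363, 4363, 24554], "Gend": [39806, 39806, 46081]})

pos = df0["MAPINFO"].iloc[0]
# Same comparison as in the question: the result is labelled by df1's index.
cond = (pos < df1["Gstart"]) & (pos > df1["Gend"])

print(list(cond.index))  # row labels of df1
print(list(df0.index))   # row labels of df0

# The mask has no entries for labels 3, 4 and 5, so it cannot be aligned to df0.
error_name = None
try:
    df0[cond]
except Exception as exc:  # broad on purpose: the exception's module differs across pandas versions
    error_name = type(exc).__name__
print(error_name)
```

Separately from the alignment problem, the condition `pos < Gstart` and `pos > Gend` can never hold when Gstart is not greater than Gend, so even an alignable mask built this way would select nothing.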

1 Answer:

Answer 0: (score: 1)

My solution is to build a boolean index that implements the inclusion criterion, and then use it:

import pandas as pd

df0 = pd.DataFrame.from_records([["cg13869341", 1, 15865, 0.8954256, 0.8409144],
                                 ["cg14008030", 1, 18827, 0.5941512, 0.712414],
                                 ["cg12045430", 1, 29407, 0.1110794, 0.1302404],
                                 ["cg20826792", 1, 29425, 0.177532, 0.1304049],
                                 ["cg00381604", 1, 29435, 0.09003246, 0.04180672],
                                 ["cg20253340", 1, 68849, 0.4738799, 0.444899]],
                                columns = ["Name", "CHR", "MAPINFO", "PMG", "APA"])

df1 = pd.DataFrame.from_records([["A_23_P11744", 1, 4363, 39806],
                                 ["A_33_P3365932", 1, 4363, 39806],
                                 ["A_32_P923011", 1, 24554, 46081]],
                                columns = ["probe", "Chromosome", "Gstart", "Gend"])

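# True for each df0 row whose MAPINFO lies inside at least one [Gstart, Gend] range of df1
F = df0.MAPINFO.apply(lambda x: ((df1.Gstart <= x) & (x <= df1.Gend)).any())
print(df0[F])  # as you expected

# mean by rows
res = df0[F].copy()  # copy, so the new column is not assigned to a slice of df0
res['mean'] = res[['PMG', 'APA']].mean(axis=1)
print(res)

# mean by columns
print(df0[F][['PMG', 'APA']].mean(axis=0))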
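
The `apply` call above evaluates one lambda per row of df0. As a possible alternative (not the method used in this answer), the same mask can be sketched with NumPy broadcasting, comparing every MAPINFO value against every interval at once. The names `inside`, `mask` and `kept` are illustrative, and the CHR and probe columns are left out for brevity.

```python
import pandas as pd

df0 = pd.DataFrame(
    {
        "Name": ["cg13869341", "cg14008030", "cg12045430",
                 "cg20826792", "cg00381604", "cg20253340"],
        "MAPINFO": [15865, 18827, 29407, 29425, 29435, 68849],
        "PMG": [0.8954256, 0.5941512, 0.1110794, 0.177532, 0.09003246, 0.4738799],
        "APA": [0.8409144, 0.712414, 0.1302404, 0.1304049, 0.04180672, 0.444899],
    }
)
df1 = pd.DataFrame({"Gstart": [4363, 4363, 24554], "Gend": [39806, 39806, 46081]})

pos = df0["MAPINFO"].to_numpy()[:, None]   # shape (6, 1): one row per df0 entry
start = df1["Gstart"].to_numpy()[None, :]  # shape (1, 3): one column per interval
end = df1["Gend"].to_numpy()[None, :]

# inside[i, j] is True when df0 row i lies within df1 interval j.
inside = (start <= pos) & (pos <= end)
mask = inside.any(axis=1)  # one boolean per df0 row, so it lines up with df0

kept = df0[mask]
print(kept["Name"].tolist())
print(kept[["PMG", "APA"]].mean(axis=0))  # mean by columns
```

Because `mask` has one entry per row of df0, it can also be combined with `drop`, as the question intended: `df0.drop(df0.index[~mask])` should select the same rows as `df0[mask]`.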