Question

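I want to select and change the value of a DataFrame cell. This DataFrame has two index levels: 'datetime' and 'idx'. Both contain unique, sequential labels. The 'datetime' level has labels of datetime type, and the 'idx' level has integer labels.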

import numpy as np
import pandas as pd

dt = pd.date_range("2010-10-01 00:00:00", periods=5, freq='H')
d = {'datetime': dt, 'a': np.arange(len(dt))-1,'b':np.arange(len(dt))+1}
df = pd.DataFrame(data=d)
df.set_index(keys='datetime',inplace=True,drop=True)
df.sort_index(axis=0,level='datetime',ascending=False,inplace=True)

df.loc[:,'idx'] = np.arange(0, len(df),1)+5
df.set_index('idx',drop=True,inplace=True,append=True)
print(df)

This is the DataFrame:

                         a  b
datetime            idx      
2010-10-01 04:00:00 5    3  5
2010-10-01 03:00:00 6    2  4
2010-10-01 02:00:00 7    1  3
2010-10-01 01:00:00 8    0  2
2010-10-01 00:00:00 9   -1  1

Say I want to get the row where idx = 5. How do I do that? I could use this:

print(df.iloc[0])

Then I get the following result:

a    3
b    5
Name: (2010-10-01 04:00:00, 5), dtype: int32

But I want to access and set the value in this cell by specifying the idx value and the column name, i.e. idx = 5 and column = 'a'. How do I do that?

Please advise.

Answer 1

You can use the DataFrame.query() method to query a MultiIndex DataFrame:

In [54]: df
Out[54]:
                         a  b
datetime            idx
2010-10-01 04:00:00 5    3  5
2010-10-01 03:00:00 6    2  4
2010-10-01 02:00:00 7    1  3
2010-10-01 01:00:00 8    0  2
2010-10-01 00:00:00 9   -1  1

In [55]: df.query('idx==5')
Out[55]:
                         a  b
datetime            idx
2010-10-01 04:00:00 5    3  5

In [56]: df.query('idx==5')['a']
Out[56]:
datetime             idx
2010-10-01 04:00:00  5      3
Name: a, dtype: int32

If you need to set/update some cells, you can also use the DataFrame.eval() method:

In [61]: df.loc[df.eval('idx==5'), 'a'] = 100

In [62]: df
Out[62]:
                           a  b
datetime            idx
2010-10-01 04:00:00 5    100  5
2010-10-01 03:00:00 6      2  4
2010-10-01 02:00:00 7      1  3
2010-10-01 01:00:00 8      0  2
2010-10-01 00:00:00 9     -1  1

Explanation:

In [59]: df.eval('idx==5')
Out[59]:
datetime             idx
2010-10-01 04:00:00  5       True
2010-10-01 03:00:00  6      False
2010-10-01 02:00:00  7      False
2010-10-01 01:00:00  8      False
2010-10-01 00:00:00  9      False
dtype: bool

In [60]: df.loc[df.eval('idx==5')]
Out[60]:
                         a  b
datetime            idx
2010-10-01 04:00:00 5    3  5
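The boolean mask that eval() produces can also be built directly from the index level. The following is only a minimal sketch, assuming the same data as in the question; the variable name mask is my own, and the date range uses a Timedelta frequency instead of the 'H' alias purely to avoid version-specific alias warnings.

```python
import numpy as np
import pandas as pd

# Rebuild the DataFrame from the question (datetime descending, idx = 5..9).
dt = pd.date_range("2010-10-01 00:00:00", periods=5, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"a": np.arange(5) - 1, "b": np.arange(5) + 1}, index=dt)
df.index.name = "datetime"
df = df.sort_index(ascending=False)
df["idx"] = np.arange(5) + 5
df = df.set_index("idx", append=True)

# Boolean array: True where the 'idx' level equals 5.
mask = df.index.get_level_values("idx") == 5

# Use the mask with .loc to update column 'a' for the matching row only.
df.loc[mask, "a"] = 100
print(df)
```

This avoids parsing an expression string, which may be convenient when the level name or the value comes from a variable.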

PS: if your original MultiIndex has no names, you can easily set them with the rename_axis() method:

df.rename_axis(('datetime','idx')).query(...)
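To make the rename_axis() hint concrete, here is a small sketch. It assumes a MultiIndex built without level names (the names unnamed and named are made up for illustration) and then queries it by the newly assigned level name.

```python
import numpy as np
import pandas as pd

# Same data as in the question, but the MultiIndex is built without names.
dt = pd.date_range("2010-10-01 00:00:00", periods=5, freq=pd.Timedelta(hours=1))
index = pd.MultiIndex.from_arrays([dt[::-1], np.arange(5) + 5])
unnamed = pd.DataFrame(
    {"a": np.arange(3, -2, -1), "b": np.arange(5, 0, -1)}, index=index
)

# rename_axis returns a new DataFrame whose index levels carry the given names.
named = unnamed.rename_axis(["datetime", "idx"])

# With named levels, query() can refer to 'idx' directly.
row = named.query("idx == 5")
print(row)
```

Note that rename_axis() does not modify unnamed in place here, so the result has to be kept in a variable or chained as in the one-liner above.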

An alternative (more expensive) solution is to use sort_index() + pd.IndexSlice[]:

In [106]: df.loc[pd.IndexSlice[:,5], ['a']]
...
skipped
...
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (0)'

So we need to sort the index first:

In [107]: df.sort_index().loc[pd.IndexSlice[:,5], ['a']]
Out[107]:
                         a
datetime            idx
2010-10-01 04:00:00 5    3
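The same sorted-index slice can be used as an assignment target. This is a sketch under the assumption that you are fine with working on a sorted copy (df_sorted is a name I introduce here); sort_index() returns a new DataFrame, so the original df keeps its order and values.

```python
import numpy as np
import pandas as pd

# Rebuild the DataFrame from the question (datetime descending, idx = 5..9).
dt = pd.date_range("2010-10-01 00:00:00", periods=5, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"a": np.arange(5) - 1, "b": np.arange(5) + 1}, index=dt)
df.index.name = "datetime"
df = df.sort_index(ascending=False)
df["idx"] = np.arange(5) + 5
df = df.set_index("idx", append=True)

# Sort first so that the MultiIndex is lexsorted, then slice on the second level.
df_sorted = df.sort_index()
df_sorted.loc[pd.IndexSlice[:, 5], "a"] = 100
print(df_sorted)
```

If the original row order matters, the assignment has to be carried back to df, or a mask-based approach used instead.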

Answer 2

Another approach.

Selecting a value:

df.xs(5, level=-1)

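Setting a value (note that DataFrame.set_value() was deprecated in pandas 0.21 and removed in 1.0, so this line only works on older versions):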

df.set_value(df.xs(5, level=-1).index, 'a', 100)
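Since set_value() is not available in pandas 1.0 and later, a rough equivalent can be written with xs() and .loc. This is only a sketch and not necessarily what the answer intended: it passes drop_level=False (my addition) so that the index returned by xs() keeps both levels and can be handed back to .loc; the name rows is hypothetical.

```python
import numpy as np
import pandas as pd

# Rebuild the DataFrame from the question (datetime descending, idx = 5..9).
dt = pd.date_range("2010-10-01 00:00:00", periods=5, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"a": np.arange(5) - 1, "b": np.arange(5) + 1}, index=dt)
df.index.name = "datetime"
df = df.sort_index(ascending=False)
df["idx"] = np.arange(5) + 5
df = df.set_index("idx", append=True)

# Full (datetime, idx) labels of every row whose 'idx' level equals 5.
rows = df.xs(5, level="idx", drop_level=False).index

# Assign through .loc using those labels and the column name.
df.loc[rows, "a"] = 100
print(df)
```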

Answer 3

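If this is going to be used in a loop over a large dataset, I found that it is about 20 times faster to first extract the DataFrame's column into a pandas Series and then carry out the selection and assignment operations on that.

Or

If the index labels happen to be consecutive integers, it is faster still (almost 10000 times faster) to convert to a numpy array.

MYGz's solution is good, but in my for-loop use case it was too slow to be practical, because these operations take up most of the run time.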
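The answer does not show code, so the following is just my guess at what the two variants could look like. The names s, b and offset are hypothetical, and the sketch makes no attempt to reproduce the timings mentioned above. It assumes the 'idx' labels are consecutive integers in row order, as in the question.

```python
import numpy as np
import pandas as pd

# Rebuild the DataFrame from the question (datetime descending, idx = 5..9).
dt = pd.date_range("2010-10-01 00:00:00", periods=5, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"a": np.arange(5) - 1, "b": np.arange(5) + 1}, index=dt)
df.index.name = "datetime"
df = df.sort_index(ascending=False)
df["idx"] = np.arange(5) + 5
df = df.set_index("idx", append=True)

# Variant 1: pull column 'a' out as a Series keyed by 'idx' only,
# do the per-label work in the loop, then write the column back once.
s = df["a"].droplevel("datetime").copy()
for label in (5, 7):
    s.loc[label] = 100
df["a"] = s.to_numpy()

# Variant 2: pull column 'b' out as a numpy array. Because the 'idx' labels
# are consecutive integers, label - offset gives the row position.
b = df["b"].to_numpy().copy()
offset = int(df.index.get_level_values("idx")[0])
for label in (6, 8):
    b[label - offset] = -1
df["b"] = b

print(df)
```

Both variants write the whole column back after the loop, so the DataFrame itself is touched only once per column.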

Slicing and assigning values in a pandas DataFrame with a multi-index of unique sequential labels
