Question

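I want to read only specific columns from an HDF5 file and apply a condition on those columns. My concern is that I don't want to load the whole HDF5 file into memory as a DataFrame. I only want the columns I need, filtered by the condition.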

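import os
import pandas as pd

columns = ['col1', 'col2']
condition = "col2==1"
# HDF5 keys use forward slashes; '\t' inside a normal string is a tab character
groupname = '/path/to/group'
Hdf5File = os.path.join('path', 'to', 'hdf5.h5')
# HDFStore itself takes no `format` argument; the format is fixed when the data is written
with pd.HDFStore(Hdf5File, mode='r') as store:
    if groupname in store:
        df = pd.read_hdf(store, key=groupname, columns=columns, where=[condition])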

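I get this error:

TypeError: cannot pass a column specification when reading a Fixed format store. this store must be selected in its entirety

Then I used the line below to return only specific columns: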

df=store[groupname][columns]

But I don't know how to pass the condition.

Answer 1

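To read an HDF5 file conditionally, it must be saved in table format, and the columns used in the condition must be indexed (stored as data columns).

Demo: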

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(100,5), columns=list('abcde'))
df.to_hdf('c:/temp/file.h5', key='df_key', format='table', data_columns=True)  # every column becomes queryable

In [10]: pd.read_hdf('c:/temp/file.h5', 'df_key', where="a > 0.5 and a < 0.75")
Out[10]:
           a         b         c         d         e
3   0.744123  0.515697  0.005335  0.017147  0.176254
5   0.555202  0.074128  0.874943  0.660555  0.776340
6   0.667145  0.278355  0.661728  0.705750  0.623682
8   0.701163  0.429860  0.223079  0.735633  0.476182
14  0.645130  0.302878  0.428298  0.969632  0.983690
15  0.633334  0.898632  0.881866  0.228983  0.216519
16  0.535633  0.906661  0.221823  0.608291  0.330101
17  0.715708  0.478515  0.002676  0.231314  0.075967
18  0.587762  0.262281  0.458854  0.811845  0.921100
21  0.551251  0.537855  0.906546  0.169346  0.063612
..       ...       ...       ...       ...       ...
68  0.610958  0.874373  0.785681  0.147954  0.966443
72  0.619666  0.818202  0.378740  0.416452  0.903129
73  0.500782  0.536064  0.697678  0.654602  0.054445
74  0.638659  0.518900  0.210444  0.308874  0.604929
76  0.696883  0.601130  0.402640  0.150834  0.264218
77  0.692149  0.963457  0.364050  0.152215  0.622544
85  0.737854  0.055863  0.346940  0.003907  0.678405
91  0.644924  0.840488  0.151190  0.566749  0.181861
93  0.710590  0.900474  0.061603  0.144200  0.946062
95  0.601144  0.288909  0.074561  0.615098  0.737097

[33 rows x 5 columns]
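The demo above filters rows but still returns every column. A minimal sketch of combining `columns` and `where` in one call, which is what the question asks for, might look like the following. The helper name `read_columns_where` is hypothetical, and it assumes the key was written with `format='table'` and that the column used in the condition is a data column.

```python
import pandas as pd


def read_columns_where(path, key, columns, where):
    """Read only `columns` from `key`, keeping rows that match `where`.

    Assumes the data was stored in table format and that every column
    referenced in `where` was listed in `data_columns` when it was written.
    """
    # Both the row filter and the column selection are handled by PyTables,
    # so the full table is not materialized as a DataFrame first.
    return pd.read_hdf(path, key=key, columns=columns, where=where)
```

With a file written as in the demo, `read_columns_where('c:/temp/file.h5', 'df_key', ['a', 'b'], 'a > 0.5 & a < 0.75')` should return only columns `a` and `b` for the matching rows.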

UPDATE:

If you can't change the HDF5 file, consider the following trick:

In [13]: df = pd.concat([x.query("0.5 < a < 0.75")
                         for x in pd.read_hdf('c:/temp/file.h5', 'df_key', chunksize=10)],
                        ignore_index=True)

In [14]: df
Out[14]:
           a         b         c         d         e
0   0.744123  0.515697  0.005335  0.017147  0.176254
1   0.555202  0.074128  0.874943  0.660555  0.776340
2   0.667145  0.278355  0.661728  0.705750  0.623682
3   0.701163  0.429860  0.223079  0.735633  0.476182
4   0.645130  0.302878  0.428298  0.969632  0.983690
5   0.633334  0.898632  0.881866  0.228983  0.216519
6   0.535633  0.906661  0.221823  0.608291  0.330101
7   0.715708  0.478515  0.002676  0.231314  0.075967
8   0.587762  0.262281  0.458854  0.811845  0.921100
9   0.551251  0.537855  0.906546  0.169346  0.063612
..       ...       ...       ...       ...       ...
23  0.610958  0.874373  0.785681  0.147954  0.966443
24  0.619666  0.818202  0.378740  0.416452  0.903129
25  0.500782  0.536064  0.697678  0.654602  0.054445
26  0.638659  0.518900  0.210444  0.308874  0.604929
27  0.696883  0.601130  0.402640  0.150834  0.264218
28  0.692149  0.963457  0.364050  0.152215  0.622544
29  0.737854  0.055863  0.346940  0.003907  0.678405
30  0.644924  0.840488  0.151190  0.566749  0.181861
31  0.710590  0.900474  0.061603  0.144200  0.946062
32  0.601144  0.288909  0.074561  0.615098  0.737097

[33 rows x 5 columns]
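The chunked approach above can be wrapped in a small helper that also keeps only the needed columns, so at most one chunk plus the filtered rows is held in memory at a time. This is a sketch, not a definitive implementation: `filter_in_chunks` and its parameters are hypothetical names, and as far as I know chunked reading itself still requires the key to be stored in table format, so it helps when the columns were not indexed rather than when the store is Fixed.

```python
import pandas as pd


def filter_in_chunks(path, key, query, columns=None, chunksize=10_000):
    """Filter an HDF5 table chunk by chunk with DataFrame.query."""
    parts = []
    # With chunksize set, read_hdf returns an iterator of DataFrames.
    for chunk in pd.read_hdf(path, key=key, chunksize=chunksize):
        part = chunk.query(query)      # row filter, applied in pandas
        if columns is not None:
            part = part[columns]       # keep only the requested columns
        parts.append(part)
    if not parts:                      # empty table: nothing was yielded
        return pd.DataFrame(columns=columns)
    return pd.concat(parts, ignore_index=True)
```

The filtering here happens in pandas after each chunk is read, so it is slower than a `where` clause on an indexed column, but it does not need the file to be rewritten.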

Read specific columns from an HDF5 file and pass a condition
