How do I correctly perform multi-index slicing on a Dask DataFrame?

Asked: 2019-02-07 22:25:44

Tags: pandas dask

I am trying to efficiently slice on two index levels in Dask.

I tried using .loc on the second level, but I get this error:

cmb.loc[(slice(0, 1), slice(1, 10))].compute() 
cmb.loc[(slice(0, 1), slice(1.0,20.0))].compute() # (2)

TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1] of <class 'int'>

Here is the context:

import dask.dataframe as dd
import pandas as pd
import numpy as np

def gen_start_times(nrows=10):
    # Random durations and inter-event gaps, cumulatively summed into start times
    durations = np.clip(np.random.randn(nrows) * 2 + 10, 3, 25)
    time_to_next = np.clip(np.random.randn(nrows) * 1 + 1, 0.01, 5)
    start_plus_pad = durations + time_to_next
    start_times = np.cumsum(start_plus_pad)
    return start_times, durations

channels = range(10)

def create_many_h5_files(files_to_create, nrows=1000000):
    dfs = []
    for c in channels:
        start_times, durations = gen_start_times(nrows)
        df = pd.DataFrame({'start_time': start_times,
                           'durations': durations})
        df['channel'] = c
        dfs.append(df)
    dfs_combined = pd.concat(dfs)
    dfs_combined = dfs_combined.set_index(['channel', 'start_time']).sort_index(level=0)
    for file in files_to_create:
        dfs_combined['filename'] = file
        dfs_combined.to_hdf(file, key='/main', format='table')

if __name__ == '__main__':
    to_create = [f'df_{n}.h5' for n in range(8)]
    create_many_h5_files(to_create, nrows=100000)
    cmb = dd.read_hdf(pattern='df_*.h5', key='/main')
    cmb.loc[0].head()

    # Works, but only on first index
    cmb.loc[1].compute()
    cmb.loc[1:2].compute()
    cmb.loc[slice(0,1)].compute()
    cmb.loc[(slice(0, 1))].compute()
    cmb.loc[(slice(0, 1), slice(None))].compute() # (1)

    # Errors
    cmb.loc[(slice(0, 1), slice(1, 10))].compute() 
    cmb.loc[(slice(0, 1), slice(1.0,20.0))].compute() # (2)

    # Keeps the index level, slices on first index again
    cmb.loc[1].loc[1:10].compute()

These are the actual results of (1) above:

cmb.loc[(slice(0, 1), slice(None))].compute().head()
                    durations filename
channel start_time                    
0       14.343985   11.167318  df_0.h5
        25.722012    9.012836  df_0.h5
        36.066957   10.266020  df_0.h5
        49.180045   11.974180  df_0.h5
        55.179495    5.989450  df_0.h5

I would like (2) above to give me this output:

cmb.loc[(slice(0, 1), slice(1.0,20.0))].compute().head()
                    durations filename
channel start_time                    
0       14.343985   11.167318  df_0.h5

Ideally, if Dask had an xs method that worked exactly like the one in pandas, that would immediately solve my problem:

dfs_combined.xs([slice(1, 2), slice(45, 200)],
                    level=['channel', 'start_time'])
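For what it's worth, the pandas side of that ideal call can also be written with `pd.IndexSlice` instead of `xs`. A minimal sketch on a small synthetic frame whose index names mirror the question's (the values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Small synthetic frame with the same two-level index names as the question
idx = pd.MultiIndex.from_product([range(3), [10.0, 30.0, 50.0]],
                                 names=['channel', 'start_time'])
df = pd.DataFrame({'durations': np.arange(9.0)}, index=idx)

# Slice channels 1..2 and start_time 25..55 in a single .loc call;
# label slices on a sorted MultiIndex are inclusive on both ends
sliced = df.loc[pd.IndexSlice[1:2, 25.0:55.0], :]
print(sliced)
```

Note that this relies on the MultiIndex being sorted, which `set_index(...).sort_index(level=0)` in the setup above already guarantees.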

1 Answer:

Answer 0 (score: 0):

As of February 19, 2019, Dask DataFrames do not support the pandas MultiIndex.
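Until such support lands, one possible workaround is to perform the MultiIndex slice inside each pandas partition via `map_partitions`, since the partitions read back from the HDF files appear to keep their pandas MultiIndex (as the output of (1) in the question shows). A sketch, with a hypothetical `slice_levels` helper taking `(start, stop)` tuples that are inclusive on both ends:

```python
import pandas as pd

def slice_levels(df, channel_range, time_range):
    # Hypothetical helper: plain-pandas slice on both index levels of
    # one partition; ranges are (start, stop) tuples, inclusive on both ends.
    return df.loc[pd.IndexSlice[channel_range[0]:channel_range[1],
                                time_range[0]:time_range[1]], :]

# Applied to the question's frame (not run here), this would read:
# cmb.map_partitions(slice_levels, (0, 1), (1.0, 20.0)).compute()
```

This only filters rows within each partition; any later Dask operation that relies on index alignment may still run into the same MultiIndex limitation.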