How do I create a unique index in a Dask DataFrame?

Asked: 2019-06-06 10:54:43

Tags: python dataframe dask

Suppose I have a Dask DataFrame created by read_csv.

How can I build a unique index for it?

Note:

reset_index builds a monotonically increasing index within each partition. That means partition 1 gets (0, 1, 2, 3, 4, 5, ...), partition 2 gets (0, 1, 2, 3, 4, 5, ...), partition 3 gets (0, 1, 2, 3, 4, 5, ...), and so on.

I want every row to have a unique index.
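
For example, here is a minimal sketch of what I mean (a hypothetical tiny CSV, written only so that read_csv splits it into several partitions):

import pandas as pd
import dask.dataframe as dd

# hypothetical small file; the tiny blocksize forces read_csv into several partitions
pd.DataFrame({'x': range(6)}).to_csv('example.csv', index=False)
ddf = dd.read_csv('example.csv', blocksize=10)

# the index restarts at 0 inside every partition, so it repeats once computed
print(ddf.reset_index(drop=True).compute().index.tolist())
# e.g. [0, 1, 2, 3, 0, 1] -- the exact split depends on the blocksize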

2 Answers:

Answer 0 (score: 2):

The accepted answer creates a random index, whereas the approach below creates a monotonically increasing one:

import dask.dataframe as dd
import pandas as pd

# save some data into unindexed csv
num_rows = 15
df = pd.DataFrame(range(num_rows), columns=['x'])
df.to_csv('dask_test.csv', index=False)

# read from csv
ddf = dd.read_csv('dask_test.csv', blocksize=10)

# assume that rows are already ordered (so no sorting is needed)
# then can modify the index using the lengths of partitions
cumlens = ddf.map_partitions(len).compute().cumsum()

# since processing will be done on a partition-by-partition basis, save them
# individually
new_partitions = [ddf.partitions[0]]
for npart, partition in enumerate(ddf.partitions[1:].partitions):
    partition.index = partition.index + cumlens[npart]
    new_partitions.append(partition)

# this is our new ddf
ddf = dd.concat(new_partitions)

This code is based on an answer to a different question: Process dask dataframe by chunks of rows
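
As a quick sanity check (assuming the snippet above runs as posted and the rows really were in order already), the recombined frame should now carry one globally increasing index:

# after the concat, the index should run across partition boundaries without repeating
print(ddf.compute().index.tolist())
# expected: [0, 1, 2, ..., 14]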

Answer 1 (score: 0):

Here is my approach (a function) to building a unique index with map_partitions and truly random numbers, since reset_index creates a monotonically increasing index within each partition!

import sys
import random
from dask.distributed import Client

client = Client()

def createDDF_u_idx(ddf):

    def create_u_idx(df):
        # one random prefix per partition (map_partitions calls this once per partition),
        # then a per-row counter; ids collide only if two partitions draw the same prefix
        rng = random.SystemRandom()
        p_id = str(rng.randint(0, sys.maxsize))

        df['idx'] = [p_id + 'a' + str(x) for x in range(df.index.size)]

        return df

    # meta must list all of your existing columns with their dtypes, plus the new 'idx' column
    ddf = ddf.map_partitions(lambda df: create_u_idx(df), meta={...your_prev_columns.., 'idx': 'str'})
    ddf = client.persist(ddf)  # compute up to here, keep results in memory
    ddf = ddf.set_index('idx')

    return ddf
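
A rough usage sketch, assuming a hypothetical input frame with a single integer column x, so that the meta placeholder inside the function is spelled out as meta={'x': 'int64', 'idx': 'str'}:

import pandas as pd
import dask.dataframe as dd

# hypothetical input frame with one column 'x'
ddf = dd.from_pandas(pd.DataFrame({'x': range(15)}), npartitions=3)

# each row gets an id like '<random-partition-prefix>a<row-number-within-partition>',
# and set_index('idx') then makes it the (unique) index
ddf_unique = createDDF_u_idx(ddf)
print(ddf_unique.compute().head())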