Suppose I have a Dask DataFrame that came from read_csv.
How do I build a unique index for it?
Note: reset_index builds a monotonically increasing index within each partition. This means partition 1 gets (0, 1, 2, 3, 4, 5, ...), partition 2 gets (0, 1, 2, 3, 4, 5, ...), partition 3 gets (0, 1, 2, 3, 4, 5, ...), and so on.
I want every row to have a unique index.
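For example (a minimal sketch, assuming a hypothetical data.csv that blocksize splits into several small partitions), the computed index repeats across partitions:
import dask.dataframe as dd
ddf = dd.read_csv('data.csv', blocksize=10)  # hypothetical file, several partitions
ddf = ddf.reset_index(drop=True)
print(ddf.compute().index.tolist())  # e.g. [0, 1, 2, 0, 1, 2, ...] -- not unique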
Answer 0: (score: 2)
The accepted answer creates a random index, whereas the approach below creates a monotonically increasing one:
import dask.dataframe as dd
import pandas as pd
# save some data into unindexed csv
num_rows = 15
df = pd.DataFrame(range(num_rows), columns=['x'])
df.to_csv('dask_test.csv', index=False)
# read from csv
ddf = dd.read_csv('dask_test.csv', blocksize=10)
# assume that rows are already ordered (so no sorting is needed)
# then can modify the index using the lengths of partitions
cumlens = ddf.map_partitions(len).compute().cumsum()
# since processing will be done on a partition-by-partition basis, save them
# individually
new_partitions = [ddf.partitions[0]]
for npart, partition in enumerate(ddf.partitions[1:].partitions):
    partition.index = partition.index + cumlens[npart]
    new_partitions.append(partition)
# this is our new ddf
ddf = dd.concat(new_partitions)
This code is based on an answer to a different question: Process dask dataframe by chunks of rows
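As a quick sanity check (a small sketch reusing the dask_test.csv example from above), the concatenated frame should now carry a globally unique, monotonically increasing index:
result = ddf.compute()
print(result.index.tolist())  # expected: [0, 1, 2, ..., 14]
assert result.index.is_unique
assert result.index.is_monotonic_increasing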
Answer 1: (score: 0)
Here is my approach (a function) to building a unique index with map_partitions and truly random numbers, since reset_index creates a monotonically increasing index within each partition:
import sys
import random
from dask.distributed import Client

client = Client()

def createDDF_u_idx(ddf):
    def create_u_idx(df):
        # give each partition a random prefix, then number the rows within it
        rng = random.SystemRandom()
        p_id = str(rng.randint(0, sys.maxsize))
        df['idx'] = [p_id + 'a' + str(x) for x in range(df.index.size)]
        return df
    # meta must list your existing columns plus the new 'idx' column
    ddf = ddf.map_partitions(lambda df: create_u_idx(df), meta={...your_prev_columns.., 'idx': 'str'})
    ddf = client.persist(ddf)  # compute up to here, keep results in memory
    ddf = ddf.set_index('idx')
    return ddf
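A usage sketch (hypothetical: a frame with a single integer column x, with the meta placeholder above filled in as {'x': 'i8', 'idx': 'str'} for that frame):
import pandas as pd
import dask.dataframe as dd
# build a small multi-partition frame and attach the unique random-prefix index
ddf = dd.from_pandas(pd.DataFrame({'x': range(15)}), npartitions=3)
ddf_indexed = createDDF_u_idx(ddf)
print(ddf_indexed.compute().index.is_unique)  # True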