Suppose I have a Dask DataFrame that came from read_csv.
How do I build a unique index for it?
Note: reset_index builds a monotonically increasing index within each partition. This means partition 1 gets (0, 1, 2, 3, 4, 5, ...), partition 2 gets (0, 1, 2, 3, 4, 5, ...), partition 3 gets (0, 1, 2, 3, 4, 5, ...), and so on.
I want every row to have a unique index.
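For example (a minimal sketch, assuming a hypothetical data.csv that blocksize splits into several small partitions), the computed index repeats across partitions:
import dask.dataframe as dd
ddf = dd.read_csv('data.csv', blocksize=10)  # hypothetical file, several partitions
ddf = ddf.reset_index(drop=True)
print(ddf.compute().index.tolist())  # e.g. [0, 1, 2, 0, 1, 2, ...] -- not unique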
Answer 0: (score: 2)
The accepted answer creates a random index, whereas the approach below creates a monotonically increasing one:
import dask.dataframe as dd
import pandas as pd
# save some data into unindexed csv
num_rows = 15
df = pd.DataFrame(range(num_rows), columns=['x'])
df.to_csv('dask_test.csv', index=False)
# read from csv
ddf = dd.read_csv('dask_test.csv', blocksize=10)
# assume that rows are already ordered (so no sorting is needed)
# then can modify the index using the lengths of partitions
cumlens = ddf.map_partitions(len).compute().cumsum()
# since processing will be done on a partition-by-partition basis, save them
# individually
new_partitions = [ddf.partitions[0]]
for npart, partition in enumerate(ddf.partitions[1:].partitions):
    partition.index = partition.index + cumlens[npart]
    new_partitions.append(partition)
# this is our new ddf
ddf = dd.concat(new_partitions)
This code is based on an answer to a different question: Process dask dataframe by chunks of rows
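As a quick sanity check (a small sketch reusing the dask_test.csv example from above), the concatenated frame should now carry a globally unique, monotonically increasing index:
result = ddf.compute()
print(result.index.tolist())  # expected: [0, 1, 2, ..., 14]
assert result.index.is_unique
assert result.index.is_monotonic_increasing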
Answer 1: (score: 0)
Here is my approach (a function) to building a unique index with map_partitions and truly random numbers, since reset_index creates a monotonically increasing index within each partition:
import sys
import random
from dask.distributed import Client

client = Client()

def createDDF_u_idx(ddf):
    def create_u_idx(df):
        # give each partition a random prefix, then number the rows within it
        rng = random.SystemRandom()
        p_id = str(rng.randint(0, sys.maxsize))
        df['idx'] = [p_id + 'a' + str(x) for x in range(df.index.size)]
        return df
    # meta must list your existing columns plus the new 'idx' column
    ddf = ddf.map_partitions(lambda df: create_u_idx(df), meta={...your_prev_columns.., 'idx': 'str'})
    ddf = client.persist(ddf)  # compute up to here, keep results in memory
    ddf = ddf.set_index('idx')
    return ddf
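A usage sketch (hypothetical: a frame with a single integer column x, with the meta placeholder above filled in as {'x': 'i8', 'idx': 'str'} for that frame):
import pandas as pd
import dask.dataframe as dd
# build a small multi-partition frame and attach the unique random-prefix index
ddf = dd.from_pandas(pd.DataFrame({'x': range(15)}), npartitions=3)
ddf_indexed = createDDF_u_idx(ddf)
print(ddf_indexed.compute().index.is_unique)  # True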