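I am using Dask to load an 11-million-row CSV into a dataframe and perform calculations. I have reached the point where I need conditional logic: if this, then that, else other.
For example, if I were using pandas, I could do the following, which uses a numpy select statement with an array of conditions and an array of results. The statement takes about 35 seconds to run; not terrible, but not great: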
df["AndHeathSolRadFact"] = np.select(
    [
        (df['Month'].between(8,12)),
        #The comparison needs its own parentheses because & binds tighter than >
        (df['Month'].between(1,2) & (df['CloudCover']>30)) #Array of CONDITIONS
    ], #list of conditions
    [1, 1], #Array of RESULTS (must match conditions)
    default=0) #DEFAULT if no match
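What I would like to do is perform this operation with Dask, on a Dask dataframe, without first converting my Dask dataframe to a pandas dataframe and then back again. That would let me:
- use multithreading
- use dataframes larger than available memory
- possibly get the result faster.
Sample CSV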
Location,Date,Temperature,RH,WindDir,WindSpeed,DroughtFactor,Curing,CloudCover
1075,2019-20-09 04:00,6.8,99.3,143.9,5.6,10.0,93.0,1.0
1075,2019-20-09 05:00,6.4,100.0,93.6,7.2,10.0,93.0,1.0
1075,2019-20-09 06:00,6.7,99.3,130.3,6.9,10.0,93.0,1.0
1075,2019-20-09 07:00,8.6,95.4,68.5,6.3,10.0,93.0,1.0
1075,2019-20-09 08:00,12.2,76.0,86.4,6.1,10.0,93.0,1.0
Full code for a minimal viable sample
import dask.dataframe as dd  # Dask dataframes implement the pandas API
import dask.multiprocessing
import dask.threaded
import pandas as pd
import numpy as np
from timeit import default_timer as timer
start = timer()
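ddf = dd.read_csv(r'C:\Users\i5-Desktop\Downloads\Weathergrids.csv')
#Convert to a pandas dataframe; this is the round trip I would like to avoid.
#The np.select statement shown earlier would run on df at this point.
df = ddf.compute()
#Convert back to a Dask dataframe because we want that juicy parallelism
ddf2 = dd.from_pandas(df,npartitions=4)
del df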
print(ddf2.head())
#print(ddf.tail())
end = timer()
print(end - start)
#Clean up remaining dataframes
del ddf2
Answer 0 (score: 0)
It sounds like you are looking for dd.Series.where.
Answer 1 (score: 0)
So, this is the answer I came up with that performed best:
#Create a helper column where we store the value we want to set the column to later.
ddf['Helper'] = 1
#Create the column where we will be setting values, and give it a default value
ddf['AndHeathSolRadFact'] = 0
#Break the logic out into separate where clauses. Rather than looping we will be selecting those rows
#where the conditions are met and then set the value we want. We are required to use the helper
#column value because we cannot set values directly, but we can match from another column.
#First, a very simple clause. If Temperature is greater than or equal to 8, make
#AndHeathSolRadFact equal to the value in Helper
#Note that at the end, after the comma, we preserve the existing cell value if the condition is not met
ddf['AndHeathSolRadFact'] = (ddf.Helper).where(ddf.Temperature >= 8, ddf.AndHeathSolRadFact)
#A more complex example
#this is the same as the above, but demonstrates how to use a compound select statement where
#we evaluate multiple conditions and then set the value.
ddf['AndHeathSolRadFact'] = (ddf.Helper).where(((ddf.Temperature == 6.8) & (ddf.RH == 99.3)), ddf.AndHeathSolRadFact)
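I am new to this, but I believe this approach counts as vectorised. It makes full use of arrays and evaluates very quickly. Adding the new column, filling it with 0, evaluating the two select statements and replacing the values in the target rows added only 0.2s to the processing time for the 11m-row dataset with npartitions = 4.
Previously, a similar approach in pandas took around 45 seconds.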
The only thing left to do is to remove the helper column once I am done. At the moment, I am not sure how to do that.