I've been working through the Dask concurrent.futures documentation, and I've run into some trouble with the (outdated) Random Forest example. Specifically, slicing the dask dataframe into test/train splits with positional indexing:
train = dfs.loc[:-1]
test = dfs.loc[-1]
which gave me the error:
KeyError Traceback (most recent call last)
/opt/anaconda/lib/python3.5/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2524 try:
-> 2525 return self._engine.get_loc(key)
2526 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-21-fff88783d91d> in <module>()
7 test = dfs.loc[-1]
8
----> 9 estimators = c.map(fit, train)
10 progress(estimators, complete=False)
/opt/anaconda/lib/python3.5/site-packages/distributed/client.py in map(self, func, *iterables, **kwargs)
1243 raise ValueError("Only use allow_other_workers= if using workers=")
1244
-> 1245 iterables = list(zip(*zip(*iterables)))
1246 if isinstance(key, list):
1247 keys = key
/opt/anaconda/lib/python3.5/site-packages/dask/dataframe/core.py in __getitem__(self, key)
2284
2285 # error is raised from pandas
-> 2286 meta = self._meta[_extract_meta(key)]
2287 dsk = dict(((name, i), (operator.getitem, (self._name, i), key))
2288 for i in range(self.npartitions))
/opt/anaconda/lib/python3.5/site-packages/pandas/core/frame.py in __getitem__(self, key)
2137 return self._getitem_multilevel(key)
2138 else:
-> 2139 return self._getitem_column(key)
2140
2141 def _getitem_column(self, key):
/opt/anaconda/lib/python3.5/site-packages/pandas/core/frame.py in _getitem_column(self, key)
2144 # get column
2145 if self.columns.is_unique:
-> 2146 return self._get_item_cache(key)
2147
2148 # duplicate columns & possible reduce dimensionality
/opt/anaconda/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
1840 res = cache.get(item)
1841 if res is None:
-> 1842 values = self._data.get(item)
1843 res = self._box_item_values(item, values)
1844 cache[item] = res
/opt/anaconda/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath)
3841
3842 if not isna(item):
-> 3843 loc = self.items.get_loc(item)
3844 else:
3845 indexer = np.arange(len(self.items))[isna(self.items)]
/opt/anaconda/lib/python3.5/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2525 return self._engine.get_loc(key)
2526 except KeyError:
-> 2527 return self._engine.get_loc(self._maybe_cast_indexer(key))
2528
2529 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
To no avail, I have also tried the original code from the blog post:
from distributed import Executor, s3
e = Executor('52.91.1.177:8786')
dfs = s3.read_csv('dask-data/nyc-taxi/2015',
parse_dates=['tpep_pickup_datetime',
'tpep_dropoff_datetime'],
collection=False)
dfs = e.compute(dfs)
progress(dfs)
What is the correct way to use positional indexing in Dask, and what is the right way to slice a dask dataframe into test/train splits for the random forest example?
A similar, unanswered question: What is the equivalent to iloc for dask dataframe?
Edit: creating the raw list of futures pointing to Pandas dataframes failed:
ImportError Traceback (most recent call last)
<ipython-input-3-25aea53688ef> in <module>()
----> 1 from distributed import s3
2
3 dfs = s3.read_csv('dask-data/nyc-taxi/2015',
4 parse_dates=['tpep_pickup_datetime',
5 'tpep_dropoff_datetime'],
ImportError: cannot import name 's3'
After that error, I tried:
import dask.dataframe as dd
dfs = dd.read_csv('s3://dask-data/nyc-taxi/2015/*.csv',
parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
storage_options={'anon': True})
dfs = c.persist(dfs)
progress(dfs)
I did not realize that this produces a dask dataframe rather than a list of futures pointing to Pandas dataframes; calling map over it is what is now causing my indexing problem. How should I modify this to read in a list of futures pointing to Pandas dataframes from the S3 bucket, as in the blog post?
Answer 0 (score: 2)
I recommend following the documentation rather than blog posts. Old blog posts quickly fall out of date; the documentation is kept up to date.
In the code from the blog post, dfs is a list of futures pointing to Pandas dataframes, not a dask dataframe, so plain Python list slicing works:
train = dfs[:-1]
test = dfs[-1]
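That list-of-futures pattern can be sketched with the standard library alone. Below is a minimal analogy using concurrent.futures rather than distributed; load_partition is a hypothetical stand-in for reading one CSV partition into a dataframe:

```python
from concurrent.futures import ThreadPoolExecutor

def load_partition(i):
    # hypothetical stand-in for reading one CSV partition into a dataframe
    return list(range(i * 3, i * 3 + 3))

with ThreadPoolExecutor(max_workers=4) as pool:
    dfs = [pool.submit(load_partition, i) for i in range(4)]  # a plain list of futures

train = dfs[:-1]          # ordinary list slicing, exactly as in the blog post
test = dfs[-1]
fitted = [f.result() for f in train]
print(fitted)             # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(test.result())      # [9, 10, 11]
```

The point is that slicing happens on the Python list holding the futures, not on any dataframe, which is why it raises no indexing error.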
If you are looking to do a train/test split on a dask dataframe, I recommend the random_split method instead.
Positional indexing is not supported on dask dataframes, and is unlikely to be in the moderately near future.
Value-based indexing (.loc) is supported if you have a meaningful index and known divisions; see the divisions docs.