How to perform positional indexing in Python Dask dataframes

Date: 2018-02-14 04:46:25

Tags: python pandas dataframe dask

I have been working through the Dask Concurrent.futures documentation and am having some trouble with the (outdated) Random Forest example. Specifically, using positional indexing to slice the dask dataframe into a test/train split:

train = dfs.loc[:-1]
test = dfs.loc[-1]

gives me the error:

KeyError                                  Traceback (most recent call last)
/opt/anaconda/lib/python3.5/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2524             try:
-> 2525                 return self._engine.get_loc(key)
   2526             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-21-fff88783d91d> in <module>()
      7 test = dfs.loc[-1]
      8 
----> 9 estimators = c.map(fit, train)
     10 progress(estimators, complete=False)

/opt/anaconda/lib/python3.5/site-packages/distributed/client.py in map(self, func, *iterables, **kwargs)
   1243             raise ValueError("Only use allow_other_workers= if using workers=")
   1244 
-> 1245         iterables = list(zip(*zip(*iterables)))
   1246         if isinstance(key, list):
   1247             keys = key

/opt/anaconda/lib/python3.5/site-packages/dask/dataframe/core.py in __getitem__(self, key)
   2284 
   2285             # error is raised from pandas
-> 2286             meta = self._meta[_extract_meta(key)]
   2287             dsk = dict(((name, i), (operator.getitem, (self._name, i), key))
   2288                        for i in range(self.npartitions))

/opt/anaconda/lib/python3.5/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2137             return self._getitem_multilevel(key)
   2138         else:
-> 2139             return self._getitem_column(key)
   2140 
   2141     def _getitem_column(self, key):

/opt/anaconda/lib/python3.5/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2144         # get column
   2145         if self.columns.is_unique:
-> 2146             return self._get_item_cache(key)
   2147 
   2148         # duplicate columns & possible reduce dimensionality

/opt/anaconda/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1840         res = cache.get(item)
   1841         if res is None:
-> 1842             values = self._data.get(item)
   1843             res = self._box_item_values(item, values)
   1844             cache[item] = res

/opt/anaconda/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3841 
   3842             if not isna(item):
-> 3843                 loc = self.items.get_loc(item)
   3844             else:
   3845                 indexer = np.arange(len(self.items))[isna(self.items)]

/opt/anaconda/lib/python3.5/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2525                 return self._engine.get_loc(key)
   2526             except KeyError:
-> 2527                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2528 
   2529         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

To no avail. For reference, dfs was created following the setup code in the blog post:

from distributed import Executor, s3
e = Executor('52.91.1.177:8786')

dfs = s3.read_csv('dask-data/nyc-taxi/2015',
                  parse_dates=['tpep_pickup_datetime',
                               'tpep_dropoff_datetime'],
                  collection=False)
dfs = e.compute(dfs)
progress(dfs)

What is the correct way to use positional indexing in Dask, and what is the correct way to slice the dataframe into a test/train split in the Random Forest example?

A similar, unanswered question: What is the equivalent to iloc for dask dataframe?

EDIT: Creating the original list of futures pointing to Pandas dataframes fails:

ImportError                               Traceback (most recent call last)
<ipython-input-3-25aea53688ef> in <module>()
----> 1 from distributed import s3
      2 
      3 dfs = s3.read_csv('dask-data/nyc-taxi/2015', 
      4                   parse_dates=['tpep_pickup_datetime', 
      5                                'tpep_dropoff_datetime'],

ImportError: cannot import name 's3'

Since that import throws the error above, I instead created dfs with:

import dask.dataframe as dd

dfs = dd.read_csv('s3://dask-data/nyc-taxi/2015/*.csv', 
                 parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
                 storage_options={'anon': True})
dfs = c.persist(dfs)
progress(dfs)

Not realizing that this produces a dask dataframe rather than a list of futures pointing to Pandas dataframes, I then tried to slice it and map over it as in the blog post, which is what is now causing the indexing problem. How should I amend this to read a list of futures pointing to Pandas dataframes from the S3 bucket, as described in the blog post?

1 Answer:

Answer 0 (score: 2)

I recommend reading the documentation rather than blog posts. Old blog posts are likely to become outdated quickly; the documentation is kept up to date.

In the code from the blog post, dfs is a list of futures pointing to dataframes, not a dask dataframe, which is why plain list slicing works there:

train = dfs[:-1]
test = dfs[-1]

If you are looking to do a train/test split, then I recommend using the random_split method.

Positional indexing is not supported, nor is it likely to be in the near future.

Value-based indexing (.loc) is supported if you have a sensible index and known divisions; see the divisions docs.