Question

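I'm stuck on an efficiency problem in Python. I have a large set of time series data that is read into a generator. As it stands, each time the generator yields, I get the data back one item at a time. This works fine when every time series has the same index, that is, each one starts on the same date and ends on the same date. The problem arises when I have a set of time series that share the same end date but not the same start date.

What is the best way to implement this so that, when I query, it returns the values for that particular date? That way I wouldn't have to worry about start dates. It's like a point-in-time lookup.

I'm using pandas, and so far I have no idea how to implement this efficiently.

The code I use to import the CSV files, one file at a time: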

def _open_convert_csv_files(self):
    # Requires: import os; import pandas as pd

    comb_index = None
    for s in self.symbol_list:
        print(s)
        # Load the CSV file, replacing its header row with the names below,
        # indexed on date
        self.symbol_data[s] = pd.read_csv(
            os.path.join(self.csv_dir, '%s.csv' % s),
            header=0, index_col=0, parse_dates=True,
            names=['Date', 'Open', 'High', 'Low', 'Close', 'Total Volume']
        ).sort_index()

        # Combine the index to pad forward values.
        # Index.union returns a new index, so the result must be assigned back.
        if comb_index is None:
            comb_index = self.symbol_data[s].index
        else:
            comb_index = comb_index.union(self.symbol_data[s].index)

        # Set the latest symbol_data to an empty list
        self.latest_symbol_data[s] = []

    print('')
    # Reindex the dataframes
    for s in self.symbol_list:
        self.symbol_data[s] = self.symbol_data[s].reindex(index=comb_index, method='pad').iterrows()

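As you can see, self.symbol_data[s] works fine when the time series share the same start date, but when they don't, it breaks down during the simulation, where I loop over every symbol to fetch its data. Put another way, I need to think in terms of cross-sectional price data for each date in the iteration.

I'd love to hear what others have done to achieve this.

As far as I know, we could line all the series up so that their dates match and then loop row by row, but with 100k different securities that is slow and memory-hungry. Also, each CSV file has several columns, not just one...

Thanks,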

Date      Open         High         Low          Close        Total Volume
19991118  29.69620186  32.63318885  26.10655108  28.71720619  685497
19991119  28.02375093  28.06454241  25.98417662  26.3513      166963
19991122  26.96317229  28.71720619  26.14734257  28.71720619  72092
19991123  27.73821052  28.47245727  26.10655108  26.10655108  65492
19991124  26.18813405  27.37108715  26.10655108  26.80000634  53081
19991126  26.67763189  27.08554675  26.59604891  26.88158932  18955

Answer 1

Let's start with this:

pd.read_csv(file_path, parse_dates=True, index_col=0)
                 Open       High        Low      Close  Total Volume
Date                                                                
1999-11-18  29.696202  32.633189  26.106551  28.717206        685497
1999-11-19  28.023751  28.064542  25.984177  26.351300        166963
1999-11-22  26.963172  28.717206  26.147343  28.717206         72092
1999-11-23  27.738211  28.472457  26.106551  26.106551         65492
1999-11-24  26.188134  27.371087  26.106551  26.800006         53081
1999-11-26  26.677632  27.085547  26.596049  26.881589         18955

Isn't this enough for your needs?
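If the goal is a point-in-time lookup on a single series, a date-indexed frame like the one above already supports it. Here is a minimal sketch, assuming a few rows from the question are inlined rather than read from a file. The names `csv_text` and `value_as_of` are made up for illustration and are not from the original post.

```python
import io
import pandas as pd

# Assumption: a few rows from the question, inlined so the sketch runs without files.
csv_text = """Date,Open,High,Low,Close,Total Volume
19991118,29.69620186,32.63318885,26.10655108,28.71720619,685497
19991119,28.02375093,28.06454241,25.98417662,26.3513,166963
19991122,26.96317229,28.71720619,26.14734257,28.71720619,72092
"""

# Read dates as strings first, then parse them with an explicit format.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Date": str})
df["Date"] = pd.to_datetime(df["Date"], format="%Y%m%d")
df = df.set_index("Date").sort_index()

def value_as_of(frame, date):
    """Return the last row at or before `date`, or None if the series has not started yet."""
    history = frame.loc[:pd.Timestamp(date)]
    return None if history.empty else history.iloc[-1]

row = value_as_of(df, "1999-11-20")    # no bar on this date: falls back to 1999-11-19
early = value_as_of(df, "1999-11-01")  # before the first bar: None
```

The slice relies on the index being sorted, which is why `sort_index()` is called first. A series that has not started yet simply returns None, so the caller never needs to know the start date.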

Answer 2

Suppose your data is laid out as in the question, one CSV file per symbol.

You can now load all of it into separate DataFrames, adding a Symbol column to identify each one:

In [54]: symbol_list = ['aa1', 'aa2']

In [55]: result = []

In [56]: for symbol in symbol_list:
   ....:     data = pd.read_csv(symbol + '.csv', parse_dates=True)
   ....:     data['Symbol'] = symbol
   ....:     result.append(data)
   ....:

In [57]: combined = pd.concat(result).pivot_table(
   ....:     index='Date',
   ....:     columns='Symbol',
   ....:     values=['Open', 'High', 'Low', 'Close', 'Total Volume']
   ....: ).ffill().reorder_levels([1, 0], axis=1)

In [58]: combined
Out[58]:
Symbol      aa1  aa2  aa1  aa2 aa1 aa2   aa1   aa2          aa1          aa2
           Open Open High High Low Low Close Close Total Volume Total Volume
Date
1999-11-18   29   50   32   51  26  49    30    50        10000         9000
1999-11-19   29   50   32   52  26  48    30    50        10000         8000
1999-11-20   30   50   33   52  27  48    31    50         9000         8000
1999-11-21   30   50   33   53  27  47    31    50         9000         7000
1999-11-22   31   50   34   53  28  47    32    50         8000         7000

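The per-symbol frames are concatenated into a single DataFrame, which makes them easy to manipulate. The code above builds a pivot table, forward-fills the missing values, and reorders the column levels for convenience.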
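To see how this handles series with different start dates, here is a small sketch with two made-up symbols, where 'aa2' starts one day later than 'aa1'. The data values are assumptions for illustration only; the pipeline is the same concat, pivot_table, ffill, reorder_levels chain.

```python
import pandas as pd

# Assumption: two made-up symbols; 'aa2' starts one day later than 'aa1'.
aa1 = pd.DataFrame({
    "Date": pd.to_datetime(["1999-11-18", "1999-11-19", "1999-11-22"]),
    "Close": [30.0, 31.0, 32.0],
    "Symbol": "aa1",
})
aa2 = pd.DataFrame({
    "Date": pd.to_datetime(["1999-11-19", "1999-11-22"]),
    "Close": [50.0, 51.0],
    "Symbol": "aa2",
})

combined = (
    pd.concat([aa1, aa2])
    .pivot_table(index="Date", columns="Symbol", values=["Close"])
    .ffill()
    .reorder_levels([1, 0], axis=1)
)

# Cross-section for one date: one value per (Symbol, field) pair.
snapshot = combined.loc[pd.Timestamp("1999-11-18")]

# Symbols that have not started yet are NaN on that date and can be dropped.
available = snapshot.dropna()
```

Forward-filling only carries values forward in time, so dates before a symbol's first bar stay NaN. That makes "has this symbol started yet?" a simple `dropna()` on the date's row.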

The .pivot_table() operation automatically creates the combined index for you, and you can also compare metrics in your code (for example, Close between symbols).
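As a rough illustration of comparing one metric across symbols, the sketch below builds a small frame by hand with (Symbol, field) columns, mimicking the shape of `combined`, and pulls out Close for every symbol. The numbers and the `Field` level name are assumptions for this example.

```python
import pandas as pd

# Assumption: a small hand-built frame with (Symbol, Field) columns.
columns = pd.MultiIndex.from_product(
    [["aa1", "aa2"], ["Open", "Close"]], names=["Symbol", "Field"]
)
dates = pd.to_datetime(["1999-11-18", "1999-11-19", "1999-11-22"])
combined = pd.DataFrame(
    [
        [29.0, 30.0, 50.0, 50.0],
        [30.0, 31.0, 50.0, 49.0],
        [31.0, 32.0, 49.0, 51.0],
    ],
    index=dates,
    columns=columns,
)

# Select the Close field from the second column level: one column per symbol.
closes = combined.xs("Close", axis=1, level=1)

# Compare the metric between symbols, date by date.
spread = closes["aa1"] - closes["aa2"]
```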

Since the alignment itself involves no explicit row-by-row loop, it should be reasonably efficient.
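If the simulation still needs to consume the data one date at a time, as the generator in the question does, the combined frame can be walked date by date. This is only a sketch and not part of the original answer: `iter_cross_sections` is a hypothetical helper, and the sample frame is hand-built with made-up values.

```python
import pandas as pd

def iter_cross_sections(combined):
    """Yield (date, frame) pairs; each frame has one row per symbol with data on that date."""
    for date, row in combined.iterrows():
        # row is indexed by (Symbol, Field); unstack turns the fields into columns.
        yield date, row.unstack().dropna(how="all")

# Assumption: 'aa2' has no data on the first date.
nan = float("nan")
columns = pd.MultiIndex.from_product(
    [["aa1", "aa2"], ["Open", "Close"]], names=["Symbol", "Field"]
)
dates = pd.to_datetime(["1999-11-18", "1999-11-19"])
combined = pd.DataFrame(
    [[29.0, 30.0, nan, nan], [30.0, 31.0, 50.0, 49.0]],
    index=dates,
    columns=columns,
)

sections = dict(iter_cross_sections(combined))
first = sections[pd.Timestamp("1999-11-18")]   # only 'aa1' has started
second = sections[pd.Timestamp("1999-11-19")]  # both symbols present
```

The loop here runs once per date rather than once per symbol per date, so the per-symbol alignment work is still done by pandas up front.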

Efficient time series data extraction
