How to extract a specific column from multiple CSV files using pandas

Asked: 2018-10-09 23:18:36

Tags: python pandas

I am trying to extract the Adj Close column from multiple CSV files.

Sample CSV file (the same content can be copied into aapl.csv, msft.csv, and hcp.csv):

Date      Open    High    Low     Close   Volume    Adj Close
10/14/08  116.26  116.4   103.14  104.08  70749800  104.08
10/13/08  104.55  110.53  101.02  110.26  54967000  110.26
10/10/08  85.7    100     85      96.8    79260700  96.8
10/9/08   93.35   95.8    86.6    88.74   57763700  88.74
10/8/08   85.91   96.33   85.68   89.79   78847900  89.79
10/7/08   100.48  101.5   88.95   89.16   67099000  89.16
10/6/08   91.96   98.78   87.54   98.14   75264900  98.14
10/3/08   104     106.5   94.65   97.07   81942800  97.07
10/2/08   108.01  108.79  100     100.1   57477300  100.1
10/1/08   111.92  112.36  107.39  109.12  46303000  109.12
9/30/08   108.25  115     106.3   113.66  58095800  113.66
9/29/08   119.62  119.68  100.59  105.26  93581400  105.26
9/26/08   124.91  129.8   123     128.24  40208700  128.24
9/25/08   129.8   134.79  128.52  131.93  35865600  131.93
9/24/08   127.27  130.95  125.15  128.71  37393400  128.71
9/23/08   131.85  135.8   126.66  126.84  45727300  126.84

My code is:

import pandas as pd
def test_run():
    start_date = '2008-10-01'
    end_date = '2008-10-09'
    dates = pd.date_range(start_date, end_date)
    df1 = pd.DataFrame(index=dates)
    dfSPY = pd.read_csv(
        'aapl.csv',
        index_col='Date',
        parse_dates=True,
        usecols=['Date', 'Adj Close'],
        na_values=['nan'])
    df1 = df1.join(dfSPY, how='inner')
    df1 = df1.rename(columns={'Adj Close':'SPY'})
    symbols = ['aapl', 'msft', 'hcp']
    for sym in symbols:
        df_temp = pd.read_csv(
            '{}.csv'.format(sym),
            index_col='Date',
            parse_dates=True,
            usecols=['Date', 'Adj Close'],
            na_values=['nan'])
        df_temp = df_temp.rename(columns={'Adj Close':sym})
        df1 = df1.join(df_temp, how='left')
    print(df1)
if __name__ == "__main__":
    test_run()

Running it produces this error:

Traceback (most recent call last):
  File "1.py", line 27, in <module>
    test_run()
  File "1.py", line 22, in test_run
    na_values=['nan'])
  File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1749, in __init__
    _validate_usecols_names(usecols, self.orig_names)
  File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1134, in _validate_usecols_names
    "columns expected but not found: {missing}".format(missing=missing)
ValueError: Usecols do not match columns, columns expected but not found: ['Adj Close']

I have consulted several links but could not figure out what I am missing here. Thanks in advance.

1 Answer:

Answer 0: (score: 1)

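Looking at the stack trace (thanks for posting the full trace), you'll notice that the function is called on line 27 (the test_run() at the bottom of your code). However, the error originates 5 lines earlier, on line 22. That line belongs to the initial assignment of df_temp, which is the *second* place pd.read_csv is called. Because the first call, for dfSPY, worked with nearly identical arguments, the file format of one or more of the securities must be different. Possibly one of your files has no "Adj Close" field, or there is whitespace around the field that needs to be trimmed.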
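
One way to check both possibilities is to read only the header row of each file and see which expected names are absent, then strip whitespace from the column names before selecting. This is a minimal sketch, not a confirmed fix for your files: the helper names find_missing_columns and read_adj_close are made up for illustration, the in-memory sample (with a deliberately padded "Adj Close " header) stands in for a file on disk, and the date format is assumed from the sample rows in the question.

```python
import io

import pandas as pd

REQUIRED = ("Date", "Adj Close")


def find_missing_columns(source, required=REQUIRED):
    """Return the required names that are absent from the CSV header."""
    header = pd.read_csv(source, nrows=0)  # parses the header row only
    return [name for name in required if name not in header.columns]


def read_adj_close(source, label):
    """Read Date and Adj Close, tolerating whitespace around header names."""
    df = pd.read_csv(source)
    df.columns = df.columns.str.strip()  # "Adj Close " -> "Adj Close"
    # Assumption: dates look like 10/14/08, as in the sample in the question.
    df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%y")
    return df.set_index("Date")[["Adj Close"]].rename(columns={"Adj Close": label})


# Hypothetical file content: note the trailing space after "Adj Close".
sample = (
    "Date,Open,Adj Close \n"
    "10/14/08,116.26,104.08\n"
    "10/13/08,104.55,110.26\n"
)

# The padded header is reported as missing by an exact-name check...
print(find_missing_columns(io.StringIO(sample)))
# ...but the column can still be read once the names are stripped.
print(read_adj_close(io.StringIO(sample), "aapl"))
```

With real files you would pass the path instead, for example find_missing_columns('msft.csv'), for each symbol in your list. If a file reports 'Adj Close' as missing even after stripping, the column is probably absent or named differently in that file.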