Sorting CSV data by header but getting IndexError

Time: 2018-02-27 03:44:59

Tags: python pandas csv numpy

When I try to select the headers/columns of the data I want to use, my code fails with an IndexError while it is parsing the headers.

import pandas as pd
import quandl
import math, datetime
import numpy as np
from sklearn import preprocessing , cross_validation, svm
from sklearn.linear_model import LinearRegression
import scipy
import matplotlib.pyplot as plt
from matplotlib import style
import pickle

style.use('ggplot')
df = pd.read_csv('convertcsv.csv',sep='\t')

df = np.array(df)

print(df)


df = df[['Open','High','Low','Close','Volume (BTC)']]
print("ok")

df['HL_PCT'] = (df['High'] - df['Close']) / df['Close'] * 100.0
df['PCT_change'] = (df['Close'] - df['Open']) / df['Open'] * 100.0

df = df[['Close','HL_PCT','PCT_change','Volume (BTC)']]

forecast_col = 'Close'
df.fillna(-999999, inplace=True)

forecast_out = int(math.ceil(0.01*len(df)))


df['label'] = df[forecast_col].shift(-forecast_out)



X = np.array(df.drop(['label'],1))
X = preprocessing.scale(X)
X_lately = X[-forecast_out:]
X = X[:-forecast_out:]


df.dropna(inplace=True)
y = np.array(df['label'])




X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=0.2)


clf = LinearRegression(n_jobs=-1)
clf.fit(X_train, y_train)
with open('linearregression.pickle','wb') as f:
    pickle.dump(clf, f)

pickle_in = open('linearregression.pickle','rb')
clf =pickle.load(pickle_in)


accuracy = clf.score(X_test,y_test)
print(accuracy)


forecast_set = clf.predict(X_lately)




df['Forecast'] = np.nan

last_date = df.iloc[-1].name

last_unix = last_date.timestamp()
one_day = 86400
next_unix = last_unix + one_day

for i in forecast_set:
    next_date = datetime.datetime.fromtimestamp(next_unix)
    next_unix += one_day
    df.loc[next_date] = [np.nan for _ in range(len(df.columns)-1)] + [i]


df['Close'].plot()
df['Forecast'].plot()
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Price')
plt.pause(1)
plt.show()
print("we done?")
...

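I can't figure out what I'm doing wrong. The same code worked with the dataset I used before. In case it helps, here is the format of the CSV file I'm pulling from: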

Timestamp,Open,High,Low,Close,Volume (BTC),Volume (Currency),Weighted Price
2017-09-30 00:00:00,4162.04,4177.63,4154.28,4176.08,114.81,478389.12,4166.96
2017-09-30 01:00:00,4170.84,4224.6,4170.84,4208.14,348.45,1463989.18,4201.4

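I'm not very familiar with this kind of thing. I tried to find other people with the same error, but everyone seemed to have a different problem. I can include more data if needed.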

1 answer:

Answer 0 (score: 3)

You are converting the DataFrame to a NumPy array with df = np.array(df).

Don't expect a NumPy array to behave like a pandas DataFrame.
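
To make the difference concrete, here is a small sketch. The two-row frame is hypothetical (built from the sample values in the question, not read from the real file); it only illustrates that a list of column labels works on a DataFrame but raises IndexError on the array produced by np.array(df).

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame using the column names from the question.
df = pd.DataFrame({
    "Open": [4162.04, 4170.84],
    "High": [4177.63, 4224.60],
    "Low": [4154.28, 4170.84],
    "Close": [4176.08, 4208.14],
    "Volume (BTC)": [114.81, 348.45],
})

cols = ["Open", "High", "Low", "Close", "Volume (BTC)"]

# On a DataFrame, a list of labels selects those columns.
selected = df[cols]

# After the conversion the column labels are gone; only positional
# (integer, slice, boolean) indexing is available on the array.
arr = np.array(df)
try:
    arr[cols]
    error_name = None
except IndexError as exc:
    error_name = type(exc).__name__
    print(error_name, exc)
```

The array keeps the values but not the header, which is why the label-based selection in the question fails right after the conversion.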

Remove this line:

df = np.array(df)

Then you should be able to slice the DataFrame by column name:
df = df[['Open','High','Low','Close','Volume (BTC)']]
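
Putting it together, a minimal sketch of the loading step without the conversion might look like the following. It reads the two sample rows from the question through io.StringIO as a stand-in for convertcsv.csv. Two things here are assumptions and not part of the original answer. First, the sample shown is comma-separated, so the sketch uses the default separator in place of sep='\t'; with a tab separator, comma-delimited text would load as a single column and the selection would then fail with a KeyError. Second, index_col='Timestamp' with parse_dates=True is used so the index holds timestamps, which the later last_date.timestamp() call in the script appears to rely on.

```python
import io

import pandas as pd

# The two sample rows from the question, standing in for convertcsv.csv.
sample = io.StringIO(
    "Timestamp,Open,High,Low,Close,Volume (BTC),Volume (Currency),Weighted Price\n"
    "2017-09-30 00:00:00,4162.04,4177.63,4154.28,4176.08,114.81,478389.12,4166.96\n"
    "2017-09-30 01:00:00,4170.84,4224.6,4170.84,4208.14,348.45,1463989.18,4201.4\n"
)

# Assumption: the file is comma-delimited, so the default separator is used.
# Assumption: the Timestamp column becomes a parsed datetime index.
df = pd.read_csv(sample, index_col="Timestamp", parse_dates=True)

# No np.array(df) here: df stays a DataFrame, so label-based selection works.
df = df[["Open", "High", "Low", "Close", "Volume (BTC)"]]
df["HL_PCT"] = (df["High"] - df["Close"]) / df["Close"] * 100.0

print(df.columns.tolist())
```

If the real file does turn out to be tab-delimited, keep sep='\t' and drop only the np.array line.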