Trouble reading NCES IPEDS CSV files with pandas

Time: 2018-12-12 02:09:23

Tags: pandas csv unicode

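I'm having trouble downloading and reading CSV files provided by the U.S. Department of Education's National Center for Education Statistics. Below is code that should run for anyone interested in helping me troubleshoot.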

import requests, zipfile, io
import pandas as pd

# First example shows that the code can work. Works fine on years 2005
# and earlier.
url = 'https://nces.ed.gov/ipeds/datacenter/data/HD2005_Data_Stata.zip'
r_zip_file_2005 = requests.get(url, stream=True)
z_zip_file_2005 = zipfile.ZipFile(io.BytesIO(r_zip_file_2005.content))
z_zip_file_2005.extractall('.')
csv_2005_df = pd.read_csv('hd2005_data_stata.csv')

# Second example shows that something changed in the CSV files after
# 2005 (or seems to have changed).
url = 'https://nces.ed.gov/ipeds/datacenter/data/HD2006_Data_Stata.zip'
r_zip_file_2006 = requests.get(url, stream=True)
z_zip_file_2006 = zipfile.ZipFile(io.BytesIO(r_zip_file_2006.content))
z_zip_file_2006.extractall('.')
csv_2006_df = pd.read_csv('hd2006_data_stata.csv')

For the 2006 file, Python raises:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 18: invalid start byte
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-26-b26a150e37ee> in <module>()
----> 1 csv_2006_df = pd.read_csv('hd2006_data_stata.csv')

Any tips on how to get past this?

1 Answer:

Answer 0: (score: 0)

Only took me 7 months... but I figured out my own answer. Not rocket science.

csv_2006_df = pd.read_csv('hd2006_data_stata.csv', 
                          encoding='ISO-8859-1')
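
For anyone wondering why the encoding argument helps, here is a minimal sketch using only the standard library. The sample bytes are made up for illustration (they are not taken from the IPEDS file); the only thing borrowed from the question is the byte 0x96 named in the error message. The byte 0x96 is a continuation byte in UTF-8, so it cannot start a character, which is what the error reports. ISO-8859-1 accepts every byte value, so it never fails, but it maps 0x96 to an invisible control character. In Windows-1252 the same byte is an en dash, so if the file was actually written on Windows (an assumption, I have not confirmed this for the NCES files), passing encoding='cp1252' to pd.read_csv may give cleaner text.

```python
# Hypothetical sample line containing byte 0x96, the byte named in the error.
raw = b"Example College \x96 Main Campus"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# UTF-8 rejects the data: 0x96 cannot start a UTF-8 sequence.
as_utf8 = try_decode(raw, "utf-8")

# ISO-8859-1 maps every byte to a code point, so 0x96 becomes control char U+0096.
as_latin1 = try_decode(raw, "ISO-8859-1")

# Windows-1252 maps 0x96 to U+2013 (en dash).
as_cp1252 = try_decode(raw, "cp1252")

print(repr(as_utf8))
print(repr(as_latin1))
print(repr(as_cp1252))
```

If the ISO-8859-1 result shows stray "\x96"-style characters in the institution names, that is a hint that cp1252 is the better guess for that file.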