Question

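I want to read a .data file with pandas. The file is not structured properly: after reading it with pandas.read_csv(), I get 1 column and 1000 rows, but I expect 21 columns and 1000 rows (which is also what the data description states).

I also tried pandas.read_fwf(), but that function only works for fixed-width columns. In my case the columns are not fixed-width (each column holds a varying number of characters). So even when I set widths or colspecs to separate the columns (see https://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.read_fwf.html), I cannot get columns with the expected values, because the column widths vary from row to row.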

import pandas as pd

colspecs=[[0,3], [3,6], [6,10], [10,14], [14,20], [34,40], [40,47], [47,55],
[55,63], [63,70], [70,78], [78,86], [86,94], [94,103], [103,110], [110,116], [116,122], [122,130], [130,137], [137,144], [144,151]]

german_credit = pd.read_fwf("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data", colspecs=colspecs, header=None)
german_credit.columns = ["chk_acct", "duration", "credit_his", "purpose","amount", "saving_acct", "present_emp", "installment_rate", "sex", "other_debtor","present_resid", "property", "age", "other_install", "housing", "n_credits","job", "n_people", "telephone", "foreign", "response"]

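I expect the columns to contain the correct data, but after running the code above I get wrong column values. For example, for the "amount" column I get: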

german_credit['amount'].head()

0    1169 A
1      5951
2      2096
3      7882
4      4870
Name: amount, dtype: object

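The first row is wrong; it should be a number. The cause is that the preceding columns vary in width, so the fixed character ranges cut the row in the wrong places.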
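
To illustrate the misalignment, here is a minimal sketch on a made-up two-row sample (the rows and the four colspecs are hypothetical and not taken from german.data). The character ranges are chosen to fit the first row, and the second row has one field that is a character longer:

import io
import pandas as pd

# Hypothetical sample: fields separated by single spaces, so a longer
# value in one field shifts every later field to the right.
sample = "A11 6 A34 1169\nA12 48 A32 5951\n"

# Character ranges that fit the first row exactly.
colspecs = [(0, 3), (4, 5), (6, 9), (10, 14)]

df = pd.read_fwf(io.StringIO(sample), colspecs=colspecs, header=None)
print(df)

In the second row the fields start one character later, so the fixed ranges return 4, A3 and 595 instead of 48, A32 and 5951.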

How can I read a file whose columns do not have a fixed width?
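
One thing that might work, assuming the fields are separated by whitespace rather than aligned to fixed offsets (I have not confirmed this for german.data), is to pass a whitespace regex as the separator to pandas.read_csv(). A minimal sketch on a made-up sample (the rows and the four-column subset are hypothetical):

import io
import pandas as pd

# Hypothetical sample rows; assumption: whitespace-separated fields.
sample = "A11 6 A34 1169\nA12 48 A32 5951\n"

# sep=r"\s+" splits each line on runs of whitespace, so the position
# of a field in the line no longer matters.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)
df.columns = ["chk_acct", "duration", "credit_his", "amount"]
print(df["amount"].head())

If that assumption holds for the real file, the same sep argument could be passed together with the URL and header=None in place of the read_fwf() call.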

0 answers: