In Python

Time: 2015-09-24 16:17:36

Tags: python pandas

I am trying to read a fixed-width file with pandas.read_fwf. Here is a sample of the file:

0000123456700123  
0001234567800045  

Say columns 0-11 are the balance (format %12.2f) and columns 11-16 are the interest rate (format %6.2f). The data frame I expect as output would look like this:

     Balance  Int_Rate  
0   12345.67      1.23  
1  123456.78      0.45

Here is my code, which reads the file without applying any formatting:

import pandas as pd

colspecs = [(0, 11), (11, 16)]
header = ['Balance', 'Int_Rate']
df = pd.read_fwf("dataset", colspecs=colspecs, names=header)

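I have checked the documentation for pandas.read_fwf, but there does not seem to be an option for formatting the columns during import. Do I need to fix the format afterwards, or is there a better way?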

1 answer:

Answer 0: (score: 1)

I once ran into the same problem. I used struct to split the lines and then loaded the result into pandas.
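As a small illustration of the idea before the full script, the sketch below shows how a tuple of field widths can be turned into a struct format string and used to slice one line. The layout (4, -2, 10) and the sample line are made up for this example and are not taken from any real file; a negative width is treated as filler to skip.

```python
import struct

# Hypothetical layout: 4-char code, 2 filler chars to skip, 10-char name.
widths = (4, -2, 10)

# "<n>s" reads n bytes as one field, "<n>x" skips n pad bytes.
fmt = " ".join(f"{abs(w)}{'x' if w < 0 else 's'}" for w in widths)
record = struct.Struct(fmt)  # fmt is "4s 2x 10s"

# Hypothetical 16-character line matching the layout above.
line = "AB12--Darwin    "

# unpack_from works on bytes, so encode the line and decode each field.
code, name = (field.decode() for field in record.unpack_from(line.encode()))

print(repr(code), repr(name))
```

The skipped bytes never show up in the result, and the fields keep their padding, so a `.strip()` may be wanted afterwards.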

import struct
import pandas as pd

def parse_data_file(fieldwidths, fn):
    # Build a struct format string from the field widths:
    # "<n>s" reads n bytes as a field, "<n>x" skips n pad bytes (negative widths).
    # see https://docs.python.org/3.0/library/struct.html, for formatting and other info
    fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                         for fw in fieldwidths)
    fieldstruct = struct.Struct(fmtstring)
    unpack = fieldstruct.unpack_from

    # Split one line into its fields according to fieldwidths.
    # Widths are counted in encoded bytes, so this assumes single-byte characters.
    def parse(line):
        return tuple(s.decode() for s in unpack(line.encode()))

    rows = []
    with open(fn, 'r') as f:
        for line in f:
            rows.append(parse(line))
    return rows

#
# test.txt file content, per below
# 6332      x102340   Darwin                                                                                              080007Darwin                                            1101
# 6332      x102342   Sydney                                                                                              200001Sydney                                            1101
file_location = "test.txt"
fieldwidths = (10, 10, 100, 4, 2, 50, 4)  # negative widths represent ignored padding fields

column_names = ['ID', 'LocationID', 'LocationName', 'PostCode', 'StateID', 'Address', 'CountryID']
fields = parse_data_file(fieldwidths=fieldwidths, fn=file_location)

# Pandas options
pd.options.display.width=500
pd.options.display.colheader_justify='left'

# load the parsed rows into a DataFrame
df = pd.DataFrame(fields, columns=column_names)

print(df)

Output

    ID    LocationID  LocationName  PostCode StateID Address CountryID
    6332  x102340     Darwin        0800     07      Darwin  1101    
    6332  x102342     Sydney        2000     01      Sydney  1101
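
For the file in the question itself, one possible approach is to keep pandas.read_fwf and apply the implied decimals after import. The sketch below is only that: it assumes both fields carry two implied decimal places (as the %12.2f and %6.2f formats suggest), and it reads the two sample lines from an in-memory buffer instead of the "dataset" file.

```python
import io
import pandas as pd

# The two sample lines from the question, held in memory instead of a file.
sample = "0000123456700123\n0001234567800045\n"

colspecs = [(0, 11), (11, 16)]
names = ["Balance", "Int_Rate"]

# Read every field as text first, so nothing is converted behind our back.
df = pd.read_fwf(io.StringIO(sample), colspecs=colspecs,
                 names=names, header=None, dtype=str)

# Assumption: two implied decimal places in both columns,
# so convert the digits to integers and divide by 100.
df = df.astype("int64") / 100

print(df)
```

For the two sample lines this gives 12345.67 and 1.23 in the first row and 123456.78 and 0.45 in the second. If the columns used different numbers of implied decimals, each column would need its own divisor.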