Question

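How do I read a weirdly formatted data file?

For example, what if different kinds of delimiters (, : |) are all used together?

Consider an example data frame with the following contents: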

Answer 1

Weird data indeed. First, split each column that contains k:v pairs and convert the pieces to pandas Series. Combine all three "Other" columns into one dataframe:

others = pd.concat(data[x].str.split(':').apply(pd.Series) 
                   for x in ('Other1', 'Other2', 'Other3')).dropna(how='all')

#                  0                  1
#0          Hospital   Awesome Hospital
#1           Hobbies            Cooking
#2          Hospital   Awesome Hospital
#0       Maiden Name              Rubin
#1  Hobby Experience           10 years
#2       Maiden Name            Simpson
#0               DOB         2015/04/09
#2               DOB         2015/04/16
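For readers who want to try this step without the original file, here is a minimal sketch. The `Other1` to `Other3` values are reconstructed from the output above; the `Full Name` and `Town` values are made up for illustration. It uses `str.split(..., n=1, expand=True)` rather than `.apply(pd.Series)`, which is a swapped-in variant, not the answer's exact code: `n=1` keeps values that themselves contain a colon intact, and the final `str.strip()` removes the space left after each colon.

```python
import pandas as pd

# Hypothetical stand-in for the question's data frame.
data = pd.DataFrame({
    "Full Name": ["Jane Doe", "John Roe", "Mary Major"],       # assumed values
    "Town": ["Springfield", "Shelbyville", "Capital City"],    # assumed values
    "Other1": ["Hospital: Awesome Hospital", "Hobbies: Cooking",
               "Hospital: Awesome Hospital"],
    "Other2": ["Maiden Name: Rubin", "Hobby Experience: 10 years",
               "Maiden Name: Simpson"],
    "Other3": ["DOB: 2015/04/09", None, "DOB: 2015/04/16"],
})

# Split each "Other" column on the first colon only; expand=True yields
# a two-column frame (0 = key, 1 = value) per source column.
parts = [
    data[col].str.split(":", n=1, expand=True)
    for col in ("Other1", "Other2", "Other3")
]

# Stack the pieces, drop rows that were empty in the source,
# and trim the whitespace around keys and values.
others = pd.concat(parts).dropna(how="all")
others = others.apply(lambda s: s.str.strip())
print(others)
```

The original row labels are kept on purpose, since the next step relies on them to line the keys up with the right record.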

Do some index manipulation (we want the keys to become the column names):

others = others.reset_index().set_index(['index',0]).unstack()
#                 1                                                          
#0              DOB   Hobbies Hobby Experience           Hospital Maiden Name
#index                                                                       
#0       2015/04/09      None             None   Awesome Hospital       Rubin
#1             None   Cooking         10 years               None        None
#2       2015/04/16      None             None   Awesome Hospital     Simpson

Drop the hierarchical column index produced by unstack():

others.columns = others.columns.get_level_values(0)
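As an alternative sketch of the same reshaping, the unstack and the flattening can be folded into a single `pivot` call once the two columns have explicit names. The names `key` and `value` are my own labels, not part of the original answer, and the long-format frame below is typed in by hand from the output shown earlier so that the snippet runs on its own.

```python
import pandas as pd

# Long-format key/value pairs, indexed by the original row number.
others = pd.DataFrame(
    {
        "key": ["Hospital", "Hobbies", "Hospital", "Maiden Name",
                "Hobby Experience", "Maiden Name", "DOB", "DOB"],
        "value": ["Awesome Hospital", "Cooking", "Awesome Hospital", "Rubin",
                  "10 years", "Simpson", "2015/04/09", "2015/04/16"],
    },
    index=[0, 1, 2, 0, 1, 2, 0, 2],
)

# reset_index() exposes the row number as a column named "index";
# pivot then turns each distinct key into its own column.
wide = others.reset_index().pivot(index="index", columns="key", values="value")
wide.columns.name = None  # drop the leftover "key" label on the columns
print(wide)
```

Named columns also sidestep any confusion between a level called 0 and the level at position 0.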

Piece it all back together:

pd.concat([data[["Full Name","Town"]], others], axis=1)

Answer 2

The parse library has a nice interface and may be a good choice for pulling out data like this:

>>> import parse
>>> format_spec='{}: {}' 
>>> string='Hobbies: Cooking'
>>> parse.parse(format_spec, string).fixed
('Hobbies', 'Cooking')

If you are going to parse the same spec repeatedly, use compile:

>>> other_parser = parse.compile(format_spec)
>>> other_parser.parse(string).fixed
('Hobbies', 'Cooking')
>>> other_parser.parse('Maiden Name: Rubin').fixed
('Maiden Name', 'Rubin')

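The fixed attribute returns the parsed arguments as a tuple. Using these tuples, we can build a bunch of dictionaries, feed them to pd.DataFrame, and merge the result with the first df: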

import parse
import pandas as pd

# slice first two columns from original dataframe
first_df = pd.read_csv(filepath, sep='\t').iloc[:, 0:2]

# make the parser
other_parser = parse.compile('{}: {}')

# parse remaining columns to a new dataframe
with open(filepath) as f:
    # a generator of dict objects is fed into DataFrame
    # the dict keys are column names
    others_df = pd.DataFrame(
        dict(other_parser.parse(substr).fixed
             for substr in line.rstrip('\n').split('\t')[2:])
        for line in f)

# merge on the indexes
df = pd.merge(first_df, others_df, left_index=True, right_index=True)
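One thing to watch in the loop above: a field that does not match the spec (an empty trailing column, for instance) makes `parse` return None, so `.fixed` would fail. Below is a dependency-free sketch of the same per-line idea that skips such fields. It swaps `parse` for the built-in `str.partition`, and it assumes, as the code above does, a tab-separated file whose first two columns are plain values. The helper name `parse_line` and the sample lines are hypothetical.

```python
import io

def parse_line(line):
    """Split one tab-separated line into (plain_fields, key_value_dict)."""
    fields = line.rstrip("\n").split("\t")
    plain, rest = fields[:2], fields[2:]
    record = {}
    for field in rest:
        # partition splits on the first colon only; sep is "" when
        # the field has no colon (for example, an empty column).
        key, sep, value = field.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    return plain, record

# In-memory stand-in for the real file (made-up names and towns).
sample = io.StringIO(
    "Jane Doe\tSpringfield\tHospital: Awesome Hospital\t"
    "Maiden Name: Rubin\tDOB: 2015/04/09\n"
    "John Roe\tShelbyville\tHobbies: Cooking\t"
    "Hobby Experience: 10 years\t\n"
)
rows = [parse_line(line) for line in sample]
for plain, record in rows:
    print(plain, record)
```

The dictionaries in `rows` can be passed to pd.DataFrame in the same way as the parse-based ones.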
