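Reading a CSV with several subgroups

Time: 2018-01-02 13:04:03

Tags: python-2.7 pandas

I have a CSV file containing "pivot-like" data that I want to store in a pandas DataFrame. The raw data file uses varying numbers of leading spaces to distinguish the levels of the pivoted data, as shown below: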

Text that I do not want to include,,
,Text that I do not want to include,Text that I do not want to include
,header A,header B
Total,100,100
A,,2.15
   a1,,2.15
B,,0.22
   b1,,0.22
"      slightly longer name"...,,0.22
         b3,,0.22
C,71.08,91.01
   c1,57.34,73.31
      c2,5.34,6.76
         c3,1.33,1.67
            x1,0.26,0.33
            x2,0.26,0.34
            x3,0.48,0.58
            x4,0.33,0.42
         c4,3.52,4.33
            x5,0.27,0.35
            x6,0.21,0.27
            x7,0.49,0.56
            x8,0.44,0.47
            x9,0.15,0.19
            x10,,0.11
            x11,0.18,0.23
            x12,0.18,0.23
            x13,0.67,0.85
            x14,0.24,0.2
            x15,0.68,0.87
         c5,0.48,0.76
            x16,,0.15
            x17,0.3,0.38
            x18,0.18,0.23
      d2,6.75,8.68
         d3,0.81,1.06
            x19,0.3,0.38
            x20,0.51,0.68
Others,24.23,0
N/A,,
"Text that I do not want to include(""at all"") ",,

(It looks bad here, but you should be able to paste it into e.g. Notepad and see it more clearly.)

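Basically, there are only two columns, a and b, but the rows are indented with 0, 3, 6, 9, ... spaces to distinguish the levels. For example,

  • level zero, the main group A, has 0 spaces,
  • the first level, a1, has 3 spaces,
  • the second level, a2, has 6 spaces,
  • the third level, a3, has 9 spaces, and
  • the fourth and final level has 12 spaces, with the corresponding values for columns a and b.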

I now want to be able to read and group this data by these levels, so that I can create a new summary DataFrame whose columns correspond to the different levels.

Any clues on how to do this?

Thanks

1 answer:

Answer 0: (score: 1)

The easiest approach is to split this into separate functions:

  1. Read the file
  2. Parse the lines
  3. Generate the 'tree'
  4. Construct the DataFrame

    Parse the lines

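    import ast
    import re

    # indentation, label, then two numeric columns; rows with a missing value do not match
    pat = re.compile(r'^( *)(\w+),([\d.]+),([\d.]+)$')

    def parse_file(file):
        for line in file:
            r = pat.match(line)
            if r:
                spaces, label, a, b = r.groups()
                # literal_eval converts the text to int or float
                diff = ast.literal_eval(a) - ast.literal_eval(b)
                # three spaces of indentation per level
                yield len(spaces)//3, label, diff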
    

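    This reads each line and uses a regular expression to produce the level, the 'label' and the diff. I use ast to convert the strings to int or float. Rows where either value is missing, or whose label contains anything other than word characters, do not match the pattern and are skipped.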
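
To see which rows the pattern accepts, the following minimal sketch applies the same regular expression to a few rows from the question. The helper name `parse_row` is hypothetical and only used for illustration; it assumes three spaces of indentation per level, as in the sample data.

```python
import ast
import re

# Same pattern as in parse_file: indentation, a \w+ label, two numeric columns
pat = re.compile(r'^( *)(\w+),([\d.]+),([\d.]+)$')

def parse_row(line):
    """Hypothetical helper: return (level, label, a, b), or None if the row is skipped."""
    m = pat.match(line)
    if m is None:
        return None
    spaces, label, a, b = m.groups()
    # literal_eval turns '100' into an int and '57.34' into a float
    return len(spaces) // 3, label, ast.literal_eval(a), ast.literal_eval(b)

print(parse_row('Total,100,100'))          # (0, 'Total', 100, 100)
print(parse_row('   c1,57.34,73.31'))      # (1, 'c1', 57.34, 73.31)
print(parse_row('            x10,,0.11'))  # None: the first value is missing
print(parse_row('N/A,,'))                  # None: no values, and '/' is not matched by \w
```

Skipped rows simply never reach the later steps, which is why x10 and x16 do not show up in the resulting DataFrame.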

    Generate the tree

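    def parse_lines(lines):
        # one slot per level; the initial values 0..4 are only placeholders
        previous_label = list(range(5))
        for level, label, diff in lines:
            # remember the most recent label seen at this level
            previous_label[level] = label
            # only the deepest level (4) produces a row
            if level == 4:
                yield tuple(previous_label), diff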
    

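    This starts with a list of length 5 and then overwrites the entry for the level the current node is at. Entries for shallower levels keep the labels of the node's parents.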
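
The overwrite-by-level idea can be tried on a handful of rows. This is a small sketch rather than the answer's exact code: the function name `leaf_paths` and the `depth` parameter are assumptions made for the example, and the rows are typed in by hand from the question's sample data.

```python
def leaf_paths(rows, depth=5):
    """Yield (path, value) for every row at the deepest level.

    rows is an iterable of (level, label, value) in file order.
    """
    path = [None] * depth
    for level, label, value in rows:
        # Overwrite the slot for this level; shallower slots keep their parents
        path[level] = label
        if level == depth - 1:
            yield tuple(path), value

# (level, label, a - b), copied by hand from the sample file
rows = [
    (0, 'C', 71.08 - 91.01),
    (1, 'c1', 57.34 - 73.31),
    (2, 'c2', 5.34 - 6.76),
    (3, 'c3', 1.33 - 1.67),
    (4, 'x1', 0.26 - 0.33),
    (4, 'x2', 0.26 - 0.34),
    (3, 'c4', 3.52 - 4.33),
    (4, 'x5', 0.27 - 0.35),
]

paths = [path for path, _ in leaf_paths(rows)]
for path in paths:
    print(path)
# ('C', 'c1', 'c2', 'c3', 'x1')
# ('C', 'c1', 'c2', 'c3', 'x2')
# ('C', 'c1', 'c2', 'c4', 'x5')
```

When c4 arrives it replaces c3 in slot 3, so the leaves that follow are attributed to c4 while still sharing C, c1 and c2.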

    Construct the DataFrame

    from io import StringIO
    import pandas as pd

    # file_content is assumed to hold the CSV text shown in the question
    with StringIO(file_content) as file:
        lines = parse_file(file)
        index, data = zip(*parse_lines(lines))
        idx = pd.MultiIndex.from_tuples(index, names=[f'level_{i}' for i in range(len(index[0]))])
        df = pd.DataFrame(data={'Diff(a,b)': list(data)}, index=idx)
    

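    This opens the file, builds the index and generates a DataFrame with the different levels in the index. If you don't want that, you can add .reset_index() or construct the DataFrame slightly differently.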
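
As a sketch of those two options, the snippet below builds the frame with a MultiIndex and flattens it with `.reset_index()`, and also builds a flat frame directly from rows. The two sample paths and their values are taken from the question's data; the variable names are only illustrative.

```python
import pandas as pd

index = [
    ('C', 'c1', 'c2', 'c3', 'x1'),
    ('C', 'c1', 'c2', 'c3', 'x2'),
]
data = [0.26 - 0.33, 0.26 - 0.34]
names = ['level_%d' % i for i in range(len(index[0]))]

# Option 1: levels in the index, then moved back into ordinary columns
idx = pd.MultiIndex.from_tuples(index, names=names)
df = pd.DataFrame(data={'Diff(a,b)': data}, index=idx)
flat = df.reset_index()

# Option 2: build the flat frame directly, one row per leaf
rows = [path + (value,) for path, value in zip(index, data)]
flat_direct = pd.DataFrame(rows, columns=names + ['Diff(a,b)'])

print(flat)
print(flat_direct)
```

Keeping the levels in the index is convenient for selecting or grouping by level, while the flat form is usually easier to export or merge with other tables.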

    df
    
    level_0 level_1 level_2 level_3 level_4 Diff(a,b)
    A   a1  a2  a3  x1  -0.07
    A   a1  a2  a3  x2  -0.08000000000000002
    A   a1  a22 a3  x3  -0.04999999999999999
    A   a1  a22 a3  x4  -0.04000000000000001
    A   a1  a22 a3  x5  -0.03
    A   a1  a22 a3  x6  -0.06999999999999998
    C   c1  c2  c3  x7  525.0
    C   c1  c2  c3  x8  -0.08000000000000002
    

    Alternative for missing levels

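    def parse_lines(lines):
        labels = [None] * 5
        previous_level = None
        for level, label, diff in lines:
            labels[level] = label
            if level == 4:
                # a leaf that does not follow a level-3 row or another leaf:
                # blank out the levels that were skipped, keeping the length at 5
                if previous_level is not None and previous_level < 3:
                    labels = labels[:previous_level + 1] + [None] * (4 - previous_level)
                    labels[level] = label
                yield tuple(labels), diff
            previous_level = level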
    

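    The items under a22 in the first output do not seem to have a level_3, so the first version of parse_lines copies it from the previous item. If that is not wanted, you can use the variant above, which sets the skipped levels to None instead.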

    df
    
    level_0 level_1 level_2 level_3 level_4 Diff(a,b)
    C   c1  c2  c3  x1  -0.07
    C   c1  c2  c3  x2  -0.08000000000000002
    C   c1  c2  c3  x3  -0.09999999999999998
    C   c1  c2  c3  x4  -0.08999999999999997
    C   c1  c2  c4  x5  -0.07999999999999996
    C   c1  c2  c4  x6  -0.060000000000000026
    C   c1  c2  c4  x7  -0.07000000000000006
    C   c1  c2  c4  x8  -0.02999999999999997
    C   c1  c2  c4  x9  -0.04000000000000001
    C   c1  c2  c4  x11 -0.05000000000000002
    C   c1  c2  c4  x12 -0.05000000000000002
    C   c1  c2  c4  x13 -0.17999999999999994
    C   c1  c2  c4  x14 0.03999999999999998
    C   c1  c2  c4  x15 -0.18999999999999995
    C   c1  c2  c5  x17 -0.08000000000000002
    C   c1  c2  c5  x18 -0.05000000000000002
    C   c1  d2  d3  x19 -0.08000000000000002
    C   c1  d2  d3  x20 -0.17000000000000004
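
To make the difference between the two `parse_lines` variants concrete, here is a small self-contained sketch. The function `leaf_paths` and its `reset_missing` flag are hypothetical names for this illustration, and it resets on any skipped level rather than only at level 4, so it is a generalisation and not the answer's exact logic. The input rows are hand-written to mimic the a22 case, where a leaf follows a level-2 row directly.

```python
def leaf_paths(rows, depth=5, reset_missing=False):
    """Yield the full label path for every row at the deepest level.

    rows is an iterable of (level, label) pairs in file order.
    With reset_missing=True, levels skipped between a parent and its child
    are set to None instead of keeping the label of an earlier branch.
    """
    labels = [None] * depth
    previous_level = None
    for level, label in rows:
        if reset_missing and previous_level is not None and level > previous_level + 1:
            # Forget everything deeper than the previous row's level
            labels[previous_level + 1:] = [None] * (depth - previous_level - 1)
        labels[level] = label
        if level == depth - 1:
            yield tuple(labels)
        previous_level = level

rows = [(0, 'A'), (1, 'a1'), (2, 'a2'), (3, 'a3'), (4, 'x1'),
        (2, 'a22'), (4, 'x3'), (4, 'x4')]

carried = list(leaf_paths(rows))
blanked = list(leaf_paths(rows, reset_missing=True))

print(carried[1])  # ('A', 'a1', 'a22', 'a3', 'x3')
print(blanked[1])  # ('A', 'a1', 'a22', None, 'x3')
```

Whether a stale level_3 label or None is the better placeholder depends on how the summary will be grouped afterwards.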