How to parse a hierarchy based on indentation using Python

Date: 2017-08-30 15:46:09

Tags: python

I have an accounting tree that is stored with indents/spaces in the source:

Income
   Revenue
      IAP
      Ads
   Other-Income
Expenses
   Developers
      In-house
      Contractors
   Advertising
   Other Expenses

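There is a fixed number of levels, so I would like to flatten the hierarchy by using 3 fields (the actual data has 6 levels; this is simplified for the example):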

L1       L2            L3
Income
Income   Revenue
Income   Revenue       IAP
Income   Revenue       Ads
Income   Other-Income
Expenses Developers    In-house
 ... etc

I can do this by checking the number of spaces before the account name:

for rownum in range(6,ws.max_row+1):
   accountName = str(ws.cell(row=rownum,column=1).value)
   indent = len(accountName) - len(accountName.lstrip(' '))
   if indent == 0:
      l1 = accountName
      l2 = ''
      l3 = ''
   elif indent == 3:
      l2 = accountName
      l3 = ''
   else:
      l3 = accountName

   w.writerow([l1,l2,l3])

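Is there a more flexible way to achieve this, based on the indentation of the current row compared to the previous row, rather than assuming it is always 3 spaces per level? L1 will always have no indent, and we can trust that lower levels will be indented further than their parent, but perhaps not always by 3 spaces per level.

Update: I ended up with this as the core of the logic. Since I ultimately wanted the list of accounts along with the content, using the indent to decide whether to reset, append to, or pop the list seemed simplest: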

        if indent == 0:
            accountList = []
            accountList.append((indent,accountName))
        elif indent > prev_indent:
            accountList.append((indent,accountName))
        elif indent <= prev_indent:
            max_indent = int(max(accountList,key=itemgetter(0))[0])
            while max_indent >= indent:
                accountList.pop()
                max_indent = int(max(accountList,key=itemgetter(0))[0])
            accountList.append((indent,accountName))

So, at each row of output, accountList is complete.
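
The reset/append/pop logic above can be condensed into one loop that also pads each path to the fixed number of columns asked for in the question. This is only a sketch under assumptions: `flatten_rows`, its `levels` parameter and the sample `lines` are hypothetical names that do not appear in the original code, and it relies on every child being indented further than its parent.

```python
def flatten_rows(lines, levels=3):
    """Yield one row per account, padded with '' up to `levels` columns."""
    path = []  # (indent, name) pairs from the root down to the current account
    for line in lines:
        name = line.strip()
        if not name:
            continue  # skip blank lines
        indent = len(line) - len(line.lstrip(" "))
        # Pop siblings and deeper accounts until only true ancestors remain.
        while path and path[-1][0] >= indent:
            path.pop()
        path.append((indent, name))
        if len(path) > levels:
            raise ValueError(f"{name!r} is nested deeper than {levels} levels")
        names = [n for _, n in path]
        yield names + [""] * (levels - len(names))


# Hypothetical sample with mixed indentation widths (3, 6 and 2 spaces).
lines = [
    "Income",
    "   Revenue",
    "      IAP",
    "   Other-Income",
    "Expenses",
    "  Advertising",
]
rows = list(flatten_rows(lines))
for row in rows:
    print(row)
```

Each row could then be handed to `w.writerow(row)` as in the loop from the question. Storing the indent next to each name avoids the repeated `max(...)` lookup, because the deepest entry is always the last one in the list.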

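2 answers:

Answer 0: (score: 5)

You can mimic the way Python actually parses indentation. First, create a stack that will contain the indentation levels. At each line:

  • If the indentation is greater than the top of the stack, push it and increase the depth level.
  • If it is the same, continue at the same level.
  • If it is lower, pop the top of the stack while it is higher than the new indentation. If you find a lower indentation level before finding exactly the same one, there is an indentation error.
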
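indentation = [0]  # stack of the indentation widths currently open
depth = 0

with open("test.txt") as f:
    for line in f:
        content = line.strip()
        if not content:
            continue  # ignore blank lines
        # Count leading whitespace only, so trailing spaces are not counted.
        indent = len(line) - len(line.lstrip())

        if indent > indentation[-1]:
            # Deeper than the current level: open a new level.
            depth += 1
            indentation.append(indent)

        elif indent < indentation[-1]:
            # Shallower: close levels until a matching one is reached.
            while indent < indentation[-1]:
                depth -= 1
                indentation.pop()

            if indent != indentation[-1]:
                raise RuntimeError("Bad formatting")

        print(f"{content} (depth: {depth})")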

With a "test.txt" file whose contents match what you provided:

Income
   Revenue
      IAP
      Ads
   Other-Income
Expenses
   Developers
      In-house
      Contractors
   Advertising
   Other Expenses

Here is the output:

Income (depth: 0)
Revenue (depth: 1)
IAP (depth: 2)
Ads (depth: 2)
Other-Income (depth: 1)
Expenses (depth: 0)
Developers (depth: 1)
In-house (depth: 2)
Contractors (depth: 2)
Advertising (depth: 1)
Other Expenses (depth: 1)

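So, what can you do with this? Suppose you want to build nested lists. First, create a data stack.

  • When you find an indentation, append a new list to the end of the data stack.
  • When you find an unindentation, pop the top list and append it to the new top.

In either case, for each line, append the content to the list at the top of the data stack.

Here is the corresponding implementation: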

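indentation = [0]  # stack of the indentation widths currently open
depth = 0
# Stack of lists under construction. The starting value [[]] is inferred
# from the later use of data[0]; the original snippet does not show it.
data = [[]]

with open("test.txt") as f:
    for line in f:
        content = line.strip()
        if not content:
            continue  # ignore blank lines
        indent = len(line) - len(line.lstrip())

        if indent > indentation[-1]:
            depth += 1
            indentation.append(indent)
            data.append([])  # start a list for the new, deeper level

        elif indent < indentation[-1]:
            while indent < indentation[-1]:
                depth -= 1
                indentation.pop()
                top = data.pop()      # close the finished level...
                data[-1].append(top)  # ...and nest it in its parent

            if indent != indentation[-1]:
                raise RuntimeError("Bad formatting")

        data[-1].append(content)

# Close any levels still open at the end of the file.
while len(data) > 1:
    top = data.pop()
    data[-1].append(top)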

Your nested list is at the top of the data stack. The output for the same file is:

['Income',
    ['Revenue',
        ['IAP',
         'Ads'
        ],
     'Other-Income'
    ],
 'Expenses',
    ['Developers',
        ['In-house',
         'Contractors'
        ],
     'Advertising',
     'Other Expenses'
    ]
 ]

This is easy to manipulate, although it is quite deeply nested. You can access the data by chaining item accesses:

>>> l = data[0]
>>> l
['Income', ['Revenue', ['IAP', 'Ads'], 'Other-Income'], 'Expenses', ['Developers', ['In-house', 'Contractors'], 'Advertising', 'Other Expenses']]
>>> l[1]
['Revenue', ['IAP', 'Ads'], 'Other-Income']
>>> l[1][1]
['IAP', 'Ads']
>>> l[1][1][0]
'IAP'
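
Chained indexing works for single lookups, but to get back to one flat path per account it may be easier to walk the nested list recursively. The sketch below is an illustration rather than part of the answer: `iter_paths` and the shortened `nested` sample are hypothetical, and it assumes the layout shown above, where every sub-list directly follows the string it belongs to.

```python
def iter_paths(nested, prefix=()):
    """Yield the full path (as a tuple) of every name in the nested list."""
    parent = None
    for item in nested:
        if isinstance(item, list):
            # A sub-list holds the children of the string just before it.
            yield from iter_paths(item, prefix + (parent,))
        else:
            parent = item
            yield prefix + (item,)


# Shortened, hypothetical sample in the same shape as data[0].
nested = ["Income", ["Revenue", ["IAP", "Ads"], "Other-Income"],
          "Expenses", ["Developers"]]
paths = list(iter_paths(nested))
for path in paths:
    print(" > ".join(path))
```

Each tuple holds the ancestors followed by the account itself, so padding it to a fixed length would give the L1/L2/L3 columns from the question.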

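Answer 1: (score: 2)

If the indentation is a fixed number of spaces (3 spaces here), you can simplify the calculation of the indentation level.

Note: I use a StringIO to simulate a file.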

import io
import itertools

content = u"""\
Income
   Revenue
      IAP
      Ads
   Other-Income
Expenses
   Developers
      In-house
      Contractors
   Advertising
   Other Expenses
"""

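stack = []
for line in io.StringIO(content):
    text = line.rstrip()  # drop \n
    row = text.split("   ")  # one empty field per 3-space indent level
    # Keep the ancestors for this depth, then append the current name.
    stack[:] = stack[:len(row) - 1] + [row[-1]]
    print("\t".join(stack))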

You get:

Income
Income  Revenue
Income  Revenue IAP
Income  Revenue Ads
Income  Other-Income
Expenses
Expenses    Developers
Expenses    Developers  In-house
Expenses    Developers  Contractors
Expenses    Advertising
Expenses    Other Expenses

Edit: indentation not fixed

If the indentation is not fixed (you do not always have 3 spaces), as in the example below:

content = u"""\
Income
   Revenue
    IAP
    Ads
   Other-Income
Expenses
   Developers
      In-house
      Contractors
  Advertising
  Other Expenses
"""

You need to work out the shift at each new line:

stack = []    # account names on the current path
indents = []  # indentation width of each entry in stack
for line in io.StringIO(content):
    indent = len(line) - len(line.lstrip(" "))
    # Shift back past every level indented at least as far as this line,
    # so that a dedent of several levels at once is handled as well.
    while indents and indents[-1] >= indent:
        indents.pop()
        stack.pop()
    indents.append(indent)
    stack.append(line.strip())
    print("\t".join(stack))