Lazily parsing a stateful, multi-line-per-record data stream in Python?

Asked: 2013-02-07 02:16:56

Tags: python parsing

Here is what one file looks like:

BEGIN_META
    stuff
    to
    discard
END_META
BEGIN_DB
    header
    to
    discard

    data I
    wish to
    extract
 END_DB

I want to be able to parse an unbounded stream of these files, all cat'ed together, which rules out things like re.findall('something useful', '\n'.join(sys.stdin), re.M).

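Below is my attempt, but I have to force the generator returned by get_raw_table() into a list, so it doesn't quite meet the requirement. Removing the forcing means you can't test whether the returned generator is empty, so you can't tell whether sys.stdin is exhausted.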

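import sys

def get_raw_table(it):
    state = 'begin'
    for line in it:
        if line.startswith('BEGIN_DB'):
            state = 'discard'
        # lstrip() because END_DB is indented by one space in the sample file
        elif line.lstrip().startswith('END_DB'):
            return
        # compare strings with ==, not `is` (identity is not guaranteed)
        elif state == 'discard' and not line.strip():
            state = 'take'
        elif state == 'take' and line.strip():
            yield line.strip().strip('#').split()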
# raw_tables is a list (per file) of lists (per row) of lists (per column)
raw_tables = []
while True:
    result = list(get_raw_table(sys.stdin))
    if result:
        raw_tables.append(result)
    else:
        break

2 Answers:

Answer 0 (score: 4)

Something like this might work:

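import itertools

def chunks(it):
    while True:
        # Skip everything before the next BEGIN_DB line.
        it = itertools.dropwhile(lambda x: 'BEGIN_DB' not in x, it)
        # Skip BEGIN_DB and the header, up to the blank separator line.
        it = itertools.dropwhile(lambda x: x.strip(), it)
        # Consume the blank line. Since Python 3.7 (PEP 479) a StopIteration
        # escaping a generator becomes a RuntimeError, so end explicitly.
        try:
            next(it)
        except StopIteration:
            return
        yield itertools.takewhile(lambda x: 'END_DB' not in x, it)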

For example:

src = """
BEGIN_META
    stuff
    to
    discard
END_META
BEGIN_DB
    header
    to
    discard

    1data I
    1wish to
    1extract
 END_DB


BEGIN_META
    stuff
    to
    discard
END_META
BEGIN_DB
    header
    to
    discard

    2data I
    2wish to
    2extract
 END_DB
"""


src = iter(src.splitlines())
for chunk in chunks(src):
    for line in chunk:
        print(line.strip())
    print()
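The chunks above could be combined with the row splitting from the question roughly as follows. This is only a sketch: the helper name `rows` and the shortened `sample` input are illustrative assumptions, not part of the original answer, and `chunks` is repeated so the block runs on its own.

```python
import itertools

def chunks(it):
    """Yield one lazy iterator of data lines per BEGIN_DB ... END_DB block."""
    it = iter(it)
    while True:
        it = itertools.dropwhile(lambda x: 'BEGIN_DB' not in x, it)
        it = itertools.dropwhile(lambda x: x.strip(), it)
        try:
            next(it)  # consume the blank line that ends the header
        except StopIteration:
            return    # no further BEGIN_DB block in the stream
        yield itertools.takewhile(lambda x: 'END_DB' not in x, it)

def rows(chunk):
    """Hypothetical helper: split each data line into columns, as the question does."""
    return (line.strip().strip('#').split() for line in chunk)

# Shortened sample input (an assumption for illustration only).
sample = """BEGIN_DB
    header

    a b
    c d
END_DB
BEGIN_DB
    header

    e f
END_DB
""".splitlines()

for table in chunks(sample):
    for row in rows(table):
        print(row)
    print('--')
```

All chunks share one underlying iterator, so each chunk has to be read before the next one is requested. Collecting the outer generator first, as in `list(chunks(sample))`, advances the stream past every block and leaves each chunk empty.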

Answer 1 (score: 1)

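You can split your program into separate functions, which makes its logic easier to follow and the code more modular and flexible. Try to stay away from things like
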
state = "some string"

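because what happens if you later want to add something to this module? You would need to know which values the variable "state" can take and what happens when its value changes. There is no guarantee you will remember that, and it can get you into trouble. Writing functions that model this behavior is cleaner and easier to implement.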

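import sys

def read_stdin():
    # Yield lines one at a time so the whole stream is never held in memory.
    for line in sys.stdin:
        yield line

def search_line_for_start_db(line):
    if "BEGIN_DB" in line:
        search_db_for_info()

def search_db_for_info():
    # Skip the header, which ends at the first blank line.
    for new_line in read_line:
        if not new_line.strip():
            break
    # Collect rows until END_DB.
    table = []
    for new_line in read_line:
        if "END_DB" in new_line:
            break
        if new_line.strip():
            # Put your information somewhere
            table.append(new_line.split())
    raw_tables.append(table)

read_line = read_stdin()
raw_tables = []
while True:
    try:
        search_line_for_start_db(next(read_line))
    except StopIteration:  # Your stdin stream has finished being read
        break  # end your program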
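The same idea (one function per phase instead of a `state` string) can also be kept lazy, which is what the question asks for. The following is a minimal sketch under that assumption; `skip_until` and `iter_tables` are hypothetical names introduced here, and the sample input is shortened for illustration.

```python
import itertools

def skip_until(lines, predicate):
    """Advance `lines` past the first line matching `predicate`.

    Returns False if the iterator is exhausted before a match is found.
    """
    for line in lines:
        if predicate(line):
            return True
    return False

def iter_tables(lines):
    """Yield one lazy iterator of rows per BEGIN_DB ... END_DB block."""
    lines = iter(lines)
    # The loop condition doubles as the end-of-stream test, so no table
    # has to be forced into a list just to see whether input remains.
    while skip_until(lines, lambda ln: 'BEGIN_DB' in ln):
        # The header ends at the first blank line.
        if not skip_until(lines, lambda ln: not ln.strip()):
            return
        data = itertools.takewhile(lambda ln: 'END_DB' not in ln, lines)
        yield (ln.split() for ln in data if ln.strip())

# Shortened sample input (an assumption for illustration only).
sample = """BEGIN_META
    stuff
END_META
BEGIN_DB
    header

    a b
    c d
END_DB
BEGIN_DB
    header

    e f
END_DB
""".splitlines()

for table in iter_tables(sample):
    for row in table:
        print(row)
```

For real input, `iter_tables(sys.stdin)` would take the place of `iter_tables(sample)`. As with any lazy approach over a single stream, each table should be read before the next one is requested; rows left unread are skipped when the search for the next BEGIN_DB resumes.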