Question

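As requested in another post, I am posting below the code I use to strip text blocks out of a file with the format shown further down. As described earlier, the problem I am trying to solve with this code is the following: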

Parse text blocks in a file (file1) using patterns built from another file (file2). Both file1 and file2 are supplied as command-line arguments.

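The logic for identifying a text block is to count the '{' and '}' braces in that section (a text block contains many nested braces). One thing to note is that a 'Begin' line and an 'End' line enclose the block text (that is, the braces). In my code I try to track both, because file1 sometimes has no 'Begin'/'End' lines, whereas the braces are always present.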
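The brace-counting idea described above can be sketched in isolation as follows. This is only a minimal illustration, not the code from the post: `read_block` and `COMMENT_LINE` are hypothetical names, and it assumes that comment lines start with '/' or '*' and that the iterator is positioned at the first line of a block.

```python
import re

# Assumption: a comment line starts with '/' or '*' after optional whitespace.
COMMENT_LINE = re.compile(r'^\s*[/*]')

def read_block(lines):
    """Collect lines from an iterator until the braces balance again.

    Hypothetical helper: `lines` is any iterator of strings positioned at
    the first line of a block (the line that opens the first '{').
    """
    block = []
    depth = 0
    seen_open = False
    for line in lines:
        block.append(line)
        if COMMENT_LINE.match(line):
            continue  # braces inside comment lines do not affect the scope
        depth += line.count('{') - line.count('}')
        seen_open = seen_open or '{' in line
        if seen_open and depth == 0:
            break  # the brace that opened the block has been closed
    return block

sample = [
    'cell ("pattern1") {\n',
    '/* Comment lines { */\n',
    'line 1\n',
    'line 2 {\n',
    '}\n',
    'line 3\n',
    '}\n',
    '/* End : abcxyz*/\n',
]
block = read_block(iter(sample))
print(len(block))  # 7: everything up to and including the closing brace
```

Because the helper stops at the balancing brace, it does not depend on an 'End' line being present.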

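I need suggestions on how to improve this code for run time and for conciseness (code optimization). Note that file1 is a huge file with hundreds of thousands of lines, whereas file2 is small, about 100 lines. I have tried to comment the code as much as possible to make it easier to read.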
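Since file2 is small and file1 is large, one possible direction is to fold all patterns into a single compiled regular expression, so that each line of file1 is scanned once rather than once per pattern. The sketch below is an assumption about what might help, not a measured result: `build_matcher` is a hypothetical helper, and the patterns are treated as literal substrings (as with `p in line`).

```python
import re

def build_matcher(patterns):
    """Compile all patterns into one alternation regex (literal matching)."""
    # Drop duplicates and empty strings; an empty pattern would match every line.
    unique = sorted({p for p in patterns if p}, key=len, reverse=True)
    if not unique:
        return re.compile(r'(?!)')  # a regex that never matches
    return re.compile('|'.join(re.escape(p) for p in unique))

matcher = build_matcher(['abcxyz', 'defuvw', 'abcxyz', ''])
print(bool(matcher.search('/* Begin : abcxyz*/')))  # True
print(bool(matcher.search('/* Begin : other*/')))   # False
```

Whether this is actually faster than `any(p in line for p in patterns)` for about 100 patterns is worth timing on the real data before adopting it.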

The format of file1 is shown below:

/* Begin : abcxyz*/
cell ("pattern1") {
/* ---------------------------------------------------------------------- */
/* Comment lines */
/* ---------------------------------------------------------------------- */
line 1
line 2 {
}
line 3
}
/* End : abcxyz*/
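For reference, the 'Begin'/'End' marker lines in this format could be recognized with a single regular expression. This is a sketch under the assumption that markers always look like `/* Begin : name*/` and `/* End : name*/`; `MARKER` and `parse_marker` are hypothetical names.

```python
import re

# Assumption: markers look like '/* Begin : name*/' and '/* End : name*/'.
MARKER = re.compile(r'/\*\s*(Begin|End)\s*:\s*(.*?)\s*\*/')

def parse_marker(line):
    """Return (kind, name) for a Begin/End marker line, or None otherwise."""
    m = MARKER.search(line)
    return (m.group(1), m.group(2)) if m else None

print(parse_marker('/* Begin : abcxyz*/'))  # ('Begin', 'abcxyz')
print(parse_marker('cell ("pattern1") {'))  # None
```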

The actual code is listed below:

import sys  # Access to command-line arguments (argv)
import re   # Regular expressions

in_file = sys.argv[1]       # First argument: file1, the large input file.
pattern_file = sys.argv[2]  # Second argument: file2, the patterns to look for in file1.

# Read the patterns, strip the trailing newline, and drop duplicates.
# Blank lines are skipped because an empty pattern would match every line.
with open(pattern_file, 'r') as file2:
    patterns = sorted({line.rstrip('\n') for line in file2 if line.strip()})

# Lines starting with '/' or '*' are treated as comments and are not
# counted when determining the brace scope.
comment_line = re.compile(r'^\s*[/*]')

out_flag = False
forward_brace = 0
backward_brace = 0
scope_count = 0
with open(in_file, 'r') as in_lib, open("file3", 'w') as out_file:
    for line in in_lib:  # Iterate lazily instead of loading the whole file with readlines().
        # A 'Begin' line that names one of the patterns opens a block:
        # reset the brace counters and start writing to the output file.
        if 'Begin ' in line and any(p in line for p in patterns):
            forward_brace = backward_brace = 0
            out_flag = True
        # For every other non-comment line, update the brace counts that
        # track the scope of the cell block.
        elif not comment_line.match(line):
            forward_brace += line.count('{')
            backward_brace += line.count('}')
            scope_count = forward_brace - backward_brace
        # While inside a block, copy the line to the output file.
        # The 'End' line is itself a comment line, so comment lines must
        # fall through to this check for the flag to be cleared.
        if out_flag:
            out_file.write(line)
            if 'End ' in line:
                out_flag = False
        # Once the brace count is back to 0 and no block is open,
        # reset the tracking variables for the next cell.
        if scope_count == 0 and not out_flag:
            forward_brace = backward_brace = 0

Stripping text blocks using Python
