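As requested in another post, I am posting below the code I use to delete text blocks from a file with the format shown further down. As described there, the problem I am trying to solve with this code is as follows.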
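The logic for identifying a text block is to count the braces '{' and '}' in that section (the text block contains many nested braces). One thing to note is that a 'Begin' line and an 'End' line enclose the block text (i.e., the braces). In my code I try to track both, because the format of file1 may sometimes lack the 'Begin'/'End' lines, but the braces will always be present.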
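To make the brace-counting idea concrete, here is a minimal sketch, not the code under review. The helper name `block_end_index` and the constant `COMMENT_LINE` are hypothetical names introduced for illustration, and the sketch assumes that comment lines (those starting with '/' or '*') never contribute to the brace count.

```python
import re

# Assumption: a line starting with '/' or '*' is a comment and is not counted.
COMMENT_LINE = re.compile(r'^\s*[/*]')

def block_end_index(lines, start):
    """Return the index of the line that closes the first block at or after `start`.

    Hypothetical helper: tracks brace depth and stops when the depth returns
    to zero after at least one '{' has been seen. Returns None if the block
    never closes.
    """
    depth = 0
    opened = False
    for i in range(start, len(lines)):
        line = lines[i]
        if COMMENT_LINE.match(line):
            continue
        opens = line.count('{')
        depth += opens - line.count('}')
        if opens:
            opened = True
        if opened and depth == 0:
            return i
    return None

sample = [
    '/* Begin : abcxyz*/',
    'cell ("pattern1") {',
    '/* Comment lines */',
    'line 1',
    'line 2 {',
    '}',
    'line 3',
    '}',
    '/* End : abcxyz*/',
]
print(block_end_index(sample, 0))  # prints 7, the index of the final '}'
```

Because only the braces are counted, the scan works the same way whether or not the 'Begin'/'End' marker lines are present.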
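I need suggestions on how to improve this code for runtime as well as conciseness (code optimization). Please note that file1 is a huge file with several hundred thousand lines, whereas file2 is small, about 100 lines. I have tried to comment the code as much as possible to make it easier to read.
The format of file1 is shown below: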
/* Begin : abcxyz*/
cell ("pattern1") {
/* ---------------------------------------------------------------------- */
/* Comment lines */
/* ---------------------------------------------------------------------- */
line 1
line 2 {
}
line 3
}
/* End : abcxyz*/
The actual code is listed below:
import sys # Module to work with argv parameters
import re # Module to work with regular expressions
in_file = sys.argv[1] # Setting the first argument as file1.
pattern_file = sys.argv[2] # Setting the second argument as file2 (containing the patterns to be parsed in file1).
patterns = [] # Creating an empty list for populating the pattern details.
with open (pattern_file, 'r') as file2: # Opening the file2 in read mode.
    for pattern in sorted(set(file2.readlines())): # Sorting and making the pattern list unique while reading each pattern.
        patterns.append(pattern.rstrip('\n')) # Stripping the newline character and building the pattern array set.
out_flag = False
forward_brace = 0
backward_brace = 0
scope_count = 0
out_file = open ("file3", 'w')
with open (in_file, 'r') as in_lib: # Opens the input file(file1) for reading.
    for line in in_lib.readlines(): # Reads the entire content of the file1 in the form of list
        # Creates a generator expression; tries to get the first match from the file1 based on the pattern list
        # Once the begin block is found, the flag to write the output file is set to True
        if any(p in line for p in patterns) and 'Begin ' in line:
            forward_brace = backward_brace = 0
            out_flag = True
        # For all the lines other than 'Begin' statement, the brace count is calculated.
        # The brace count is kept track for determining the scope of the cell block.
        else:
            # Matches any line starting with '/*' or '*' to avoid counting the brace for scope determination.
            if any(re.match(r, line) for r in [r'^\s*/', r'^\s*[*]']):
                out_file.write(line)
                continue
            else:
                forward_brace += line.count('{')
                backward_brace += line.count('}')
                scope_count = forward_brace - backward_brace
        # Boolean check on flag performed for writing to the output file.
        # If the 'End ' block is arrived at then the out_flag is set to False.
        if out_flag:
            out_file.write(line)
            if 'End ' in line:
                out_flag = False
        # Once the end of the scope block is arrived at ie., brace count is 0 and
        # also flag is set to False, the tracking variables are reset for next cell.
        if scope_count == 0 and not (out_flag):
            forward_brace = backward_brace = 0
            out_flag = False
out_file.close()
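For comparison, here is a rough sketch of a streaming variant. It is not a drop-in replacement for the code above. It assumes the goal is to remove every block whose first line (the 'Begin' marker if present, otherwise the line with the opening brace) contains one of the patterns, together with an optional trailing 'End' marker. The names `load_patterns`, `drop_matching_blocks` and `filter_file` are hypothetical, and blank lines in file2 are skipped because an empty pattern would match every line.

```python
import re

# Same comment rule as the posted code: a line starting with '/' or '*'.
COMMENT_LINE = re.compile(r'^\s*[/*]')

def load_patterns(path):
    """Read one pattern per line into a set, ignoring blank lines."""
    with open(path) as handle:
        stripped = (line.rstrip('\n') for line in handle)
        return {p for p in stripped if p}

def drop_matching_blocks(lines, patterns):
    """Yield lines lazily, skipping every block whose first line has a pattern."""
    skipping = False   # True while inside a block that is being dropped
    opened = False     # True once the dropped block has seen its first '{'
    depth = 0          # current brace depth inside the dropped block
    for line in lines:
        if skipping and opened and depth == 0:
            # The braces of the dropped block are balanced, so it has ended.
            skipping = opened = False
            if COMMENT_LINE.match(line) and 'End ' in line:
                continue  # also drop the optional trailing 'End' marker
        if not skipping:
            starts_block = 'Begin ' in line or '{' in line
            if starts_block and any(p in line for p in patterns):
                skipping = True
            else:
                yield line
                continue
        # Inside a dropped block: count braces on non-comment lines only.
        if not COMMENT_LINE.match(line):
            opens = line.count('{')
            depth += opens - line.count('}')
            opened = opened or opens > 0

def filter_file(in_path, pattern_path, out_path):
    """Stream in_path to out_path without loading the whole file into memory."""
    patterns = load_patterns(pattern_path)
    with open(in_path) as src, open(out_path, 'w') as dst:
        dst.writelines(drop_matching_blocks(src, patterns))
```

Iterating over the file object directly, instead of calling `readlines()`, keeps only one line in memory at a time, which matters for a file1 with several hundred thousand lines. A call such as `filter_file(sys.argv[1], sys.argv[2], "file3")` would mirror the command-line usage of the original script.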