file1.txt contains the following lines:
[SUM] 0.00-34.53 sec 2.11 GBytes 524 Mbits/sec sender
[SUM] 0.00-34.53 sec 2.11 GBytes 524 Mbits/sec sender
[SUM] 0.00-34.62 sec 2.36 GBytes 586 Mbits/sec sender
[SUM] 0.00-34.62 sec 2.36 GBytes 586 Mbits/sec sender
[SUM] 0.00-34.75 sec 2.39 GBytes 591 Mbits/sec receiver
[SUM] 0.00-34.75 sec 2.39 GBytes 591 Mbits/sec receiver
[0] 0.00-34.53 sec 0.00 Bytes 0.00 bits/sec receiver
[0] 0.00-34.75 sec 0.00 Bytes 0.00 bits/sec sender
I want to write the lines that start with [SUM] and end with sender or receiver to another text file, file2.txt.
Here is my code:
import re

with open(r"C:\Users\file1.txt", 'r') as f:
    contents = f.read()
s = contents

def my_function1():
    # [SUM] lines ending in "sender", skipping the "0.00 Bytes" ones
    regex = r"^\s*\[SUM\]\s*[0-9\-\.]+\s+sec(?!\s+0\.00 Bytes).*sender.*"
    items = re.findall(regex, s, re.MULTILINE)
    for y in items:
        file = open('file2.txt', "a")
        file.write(str(y))
        file.write("\n")
        file.close()

def my_function2():
    # [SUM] lines ending in "receiver", skipping the "0.00 Bytes" ones
    regex = r"^\s*\[SUM\]\s*[0-9\-\.]+\s+sec(?!\s+0\.00 Bytes).*receiver.*"
    items = re.findall(regex, s, re.MULTILINE)
    for y in items:
        file = open('file2.txt', "a")
        file.write(str(y))
        file.write("\n")
        file.close()
        #print(y)

my_function1()
my_function2()
This writes the following output to file2.txt:
[SUM] 0.00-34.53 sec 2.11 GBytes 524 Mbits/sec sender
[SUM] 0.00-34.53 sec 2.11 GBytes 524 Mbits/sec sender
[SUM] 0.00-34.62 sec 2.36 GBytes 586 Mbits/sec sender
[SUM] 0.00-34.62 sec 2.36 GBytes 586 Mbits/sec sender
[SUM] 0.00-34.62 sec 2.36 GBytes 586 Mbits/sec sender
[SUM] 0.00-34.75 sec 2.39 GBytes 591 Mbits/sec receiver
[SUM] 0.00-34.75 sec 2.39 GBytes 591 Mbits/sec receiver
Expected: each line printed only once:
[SUM] 0.00-34.53 sec 2.11 GBytes 524 Mbits/sec sender
[SUM] 0.00-34.62 sec 2.36 GBytes 586 Mbits/sec sender
[SUM] 0.00-34.75 sec 2.39 GBytes 591 Mbits/sec receiver
Answer 0 (score: 1)
Just use awk:
$ awk '/^\[SUM]/ && !seen[$0]++' file
[SUM] 0.00-34.53 sec 2.11 GBytes 524 Mbits/sec sender
[SUM] 0.00-34.62 sec 2.36 GBytes 586 Mbits/sec sender
[SUM] 0.00-34.62 sec 2.36 GBytes 586 Mbits/sec sender
[SUM] 0.00-34.75 sec 2.39 GBytes 591 Mbits/sec receiver
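As you can see, the sample input you posted doesn't need a regexp as complicated as yours, but if your real data does, then this may be what you want (it uses GNU awk for \s; with other awks use [[:space:]] instead):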
$ awk '/^\s*\[SUM]\s*[0-9.-]+\s+sec\s.*(sender|receiver)/ && !seen[$0]++' file
[SUM] 0.00-34.53 sec 2.11 GBytes 524 Mbits/sec sender
[SUM] 0.00-34.62 sec 2.36 GBytes 586 Mbits/sec sender
[SUM] 0.00-34.62 sec 2.36 GBytes 586 Mbits/sec sender
[SUM] 0.00-34.75 sec 2.39 GBytes 591 Mbits/sec receiver
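For readers who would rather stay in Python, the `!seen[$0]++` idiom above (print a line only the first time it is seen) can be approximated with a set. This is only a minimal sketch: the function name `first_occurrences` and the in-memory `sample` list are illustrative assumptions, not part of the original answer.

```python
def first_occurrences(lines, prefix="[SUM]"):
    """Yield each line that starts with `prefix`, the first time it appears."""
    seen = set()  # plays the role of awk's seen[] array
    for line in lines:
        if line.startswith(prefix) and line not in seen:
            seen.add(line)
            yield line


# Hypothetical in-memory sample standing in for the input file.
sample = [
    "[SUM] 0.00-34.53 sec 2.11 GBytes 524 Mbits/sec sender",
    "[SUM] 0.00-34.53 sec 2.11 GBytes 524 Mbits/sec sender",
    "[SUM] 0.00-34.75 sec 2.39 GBytes 591 Mbits/sec receiver",
    "[0] 0.00-34.53 sec 0.00 Bytes 0.00 bits/sec receiver",
]
unique = list(first_occurrences(sample))
print("\n".join(unique))
```

Like the awk version, this compares whole lines exactly, so two lines that differ only in whitespace count as distinct. Calling `line.rstrip()` before the comparison would be one way to relax that.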
Answer 1 (score: 1)
You don't need the re module here, nor do you have to load everything into memory:
with open(r"C:\Users\file1.txt", 'r') as f, open('file2.txt', "w") as file:
    seen = set()                  # use a set to only keep distinct lines
    for line in f:                # iterate the input file
        lr = line.rstrip()
        # keep lines that start with [SUM] and end with sender or receiver
        if line.startswith('[SUM]') and lr.endswith(('sender', 'receiver')):
            if lr not in seen:
                seen.add(lr)
                _ = file.write(line)
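If the search is actually more complex and does require the re module, I would still process one line at a time and compile the regex outside the loop: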
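import re

# placeholder pattern, shown only as an example; substitute your own
pattern = r"^\s*\[SUM\].*(sender|receiver)$"

with open(r"C:\Users\file1.txt", 'r') as f, open('file2.txt', "w") as file:
    seen = set()                  # use a set to only keep distinct lines
    rx = re.compile(pattern)      # compiled once, before the loop
    for line in f:                # iterate the input file
        lr = line.rstrip()
        if rx.match(lr):
            if lr not in seen:
                seen.add(lr)
                _ = file.write(line)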
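If you need to search for 2 patterns and make sure that the matches for the first pattern are written before the matches for the second one, you can use: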
import re

patterns = [r"^\s*\[SUM\]\s*[0-9\-\.]+\s+sec(?!\s+0\.00 Bytes).*sender.*",
            r"^\s*\[SUM\]\s*[0-9\-\.]+\s+sec(?!\s+0\.00 Bytes).*receiver.*"]
rxs = [re.compile(pattern) for pattern in patterns]

with open(r"C:\Users\file1.txt", 'r') as f:
    data = [[], []]               # one list of matches per pattern
    seen = set()                  # use a set to only keep distinct lines
    for line in f:                # iterate the input file
        lr = line.rstrip()
        for i, rx in enumerate(rxs):
            if rx.match(lr):
                if lr not in seen:
                    seen.add(lr)
                    data[i].append(line)

# write the sender matches first, then the receiver matches
with open('file2.txt', "w") as file:
    for lst in data:
        for line in lst:
            _ = file.write(line)
It gives the expected result:
[SUM] 0.00-34.53 sec 2.11 GBytes 524 Mbits/sec sender
[SUM] 0.00-34.62 sec 2.36 GBytes 586 Mbits/sec sender
[SUM] 0.00-34.75 sec 2.39 GBytes 591 Mbits/sec receiver
Answer 2 (score: 0)
If you want a unique list, just add:
list(set(items))
before writing to the file.
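One caveat: a `set` does not preserve the order of the lines, so the output may come out shuffled. A minimal sketch of an order-preserving alternative uses `dict.fromkeys`, which keeps insertion order in Python 3.7 and later. The `items` list below is a hypothetical stand-in for the result of `re.findall` in the question.

```python
# Hypothetical stand-in for the list returned by re.findall in the question.
items = [
    "[SUM] 0.00-34.53 sec 2.11 GBytes 524 Mbits/sec sender",
    "[SUM] 0.00-34.53 sec 2.11 GBytes 524 Mbits/sec sender",
    "[SUM] 0.00-34.62 sec 2.36 GBytes 586 Mbits/sec sender",
    "[SUM] 0.00-34.62 sec 2.36 GBytes 586 Mbits/sec sender",
]

# dict keys are unique and keep first-seen order, unlike set().
unique_items = list(dict.fromkeys(items))

for line in unique_items:
    print(line)
```

The loop that writes to file2.txt can then iterate over `unique_items` instead of `items`.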