Extracting a list of strings from a file

Time: 2018-12-04 19:41:18

Tags: python regex

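I have to extract 3 strings from a file, shown here:

https://pastebin.com/kRd0ecK3

I only need the 3 strings that come immediately before the keyword ">> For"; those are the expected result for the above file.

I wrote the following code to extract the list of strings, but it does not extract them correctly: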

import re
import sys

contents = "JLYLFPMKKLZDSRLBTEKH                                        KMZMGQNLLMAETSMCUFLI                                         KXKEOLJJKYCRQKASDJG                    J                    LYLFPMKKLZDSRLBTEKH                    K                    MZMGQNLLMAETSMCUFLI                    L                    KXKEOLJJKYCRQKASDJGJ                                        LYLFPMKKLZDSRLBTEKHK                                        MZMGQNLLMAETSMCUFLIL                                        KXKEOLJJKYCRQKASDJGJ                                        LYLFPMKKLZDSRLBTEKHK                                        MZMGQNLLMAETSMCUFLIL                    >> For"

m = re.match(r'(.*)[A-Z]{20}\s{40}(.*)\s{20}>> For', contents)

if m:
    print m.group(1)

3 answers:

Answer 0 (score: 1)

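# x is the file contents as a single string; the raw string keeps \w and \s from being treated as string escapes
re.findall(r'(\w{20}\s+\w{20}\s+\w{20}\s+)>> For', x)[0].split()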

This should return what you want:

['KXKEOLJJKYCRQKASDJGJ', 'LYLFPMKKLZDSRLBTEKHK', 'MZMGQNLLMAETSMCUFLIL']
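
The one-liner above raises an IndexError when the pattern is not found, because `[0]` is applied to an empty list. A minimal sketch of a guarded version follows. The helper name `strings_before_marker` and the sample text are assumptions made for illustration; the pattern is the one from this answer.

```python
import re

# Pattern taken from the answer above, compiled once for reuse.
PATTERN = re.compile(r'(\w{20}\s+\w{20}\s+\w{20}\s+)>> For')


def strings_before_marker(text):
    """Return the three 20-character strings before '>> For', or [] if absent.

    Hypothetical helper, not part of the original answer.
    """
    found = PATTERN.findall(text)
    # findall returns a list of captured groups; it is empty when nothing matches.
    return found[0].split() if found else []


# Illustrative sample built from repeated letters so each token is exactly 20 long.
sample = "  ".join(["A" * 20, "B" * 20, "C" * 20]) + "  >> For"
print(strings_before_marker(sample))
print(strings_before_marker("no marker here"))
```

Returning an empty list instead of raising is a design choice; raising a descriptive exception would work equally well if a missing marker should be treated as an error.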

Answer 1 (score: 1)

You can use this regex,

([A-Z]{20})\s+([A-Z]{20})\s+([A-Z]{20})\s+>>\s*For

and capture group 1, group 2 and group 3.

Sample Python code,

import re
contents = 'JLYLFPMKKLZDSRLBTEKH                                        KMZMGQNLLMAETSMCUFLI                                         KXKEOLJJKYCRQKASDJG                    J                    LYLFPMKKLZDSRLBTEKH                    K                    MZMGQNLLMAETSMCUFLI                    L                    KXKEOLJJKYCRQKASDJGJ                                        LYLFPMKKLZDSRLBTEKHK                                        MZMGQNLLMAETSMCUFLIL                                        KXKEOLJJKYCRQKASDJGJ                                        LYLFPMKKLZDSRLBTEKHK                                        MZMGQNLLMAETSMCUFLIL                    >> For'
m = re.match(r'.*([A-Z]{20})\s+([A-Z]{20})\s+([A-Z]{20})\s+>>\s*For', contents)
if m:
    print(m.group(1))
    print(m.group(2))
    print(m.group(3))

which prints

KXKEOLJJKYCRQKASDJGJ
LYLFPMKKLZDSRLBTEKHK
MZMGQNLLMAETSMCUFLIL
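
One caveat when the text really comes from a file: `.` does not match a newline by default, so the leading `.*` used with `re.match` cannot skip over earlier lines. A possible variation, sketched here under the assumption that the strings may be separated by line breaks, drops the `.*` and uses `re.search` instead. The helper name `extract_groups` and the sample text are made up for illustration. Note that `re.search` returns the first occurrence, whereas the greedy `.*` version returns the last one; the two differ only if ">> For" appears more than once.

```python
import re

# Same three-group pattern as above, without the leading '.*'.
PATTERN = re.compile(r'([A-Z]{20})\s+([A-Z]{20})\s+([A-Z]{20})\s+>>\s*For')


def extract_groups(text):
    """Return the three captured strings as a list, or None if there is no match.

    Hypothetical helper, not part of the original answer.
    """
    m = PATTERN.search(text)  # search scans the whole string, across newlines
    return list(m.groups()) if m else None


# Illustrative multi-line sample: one extra string, then the three wanted ones.
text = "\n".join(["X" * 20, "A" * 20, "B" * 20, "C" * 20 + "   >> For"])
print(extract_groups(text))
```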

Answer 2 (score: 1)

A simple and dumb non-regex solution: it uses split with no separator, so it doesn't care about line breaks, spaces, etc.

contents = "JLYLFPMKKLZDSRLBTEKH                                        KMZMGQNLLMAETSMCUFLI                                         KXKEOLJJKYCRQKASDJG                    J                    LYLFPMKKLZDSRLBTEKH                    K                    MZMGQNLLMAETSMCUFLI                    L                    KXKEOLJJKYCRQKASDJGJ                                        LYLFPMKKLZDSRLBTEKHK                                        MZMGQNLLMAETSMCUFLIL                                        KXKEOLJJKYCRQKASDJGJ                                        LYLFPMKKLZDSRLBTEKHK                                        MZMGQNLLMAETSMCUFLIL                    >> For"

toks = contents.split()
for i in range(len(toks)-1):
    if toks[i]==">>" and toks[i+1]=="For":
        print(toks[i-3:i])
        break

prints:

['KXKEOLJJKYCRQKASDJGJ', 'LYLFPMKKLZDSRLBTEKHK', 'MZMGQNLLMAETSMCUFLIL']
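
The same loop can be wrapped in a reusable function. The sketch below is one way to do it; the name `tokens_before_marker` and the `count` parameter are assumptions for illustration. It also clamps the slice start, since in the loop above `i-3` goes negative when fewer than three tokens precede ">>", and a negative slice start counts from the end of the list.

```python
def tokens_before_marker(text, count=3):
    """Return up to `count` whitespace-separated tokens preceding '>> For'.

    Hypothetical helper; returns [] when the marker is not present.
    """
    toks = text.split()  # no separator: splits on any run of whitespace
    for i in range(len(toks) - 1):
        if toks[i] == ">>" and toks[i + 1] == "For":
            # max() keeps the slice start from going negative
            return toks[max(0, i - count):i]
    return []


print(tokens_before_marker("a b c d >> For"))
print(tokens_before_marker("a >> For"))
```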