How to write a regular expression for the following use case

Time: 2014-12-23 06:27:22

Tags: python regex

I have the following text.

<!-- FEO DEBUG OUTPUT [TextTransAttempted:RENAME_JAVASCRIPT(9), RENAME_IMAGE(59), MINIFY_JAVASCRIPT(10), (1), EMBED_JAVASCRIPT(2), RENAME_CSS(3), (1), IMAGE_COMPRESSION(59);TextTransApplied:RENAME_JAVASCRIPT(9), RENAME_IMAGE(59), MINIFY_JAVASCRIPT(10), (1), EMBED_JAVASCRIPT(2), RENAME_CSS(3), (1), IMAGE_COMPRESSION(59);TagTransAttempted:(73);TagTransApplied:(73); ] -->

I need to extract the tags along with their numbers. I have the following in Python.

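import re

result = {}
# feed holds the debug comment shown above
tag_list = re.findall(r'[A-Z]+(?:_[A-Z\d]+)+\(\d+\)', str(feed))
for tag in tag_list:
    index = tag.index('(')
    # name before '(' is the key, the number inside the parentheses is the value
    result[tag[:index]] = int(tag.split("(")[1].rstrip(")"))
print(result)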

This prints the following output:

{'RENAME_CSS': 3, 'IMAGE_COMPRESSION': 59, 'MINIFY_JAVASCRIPT': 10, 'RENAME_JAVASCRIPT': 9, 'RENAME_IMAGE': 59, 'EMBED_JAVASCRIPT': 2}

Now I want to do this only for the applied transformations in the text above. That is, I want the information above only for 'TextTransApplied' or 'TagTransApplied'.

I tried the following:

re.findall(r'TextTransApplied:[A-Z]+(?:_[A-Z\d]+)+\(\d+\)', str(feed))

But this returns only the first value. How can I get all of the values for the Applied sections?

2 answers:

Answer 0 (score: 1)

It is best to first grab everything that belongs to TagTransApplied / TextTransApplied, and then extract the parts you need from that:

import re

feed = """<!-- FEO DEBUG OUTPUT [TextTransAttempted:RENAME_JAVASCRIPT(9), RENAME_IMAGE(59), MINIFY_JAVASCRIPT(10), (1), EMBED_JAVASCRIPT(2), RENAME_CSS(3), (1), IMAGE_COMPRESSION(59);TextTransApplied:RENAME_JAVASCRIPT(9), RENAME_IMAGE(59), MINIFY_JAVASCRIPT(10), (1), EMBED_JAVASCRIPT(2), RENAME_CSS(3), (1), IMAGE_COMPRESSION(59);TagTransAttempted:(73);TagTransApplied:(73); ] -->"""

result = dict()
tagged = re.findall(r'T(?:ag|ext)TransApplied[^;]+', str(feed))
for part in tagged:
    tag_list = re.findall(r'[A-Z]+(?:_[A-Z\d]+)+\(\d+\)', part)
    for tag in tag_list:
        index = tag.index('(')  # avoid shadowing the built-in id()
        result[tag[:index]] = int(tag.split("(")[1].rstrip(")"))
print(result)

Result:

{'RENAME_CSS': 3, 'IMAGE_COMPRESSION': 59, 'MINIFY_JAVASCRIPT': 10, 'RENAME_JAVASCRIPT': 9, 'RENAME_IMAGE': 59, 'EMBED_JAVASCRIPT': 2}
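
The same two-step idea can also be sketched with capture groups, so that the name and the number come out of the match directly instead of being sliced out of the string. This is only an illustration: the function name `parse_applied`, the two compiled patterns, and the shortened `sample` string are assumptions made for the example, not part of the original answer.

```python
import re

# Shortened, hypothetical sample in the same shape as the debug comment above.
sample = (
    "<!-- FEO DEBUG OUTPUT [TextTransAttempted:RENAME_JAVASCRIPT(19), RENAME_CSS(3), (1);"
    "TextTransApplied:RENAME_JAVASCRIPT(9), RENAME_CSS(3), (1);"
    "TagTransAttempted:(73);TagTransApplied:(73); ] -->"
)

# Step 1: capture everything between "...TransApplied:" and the next ';'.
SECTION_RE = re.compile(r'T(?:ag|ext)TransApplied:([^;]*)')
# Step 2: capture the tag name and its number as two separate groups.
ITEM_RE = re.compile(r'([A-Z]+(?:_[A-Z\d]+)+)\((\d+)\)')


def parse_applied(text):
    """Return {tag_name: count} for the Applied sections only."""
    result = {}
    for body in SECTION_RE.findall(text):
        for name, count in ITEM_RE.findall(body):
            result[name] = int(count)
    return result


applied = parse_applied(sample)
print(applied)
```

With this sample, the value for RENAME_JAVASCRIPT comes from the Applied section (9) and not from the Attempted section (19). Unnamed entries such as (1) and (73) are skipped, because the item pattern requires a name.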


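Answer 1 (score: 0)

Try capturing everything inside a capture group and then processing the string.
(I modified your existing logic slightly, and I changed RENAME_JAVASCRIPT(9) to RENAME_JAVASCRIPT(19) in the Attempted section just to show the difference.)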

import re
s = '<!-- FEO DEBUG OUTPUT [TextTransAttempted:RENAME_JAVASCRIPT(19), RENAME_IMAGE(59), MINIFY_JAVASCRIPT(10), (1), EMBED_JAVASCRIPT(2), RENAME_CSS(3), (1), IMAGE_COMPRESSION(59);TextTransApplied:RENAME_JAVASCRIPT(9), RENAME_IMAGE(59), MINIFY_JAVASCRIPT(10), (1), EMBED_JAVASCRIPT(2), RENAME_CSS(3), (1), IMAGE_COMPRESSION(59);TagTransAttempted:(73);TagTransApplied:(73); ] -->'
tag_list = re.findall(r'(?:TextTransAttempted|TextTransApplied):\s*((?:(?:[A-Z]+(?:_[A-Z\d]+)+)?\(\d+\)\s*(?:,\s*|;))*)', s)
for tag in tag_list:
    result = {}
    for e in tag.split(","):
        index = e.index('(')
        if e[:index].strip():
            result[e[:index].strip()] = (e.split("(")[1].rstrip(");"))
    print(result)


'''
OUTPUT
>>> 
{'RENAME_CSS': '3', 'IMAGE_COMPRESSION': '59', 'MINIFY_JAVASCRIPT': '10', 'RENAME_JAVASCRIPT': '19', 'RENAME_IMAGE': '59', 'EMBED_JAVASCRIPT': '2'}
{'RENAME_CSS': '3', 'IMAGE_COMPRESSION': '59', 'MINIFY_JAVASCRIPT': '10', 'RENAME_JAVASCRIPT': '9', 'RENAME_IMAGE': '59', 'EMBED_JAVASCRIPT': '2'}
'''
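
If the section name should be kept alongside its values, one possible variation is to capture the section name as well and build a nested dictionary. This is a rough sketch under assumptions: `parse_sections`, both patterns, and the shortened `sample` string are hypothetical names and data chosen for illustration, and the numbers are converted to `int` here, unlike the string values printed above.

```python
import re

# Shortened, hypothetical sample in the same shape as the string s above.
sample = (
    "<!-- FEO DEBUG OUTPUT [TextTransAttempted:RENAME_JAVASCRIPT(19), RENAME_CSS(3), (1);"
    "TextTransApplied:RENAME_JAVASCRIPT(9), RENAME_CSS(3), (1);"
    "TagTransAttempted:(73);TagTransApplied:(73); ] -->"
)

# Group 1 is the section name, group 2 is everything up to the next ';'.
SECTION_RE = re.compile(r'(T(?:ag|ext)Trans(?:Attempted|Applied)):([^;]*)')
# Group 1 is the tag name, group 2 is the number in parentheses.
ITEM_RE = re.compile(r'([A-Z]+(?:_[A-Z\d]+)+)\((\d+)\)')


def parse_sections(text):
    """Return {section_name: {tag_name: count}} for every section found."""
    sections = {}
    for section, body in SECTION_RE.findall(text):
        sections[section] = {
            name: int(count) for name, count in ITEM_RE.findall(body)
        }
    return sections


sections = parse_sections(sample)
print(sections['TextTransApplied'])
```

Sections that contain only unnamed counts, such as TagTransApplied:(73), end up with an empty dictionary in this sketch, since the item pattern only matches named entries.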