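List of compiled regular expressions in Python

Time: 2014-03-11 16:18:29

Tags: python regex

I have a lot of replacement patterns that I need for text cleaning. For performance reasons, I load the data from a database and compile the regular expressions up front. Unfortunately, with my approach only the last assignment to the variable "text" seems to take effect, while the others appear to be overwritten: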

# -*- coding: utf-8 -*-
import cx_Oracle
import re

connection = cx_Oracle.connect("SCHEMA", "passWORD", "TNS")
cursor = connection.cursor()
cursor.execute("""select column_1, column_2
from table""")

# Variables for matching
REPLACE_1 = re.compile(r'(sample_pattern_1)')
REPLACE_2 = re.compile(r'(sample_pattern_2)')
# ..
REPLACE_99 = re.compile(r'(sample_pattern_99)')
REPLACE_100 = re.compile(r'(sample_pattern_100)')

def extract_from_db():
    text = ''
    for row in cursor:
        # sidenote: each substitution text has the same name as the corresponding variable, but as a string of course
        text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
        text = REPLACE_2.sub(r'REPLACE_2',str(row[0]))
        # ..
        text = REPLACE_99.sub(r'REPLACE_99',str(row[0]))
        text = REPLACE_100.sub(r'REPLACE_100',str(row[0]))
        print text

extract_from_db()

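Does anyone know how to solve this elegantly? Or do I have to do it with a huge if/elif control structure?

4 answers:

Answer 0: (score: 7)

You keep overwriting the previous result with a substitution applied to str(row[0]). Use text instead so that the substitutions accumulate: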

text = REPLACE_1.sub(r'REPLACE_1', str(row[0]))
text = REPLACE_2.sub(r'REPLACE_2', text)
# ..
text = REPLACE_99.sub(r'REPLACE_99', text)
text = REPLACE_100.sub(r'REPLACE_100', text)
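
To see the difference concretely, here is a small self-contained sketch. The two patterns and the sample string are made up for illustration and do not come from the question; it is written so that it also runs on Python 3.

```python
import re

# Hypothetical patterns standing in for REPLACE_1 and REPLACE_2.
REPLACE_1 = re.compile(r'(foo)')
REPLACE_2 = re.compile(r'(bar)')

original = 'foo and bar'

# Each call starts again from the original string, so only the
# last substitution survives.
overwritten = REPLACE_1.sub(r'REPLACE_1', original)
overwritten = REPLACE_2.sub(r'REPLACE_2', original)

# Feeding the previous result back in accumulates the substitutions.
accumulated = REPLACE_1.sub(r'REPLACE_1', original)
accumulated = REPLACE_2.sub(r'REPLACE_2', accumulated)

print(overwritten)   # foo and REPLACE_2
print(accumulated)   # REPLACE_1 and REPLACE_2
```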

You would be better off using an actual list:

REPLACEMENTS = [
    (re.compile(r'(sample_pattern_1)'), r'REPLACE_1'),
    (re.compile(r'(sample_pattern_2)'), r'REPLACE_2'),
    # ..
    (re.compile(r'(sample_pattern_99)'), r'REPLACE_99'),
    (re.compile(r'(sample_pattern_100)'), r'REPLACE_100'),
]

and using them in a loop:

text = str(row[0])
for pattern, replacement in REPLACEMENTS:
    text = pattern.sub(replacement, text)
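
As a runnable sketch of that loop, the snippet below wraps it in a helper. The name clean and the two sample patterns are hypothetical, chosen only so the example can run without a database.

```python
import re

# Hypothetical (pattern, replacement) pairs for illustration only.
REPLACEMENTS = [
    (re.compile(r'\d+'), 'NUMBER'),
    (re.compile(r'\s+'), ' '),
]

def clean(value, replacements=REPLACEMENTS):
    """Apply every (pattern, replacement) pair, in order, to str(value)."""
    text = str(value)
    for pattern, replacement in replacements:
        text = pattern.sub(replacement, text)
    return text

print(clean('order   42 shipped'))  # order NUMBER shipped
```

Because the pairs are applied in order, a later pattern sees the output of the earlier ones, so the order of the list can matter.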

Or use functools.partial() to simplify the loop further:

from functools import partial

REPLACEMENTS = [
    partial(re.compile(r'(sample_pattern_1)').sub, r'REPLACE_1'),
    partial(re.compile(r'(sample_pattern_2)').sub, r'REPLACE_2'),
    # ..
    partial(re.compile(r'(sample_pattern_99)').sub, r'REPLACE_99'),
    partial(re.compile(r'(sample_pattern_100)').sub, r'REPLACE_100'),
]

and the loop:

text = str(row[0])
for replacement in REPLACEMENTS:
    text = replacement(text)

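Or apply the above list of patterns wrapped in partial() objects with reduce():

from functools import reduce  # needed on Python 3; reduce is also a builtin on Python 2

text = reduce(lambda txt, repl: repl(txt), REPLACEMENTS, str(row[0]))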
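
A minimal end-to-end sketch of the partial() plus reduce() variant follows. The patterns and the helper name clean are assumptions made for illustration.

```python
import re
from functools import partial, reduce  # reduce is also a builtin on Python 2

# Hypothetical patterns for illustration only.
REPLACEMENTS = [
    partial(re.compile(r'(cat)').sub, 'REPLACE_1'),
    partial(re.compile(r'(dog)').sub, 'REPLACE_2'),
]

def clean(value):
    # Each partial takes the text so far and returns the substituted text.
    return reduce(lambda txt, repl: repl(txt), REPLACEMENTS, str(value))

print(clean('cat chases dog'))  # REPLACE_1 chases REPLACE_2
```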

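Answer 1: (score: 1)

Your approach is fine; however, on every line you apply the regular expression to the original string. You need to apply it to the result of the previous line, i.e.: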

def extract_from_db():
    text = ''
    for row in cursor:
        # sidenote: each substitution text has the same name as the corresponding variable, but as a string of course
        # This one stays the same - initialize from the row
        text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
        # For these, route text back into it
        text = REPLACE_2.sub(r'REPLACE_2',text)
        # ..
        text = REPLACE_99.sub(r'REPLACE_99',text)
        text = REPLACE_100.sub(r'REPLACE_100',text)
        print text

Answer 2: (score: 1)

It looks like what you need is:

    text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
    text = REPLACE_2.sub(r'REPLACE_2',text)
    # ..
    text = REPLACE_99.sub(r'REPLACE_99',text)
    text = REPLACE_100.sub(r'REPLACE_100',text)

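Answer 3: (score: 1)

May I suggest building a list of patterns and their replacement values and then iterating over it? That way you don't have to modify the function every time you want to update the patterns: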

import cx_Oracle
import re

connection = cx_Oracle.connect("SCHEMA", "passWORD", "TNS")
cursor = connection.cursor()
cursor.execute("""select column_1, column_2
from table""")

REPLACEMENTS = [
    (re.compile(r'(sample_pattern_1)'), 'REPLACE_1'),
    (re.compile(r'(sample_pattern_2)'), 'REPLACE_2'),
    # ..
    (re.compile(r'(sample_pattern_99)'), 'REPLACE_99'),
    (re.compile(r'(sample_pattern_100)'), 'REPLACE_100'),
]

def extract_from_db():
    for row in cursor:
        text = str(row[0])
        for pattern, replacement in REPLACEMENTS:
            text = pattern.sub(replacement, text)

        print text

extract_from_db()
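
The same loop can be tried out without an Oracle connection by passing in any iterable of rows. In the sketch below, the function name clean_rows, the sample patterns, and the list of tuples standing in for the cursor are all assumptions for illustration.

```python
import re

# Hypothetical patterns for illustration only.
REPLACEMENTS = [
    (re.compile(r'(alpha)'), 'REPLACE_1'),
    (re.compile(r'(beta)'), 'REPLACE_2'),
]

def clean_rows(rows):
    """Yield the cleaned first column of each row."""
    for row in rows:
        text = str(row[0])
        for pattern, replacement in REPLACEMENTS:
            text = pattern.sub(replacement, text)
        yield text

# A list of tuples stands in for the cx_Oracle cursor here.
fake_rows = [('alpha beta', 1), ('gamma', 2)]
for cleaned in clean_rows(fake_rows):
    print(cleaned)
```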