Question

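I have a lot of replacement patterns that I need for text cleaning. For performance reasons I load the data from the database and compile the regular expressions up front. Unfortunately, with my approach only the last assignment to the variable "text" seems to take effect, while the others appear to be overwritten: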

# -*- coding: utf-8 -*-
import cx_Oracle
import re

connection = cx_Oracle.connect("SCHEMA", "passWORD", "TNS")
cursor = connection.cursor()
cursor.execute("""select column_1, column_2
from table""")

# Variables for matching
REPLACE_1 = re.compile(r'(sample_pattern_1)')
REPLACE_2 = re.compile(r'(sample_pattern_2)')
# ..
REPLACE_99 = re.compile(r'(sample_pattern_99)')
REPLACE_100 = re.compile(r'(sample_pattern_100)')

def extract_from_db():
    text = ''
    for row in cursor:
        # sidenote: each substitution text has the same name as the corresponding variable, but as a string of course
        text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
        text = REPLACE_2.sub(r'REPLACE_2',str(row[0]))
        # ..
        text = REPLACE_99.sub(r'REPLACE_99',str(row[0]))
        text = REPLACE_100.sub(r'REPLACE_100',str(row[0]))
        print text

extract_from_db()

Does anyone know an elegant way to solve this? Or do I have to handle it with a huge if / elif control structure?

Answer 1

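You keep overwriting the previous result with the result of a substitution on str(row[0]). Pass text in instead, so the substitutions accumulate: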

text = REPLACE_1.sub(r'REPLACE_1', str(row[0]))
text = REPLACE_2.sub(r'REPLACE_2', text)
# ..
text = REPLACE_99.sub(r'REPLACE_99', text)
text = REPLACE_100.sub(r'REPLACE_100', text)
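
To see the difference in isolation, here is a minimal sketch that needs no database. The two patterns and the sample string are made up for illustration and are not the asker's real rules; print is written with parentheses so the snippet runs on both Python 2 and Python 3.

```python
import re

# Hypothetical patterns standing in for the real cleanup rules.
REPLACE_1 = re.compile(r'(foo)')
REPLACE_2 = re.compile(r'(bar)')

source = 'foo and bar'

# Overwriting: every call starts again from the source string,
# so only the last substitution survives.
overwritten = REPLACE_1.sub('REPLACE_1', source)
overwritten = REPLACE_2.sub('REPLACE_2', source)

# Chaining: every call continues from the previous result.
chained = REPLACE_1.sub('REPLACE_1', source)
chained = REPLACE_2.sub('REPLACE_2', chained)

print(overwritten)  # foo and REPLACE_2
print(chained)      # REPLACE_1 and REPLACE_2
```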

You'd be better off using an actual list:

REPLACEMENTS = [
    (re.compile(r'(sample_pattern_1)'), r'REPLACE_1'),
    (re.compile(r'(sample_pattern_2)'), r'REPLACE_2'),
    # ..
    (re.compile(r'(sample_pattern_99)'), r'REPLACE_99'),
    (re.compile(r'(sample_pattern_100)'), r'REPLACE_100'),
]

and use them in a loop:

text = str(row[0])
for pattern, replacement in REPLACEMENTS:
    text = pattern.sub(replacement, text)

or use functools.partial() to simplify the loop further:

from functools import partial

REPLACEMENTS = [
    partial(re.compile(r'(sample_pattern_1)').sub, r'REPLACE_1'),
    partial(re.compile(r'(sample_pattern_2)').sub, r'REPLACE_2'),
    # ..
    partial(re.compile(r'(sample_pattern_99)').sub, r'REPLACE_99'),
    partial(re.compile(r'(sample_pattern_100)').sub, r'REPLACE_100'),
]

and the loop:

text = str(row[0])
for replacement in REPLACEMENTS:
    text = replacement(text)

or, with the above list of patterns wrapped in partial() objects, use reduce():

text = reduce(lambda txt, repl: repl(txt), REPLACEMENTS, str(row[0]))
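
One caveat with the reduce() variant: reduce is a builtin only on Python 2. On Python 3 it has to be imported from functools, and that import also works on Python 2.6 and later. A small self-contained sketch follows; the two patterns and the clean() helper are invented for the example.

```python
import re
from functools import partial, reduce  # the reduce import is required on Python 3

# Hypothetical cleanup rules: collapse whitespace, then mask numbers.
REPLACEMENTS = [
    partial(re.compile(r'\s+').sub, ' '),
    partial(re.compile(r'\d+').sub, 'NUM'),
]

def clean(value):
    # Feed the running text through each bound .sub() call in order.
    return reduce(lambda txt, repl: repl(txt), REPLACEMENTS, str(value))

print(clean('order   42  shipped'))  # order NUM shipped
```

The replacements run in list order, so a rule whose output could match a later pattern should be placed with that in mind.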

Answer 2

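Your approach is fine; however, on every line you apply the regex to the original string. You need to apply it to the result of the previous line, i.e.: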

def extract_from_db():
    text = ''
    for row in cursor:
        # sidenote: each substitution text has the same name as the corresponding variable, but as a string of course
        # This one stays the same - initialize from the row
        text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
        # For these, route text back into it
        text = REPLACE_2.sub(r'REPLACE_2',text)
        # ..
        text = REPLACE_99.sub(r'REPLACE_99',text)
        text = REPLACE_100.sub(r'REPLACE_100',text)
        print text

Answer 3

It looks like what you need is:

    text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
    text = REPLACE_2.sub(r'REPLACE_2',text)
    # ..
    text = REPLACE_99.sub(r'REPLACE_99',text)
    text = REPLACE_100.sub(r'REPLACE_100',text)

Answer 4

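May I suggest building a list of patterns and their replacement values, and then iterating over it? That way you don't have to modify the function every time you want to update the patterns: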

import cx_Oracle
import re

connection = cx_Oracle.connect("SCHEMA", "passWORD", "TNS")
cursor = connection.cursor()
cursor.execute("""select column_1, column_2
from table""")

REPLACEMENTS = [
    (re.compile(r'(sample_pattern_1)'), 'REPLACE_1'),
    (re.compile(r'(sample_pattern_2)'), 'REPLACE_2'),
    # ..
    (re.compile(r'(sample_pattern_99)'), 'REPLACE_99'),
    (re.compile(r'(sample_pattern_100)'), 'REPLACE_100'),
]

def extract_from_db():
    for row in cursor:
        text = str(row[0])
        for pattern, replacement in REPLACEMENTS:
            text = pattern.sub(replacement, text)

        print text

extract_from_db()
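
If you want to try the list-based loop without an Oracle connection, one option is to have the function take any iterable of rows instead of reading the global cursor. This is only a sketch: clean_rows and fake_rows are names made up here, and the tuples just mimic the (column_1, column_2) shape of the query.

```python
import re

REPLACEMENTS = [
    (re.compile(r'(sample_pattern_1)'), 'REPLACE_1'),
    (re.compile(r'(sample_pattern_2)'), 'REPLACE_2'),
]

def clean_rows(rows):
    # rows can be a database cursor or any iterable of tuples.
    for row in rows:
        text = str(row[0])
        for pattern, replacement in REPLACEMENTS:
            text = pattern.sub(replacement, text)
        yield text

# Stand-in for the cursor, so the loop can be checked without a database.
fake_rows = [
    ('x sample_pattern_1 y', None),
    ('sample_pattern_2 sample_pattern_1', None),
]
cleaned = list(clean_rows(fake_rows))
print(cleaned)  # ['x REPLACE_1 y', 'REPLACE_2 REPLACE_1']
```

With the real connection you would pass the cursor in, e.g. clean_rows(cursor), and print each yielded string.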

List of compiled regular expressions in Python
