Question

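Please bear with me: I can't include my 1,000+ line program, and there are several questions in this description.

I'm searching for several types of patterns: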

#literally just a regular word
re.search("Word", arg)

#Varying complex pattern
re.search("[0-9]{2,6}-[0-9]{2}-[0-9]{1}", arg)

#Words with varying cases and the possibility of ending special characters 
re.search("Supplier [Aa]ddress:?|Supplier [Ii]dentification:?|Supplier [Nn]ame:?", arg)

#I also use re.findall for the above patterns as well
re.findall("uses patterns above", arg)

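I have about 75 of these in total, and some need to be passed down into deeply nested functions.

When and where should I compile the patterns?

Right now I'm trying to improve my program by compiling everything in main and then passing the appropriate list of compiled RegexObjects to the functions that use them. Will this improve my performance?

Would doing the following improve my program's speed?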

re.compile("pattern").search(arg)

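Do compiled patterns stay in memory, so that if a function is called multiple times it skips the compilation step? That way I wouldn't have to pass the data from function to function.

Is it even worth compiling all the patterns if I have to move the data around this much?

Is there a better way to match regular words without using regex?

A short example of my code: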

import re

def foo(arg, allWords):
   #Does some things with arg, then puts the result into a variable, 
   # this function does not use allWords

   data = arg #This is the manipulated version of arg

   return(bar(data, allWords))


def bar(data, allWords):
   if allWords[0].search(data) != None:
      temp = data.split("word1", 1)[1]
      return(temp)

   elif allWords[1].search(data) != None:
      temp = data.split("word2", 1)[1]
      return(temp)


def main():

   allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]

   arg = "This is a very long string from a text document input, the provided patterns might not be word1 in this string but I need to check for them, and if they are there do some cool things word3"

   #This loop runs a couple million times 
   # because it loops through a couple million text documents
   while True:
      data = foo(arg, allWords)

Answer 1

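This is a tricky subject: many answers, and even some legitimate sources such as David Beazley's Python Cookbook, will tell you something like:

[Use compile()] when you're going to perform lots of matches using the same pattern. This lets you compile the regex only once versus at each match. [see p. 45 of that book]

However, that really hasn't been true since Python 2.5. Here's a note from the re docs:

Note: The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn't worry about compiling regular expressions.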
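
As a minimal sketch of the caching that note describes: in CPython, compiling the same pattern string twice with the same flags hands back the same cached object. This is an implementation detail rather than a documented guarantee, so treat the identity check as an assumption. The pattern is the date-like one from the question, and the sample string is made up.

```python
import re

PATTERN = r"[0-9]{2,6}-[0-9]{2}-[0-9]{1}"

first = re.compile(PATTERN)
second = re.compile(PATTERN)  # served from re's internal cache in CPython

# Assumption: CPython behaviour; another implementation may build a new object.
same_object = first is second

# The module-level function consults the same cache before it searches.
match = re.search(PATTERN, "order 1234-56-7 shipped")
found = match.group() if match else None
```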

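There are two small arguments against this, but (anecdotally speaking) they won't produce a noticeable timing difference most of the time:

The size of the cache is limited.
Using compiled expressions directly avoids the cache lookup overhead.

Here's a rudimentary test of the above using the 20 newsgroups text dataset. On a relative basis, the speedup from compiling is about 1.6%, presumably due mostly to the cache lookup.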

import re
from sklearn.datasets import fetch_20newsgroups

# A list of length ~20,000, paragraphs of text
news = fetch_20newsgroups(subset='all', random_state=444).data

# The tokenizer used by most text-processing vectorizers such as TF-IDF
regex = r'(?u)\b\w\w+\b'
regex_comp = re.compile(regex)


def no_compile():
    for text in news:
        re.findall(regex, text)


def with_compile():
    for text in news:
        regex_comp.findall(text)

%timeit -r 3 -n 5 no_compile()
1.78 s ± 16.2 ms per loop (mean ± std. dev. of 3 runs, 5 loops each)

%timeit -r 3 -n 5 with_compile()
1.75 s ± 12.2 ms per loop (mean ± std. dev. of 3 runs, 5 loops each)

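This really leaves only one very defensible reason to use re.compile():

By precompiling all expressions when the module is loaded, the compilation work is shifted to application start time, instead of to a point when the program may be responding to a user action. [source; p. 15]

It's not uncommon to see compiled constants declared at the top of a module. For example, in smtplib you'll find OLDSTYLE_AUTH = re.compile(r"auth=(.*)", re.I).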
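
Applied to the question, that convention might look like the sketch below: the patterns are compiled once at import time as module-level constants, so nested functions can use them directly instead of receiving them as arguments. The constant and function names are hypothetical; only the two patterns come from the question.

```python
import re

# Compiled once, when the module is imported. Any function in this module
# can reference these names without having them passed in.
SUPPLIER_FIELD_RE = re.compile(
    r"Supplier [Aa]ddress:?|Supplier [Ii]dentification:?|Supplier [Nn]ame:?"
)
CODE_RE = re.compile(r"[0-9]{2,6}-[0-9]{2}-[0-9]{1}")


def find_codes(text):
    """Return every code-like token in text (hypothetical helper)."""
    return CODE_RE.findall(text)


def has_supplier_field(text):
    """Report whether text contains a supplier field label (hypothetical helper)."""
    return SUPPLIER_FIELD_RE.search(text) is not None
```

For example, find_codes("a 12-34-5 b") returns ["12-34-5"].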

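Note that compiling happens (eventually) whether or not you use re.compile(). When you do use compile(), you're compiling the passed regex at that moment. If you use a module-level function like re.search(), you're compiling and searching in that one call. The two processes below are equivalent in this regard: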

# with re.compile - gets you a regular expression object (class)
#     and then call its method, `.search()`.
a = re.compile('regex[es|p]')  # compiling happens now
a.search('regexp')             # searching happens now

# with module-level function
re.search('regex[es|p]', 'regexp')  # compiling and searching both happen here

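Lastly, you asked,

Is there a better way to match regular words without using regex?

Yes; this is mentioned as a "common problem" in the HOWTO:

Sometimes using the re module is a mistake. If you're matching a fixed string, or a single character class, and you're not using any re features such as the IGNORECASE flag, then the full power of regular expressions may not be required. Strings have several methods for performing operations with fixed strings and they're usually much faster, because the implementation is a single small C loop that's been optimized for the purpose, instead of the large, more generalized regular expression engine. [emphasis added]

...

In short, before turning to the re module, consider whether your problem can be solved with a faster and simpler string method.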
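
For the fixed words in the question, a sketch with plain string methods could look like this. The helper name text_after and the sample string are made up for illustration; the in operator and str.partition are standard string operations.

```python
def text_after(data, word):
    """Return whatever follows the first occurrence of word, or None if absent."""
    # partition() splits at the first occurrence; sep is empty when word is missing.
    before, sep, after = data.partition(word)
    return after if sep else None


sample = "some text word1 and the rest"
contains = "word1" in sample        # plain substring test, no regex involved
tail = text_after(sample, "word1")  # text following the first "word1"
```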

Answer 2

Let's say that word1, word2 ... are regexes.

Let's rewrite this part:

allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]

I would create one single regex for all of the patterns:

allWords = re.compile("|".join(["word1", "word2", "word3"]))

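To support regexes that contain | themselves, you have to parenthesize each expression. Non-capturing groups are used here so that a later split() returns only the text around the match and not the captured groups:

allWords = re.compile("|".join("(?:{})".format(x) for x in ["word1", "word2", "word3"]))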

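(That also works with plain words, of course, and it's still worth using a regex because of the | part.)

Now, this is a loop in disguise, with each term hardcoded: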

def bar(data, allWords):
   if allWords[0].search(data) != None:
      temp = data.split("word1", 1)[1]  # that works only on non-regexes BTW
      return(temp)

   elif allWords[1].search(data) != None:
      temp = data.split("word2", 1)[1]
      return(temp)

It can be rewritten simply as

def bar(data, allWords):
   return allWords.split(data,maxsplit=1)[1]
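
A slightly more defensive sketch of the same idea follows, assuming the patterns are combined with non-capturing groups and that a text with no match should yield None, as the original if/elif chain did. The names PATTERNS and ALL_WORDS are hypothetical.

```python
import re

# Hypothetical patterns standing in for word1, word2, word3.
PATTERNS = ["word1", "word2", "word3"]

# (?:...) groups each alternative without capturing, so split() returns
# only the text before and after the match.
ALL_WORDS = re.compile("|".join("(?:{})".format(p) for p in PATTERNS))


def bar(data, all_words=ALL_WORDS):
    """Return the text after the first match, or None when nothing matches."""
    parts = all_words.split(data, maxsplit=1)
    return parts[1] if len(parts) == 2 else None
```

One behavioural difference from the if/elif chain: this splits at whichever pattern occurs first in the text, whereas the chain always preferred word1 over word2 regardless of position.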

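Performance-wise:

The regular expression is compiled at start, so it's as fast as it can be.
There's no loop and no pasted expressions; the "or" part is done by the regex engine, which is compiled code most of the time: you can't beat that in pure Python.
The match and the split are done in one operation.

The last hiccup is that internally the regex engine searches for all of the expressions in a loop, which makes it an O(n) algorithm. To make it faster, you would have to predict which pattern is the most frequent and put it first. (My assumption is that the regexes are "disjoint", meaning a text cannot be matched by several of them; otherwise the longest one has to come before the shorter one.)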
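
The ordering caveat can be seen in a small sketch: when two alternatives can match at the same position, Python's re takes the first one listed that succeeds, not the longest. The patterns here are made up for illustration.

```python
import re

text = "abc"

# The shorter alternative is listed first, so it wins and "c" is left unmatched.
shorter_first = re.search("ab|abc", text).group()

# Listing the longer alternative first lets it match in full.
longer_first = re.search("abc|ab", text).group()
```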

When to use re.compile
