Regular expression matching across record boundaries

Time: 2015-03-06 20:52:40

Tags: python regex

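I'm trying to write a Python script that searches large text files for Oracle error numbers. The files have no guaranteed record separator, so I process them in multi-byte chunks.

Matching the regular expression inside a chunk seems trivial, but I'm having trouble handling partial matches at the beginning or end of a chunk.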

The full regular expression to match an Oracle error number is something like the following:

`"ORA\-[0-9]{1,5}"`

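How do I write a regular expression that matches subsets of it? For example, a partial match at the end of a chunk would be one of the following: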

(O$, OR$, ORA$, ORA\-$, ORA\-n$, or ORA\-nn$)

Conversely, at the beginning of a chunk I would search for

(^n, ^nn, ^\-nn, ^A\-nn, or ^RA\-nn)

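A partial match at the end of a chunk would be saved so that it can be compared with the start of the next chunk.

Positive lookaround seemed promising, but it doesn't match the other characters I require. Can this kind of lookup be done efficiently with regular expressions?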

1 Answer:

Answer 0: (score: 1)

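I think the real answer here is that you don't want to use raw regular expressions. Regular expressions are a bit too high-level for what you're trying to do. What you need is a tokenizer. Tokenizers are a well-understood technique because they are an essential part of every compiler. A tokenizer is what breaks text into lexemes, the pieces of the text that mean something. The key characteristic that matters for you here is that a tokenizer looks at one character at a time to tokenize the source string. This lets you stream the file instead of loading it in chunks, and avoids all the messiness of chunk boundaries.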

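A tokenizer is just an implementation of a finite state machine. (You should note that regular expressions are also just definitions of finite state machines.) All you have to do is identify your states and decide when to emit a lexeme. Since you only have a small number of states to work with, this really isn't that hard. The idea is pretty basic: you write a big if/else block that first checks which state you're currently in (which you reached by looking at the preceding characters), and then applies some further conditional logic based on what the current character is.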
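
To make the idea concrete, here is a minimal sketch of such a state machine written as a generator. The names `find_oracle_errors`, `PREFIX`, `DIGITS` and `MAX_DIGITS` are made up for this illustration, and it assumes the input is text (not bytes) delivered one character at a time. The state is simply how much of the pattern has been seen so far.

```python
import io

PREFIX = "ORA-"        # literal part of the pattern
DIGITS = "0123456789"  # ASCII digits only, like [0-9]
MAX_DIGITS = 5         # from the {1,5} quantifier in the question


def find_oracle_errors(chars):
    """Yield each ORA-nnnnn match from an iterable of single characters."""
    matched = ""  # the part of the pattern seen so far; doubles as the state
    for c in chars:
        if len(matched) < len(PREFIX):
            # State: still inside the literal prefix.
            if c == PREFIX[len(matched)]:
                matched += c
            else:
                # Mismatch: restart, keeping c if it can begin a new match.
                matched = c if c == PREFIX[0] else ""
        elif c in DIGITS and len(matched) < len(PREFIX) + MAX_DIGITS:
            # State: prefix complete, still collecting digits.
            matched += c
        else:
            # State: prefix complete, and c cannot extend the match.
            if len(matched) > len(PREFIX):
                yield matched  # at least one digit was seen
            matched = c if c == PREFIX[0] else ""
    # End of input: a match that runs up to the last character still counts.
    if len(matched) > len(PREFIX):
        yield matched


# Usage: feed it one character at a time from any text stream.
stream = io.StringIO("bad: ORA-00942, then ORA-1")
chars = iter(lambda: stream.read(1), "")
print(list(find_oracle_errors(chars)))  # ['ORA-00942', 'ORA-1']
```

Because the function only ever sees one character, it does not care where the underlying reads start or stop, which is exactly why the chunk boundary problem goes away.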

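As an aside, if you want to understand all of this better, take a compilers course. The concepts and techniques you'll learn there are very useful for complex text processing. It's a little surprising how often they turn out to be a good solution when you're building something that processes text.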

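Tokenizer code tends to be a bit verbose and ugly, but it's pretty standard. The fact that it more or less follows a standard pattern makes it relatively easy to understand and, most importantly, to get working. I wrote one below. There's probably a shorter way to write the checks for multiple digits, but I did it the long way to make it easier to see what's going on. I haven't actually tested this code, so test and debug it thoroughly, but the logic should be sound. Good luck.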

import re

# Gonna be using this a lot, so compile it.
digit_pattern = re.compile('[0-9]')

# We're creating a class because there's a little bit of state to maintain.
class OracleErrorFinder(object):
    def __init__(self, input_file):
        self.input_file = input_file
        # This seems weird, but there's a good reason.
        # When we get to the end of a match, we're going to have already consumed
        # the next character from the file. So we need to save it for the next round.
        self.next_char = None

    def find_next_match(self):
        # Possible states are
        # '': We haven't found any portion of the pattern yet.
        # 'O': We found an O
        # 'R': We found an OR
        # 'A': We found an ORA
        # '-': We found an ORA-
        # 'num1': We found ORA-[0-9]
        # 'num2': We found ORA-[0-9][0-9]
        # 'num3': We found ORA-[0-9][0-9][0-9]
        # 'num4': We found ORA-[0-9][0-9][0-9][0-9]
        # 'num5': We found ORA-[0-9][0-9][0-9][0-9][0-9], and we're done

        current_state = ''
        match_so_far = ''
        done = False
        while not done:
            if self.next_char:
                # If we have a leftover char from last time, 
                # start with that and clear it.
                c = self.next_char
                self.next_char = None
            else:
                c = self.input_file.read(1)

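            if '' == c:
                # End of stream. If we had already seen at least one digit,
                # what we have is a complete match; otherwise there is none.
                if not current_state.startswith('num'):
                    match_so_far = None
                done = True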
            elif '' == current_state and 'O' == c:
                # We found the start of what we're looking for.
                # We don't know if it's the whole thing,
                # so we just save it and go to the next character.
                current_state = 'O'
                match_so_far = 'O'
            elif 'O' == current_state and 'R' == c:
                # We already have an O and now we found the next character!
                current_state = 'R'
                match_so_far += c
            elif 'R' == current_state and 'A' == c:
                current_state = 'A'
                match_so_far += c
            elif 'A' == current_state and '-' == c:
                current_state = '-'
                match_so_far += c
            elif '-' == current_state and digit_pattern.match(c):
                current_state = 'num1'
                match_so_far += c
            elif 'num1' == current_state:
                if digit_pattern.match(c):
                    current_state = 'num2'
                    match_so_far += c
                else:
                    # We found a full match,
                    # but not more numbers past the last one.
                    # Time to return what we found.
                    done = True
            elif 'num2' == current_state:
                if digit_pattern.match(c):
                    current_state = 'num3'
                    match_so_far += c
                else:
                    # We found a full match,
                    # but not more numbers past the last one.
                    # Time to return what we found.
                    done = True
            elif 'num3' == current_state:
                if digit_pattern.match(c):
                    current_state = 'num4'
                    match_so_far += c
                else:
                    # We found a full match,
                    # but not more numbers past the last one.
                    # Time to return what we found.
                    done = True
            elif 'num4' == current_state:
                if digit_pattern.match(c):
                    current_state = 'num5'
                    match_so_far += c
                else:
                    # We found a full match,
                    # but not more numbers past the last one.
                    # Time to return what we found.
                    done = True
            elif 'num5' == current_state:
                # We're done for sure!
                # Note that we read the next character from the file.
                # Important for code after the loop.
                done = True
            else:
                # We didn't find the next character we wanted.
                if 'O' == c:
                    # We didn't find a full match, but this starts
                    # a new one.
                    current_state = 'O'
                    match_so_far = 'O'
                else:
                    # This character doesn't match our pattern.
                    # It could be a character that's in the wrong place
                    # (such as the - in OR-) or a character that just
                    # doesn't appear in the pattern at all (like X).
                    # We might be in the middle of a partial
                    # match, so throw everything found so far away
                    # and keep going.
                    current_state = ''
                    match_so_far = ''

        # Save next char already consumed from file stream.
        # Could be empty string if we consumed the whole file,
        # but that's fine.
        self.next_char = c
        return match_so_far

# filename is a placeholder: set it to the path of the file to search.
with open(filename) as f:
    finder = OracleErrorFinder(f)
    while True:
        match = finder.find_next_match()
        if match is None:
            break
        # Print, send to file, add to list, what have you
        print(match)
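
If you would still rather read in chunks, the carry-over idea from the question can be sketched roughly as follows. This is only a sketch under a few assumptions: the stream is opened in text mode, and `find_in_chunks`, `PARTIAL_TAIL`, `MAX_PARTIAL` and the default chunk size are names and values invented for this example. The trailing piece of each buffer that could still grow into a match is cut off and prepended to the next chunk, so no match is ever split.

```python
import io
import re

FULL = re.compile(r"ORA-[0-9]{1,5}")
# A suffix of the buffer that could still grow into a (longer) match.
# A five-digit match cannot grow any further, hence {0,4}.
PARTIAL_TAIL = re.compile(r"(?:O|OR|ORA|ORA-[0-9]{0,4})\Z")
MAX_PARTIAL = 8  # len("ORA-") plus four digits


def find_in_chunks(stream, chunk_size=65536):
    """Yield ORA-nnnnn matches from a text stream read in fixed-size chunks."""
    carry = ""
    while True:
        chunk = stream.read(chunk_size)
        buf = carry + chunk
        if not chunk:
            # End of input: whatever is left can no longer be extended.
            for m in FULL.finditer(buf):
                yield m.group()
            return
        # Only the last few characters can hold an unfinished match.
        tail = PARTIAL_TAIL.search(buf, max(0, len(buf) - MAX_PARTIAL))
        cut = tail.start() if tail else len(buf)
        for m in FULL.finditer(buf[:cut]):
            yield m.group()
        carry = buf[cut:]


# Usage: a deliberately tiny chunk size forces matches to straddle chunks.
text = "aaORA-123 bbORA-00942cc ORA-7"
print(list(find_in_chunks(io.StringIO(text), chunk_size=4)))
```

The result should be the same as running `FULL.findall` over the whole text at once; the cut is safe because an unfinished tail always starts at an `O`, and `O` can only appear at the start of a match.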