Large RegEx match causes program to hang

Date: 2012-05-21 15:26:12

Tags: c# wpf regex parsing

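I tried asking this a few days ago, but I admittedly did not phrase the question well or post any code, and the question ended up closed. So I am trying again here, because honestly this is driving me crazy fast. :)

I am trying to implement this Address Parser, which was originally a console-based C# program. I have successfully converted it into a standalone WPF program that contains only a TextBox for input, a Button to trigger the parse, and a TextBlock to display the result. While writing this post I truncated the output to just what I need in my main program, and it still works fine. The complete code is listed below.

My next step was to port it into my main program, which I did by literally copying and pasting. However, when I run it, the program hangs as soon as the button is pressed. Eventually VS reports an error saying the process has gone too long without pumping messages, and the memory usage shown in Task Manager climbs gradually from ~70k to 3,000,000. In response, I moved the parsing method onto a background worker, hoping to take the load off the main thread. That did stop the UI from freezing, but the background thread just does the same thing: RAM usage keeps rising and nothing is ever returned.

So now I am at an impasse. I know the problem is in the var result = parser.ParseAddress(input); statement, because when I put a breakpoint on every line of code, that is the last one to be hit. But I simply cannot understand why this would cause a problem in one WPF program and not in the other.

If necessary, I would be more than happy to post the full source of the main program somewhere, but I cannot imagine that posting about 20 different class files and project code here is a good idea. :)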

Standalone WPF app

namespace AddressParseWPF
{
    /// <summary>
    /// Interaction logic for MainWindow.xaml
    /// </summary>
    public partial class MainWindow : Window
    {
        public MainWindow()
        {
            InitializeComponent();
        }

        public void Execute()
        {
            AddressParser.AddressParser parser = new AddressParser.AddressParser();
            var input = inputTextBox.Text;

            var result = parser.ParseAddress(input);

            if (result == null)
            {
                outputTextBlock.Text = "ERROR. Input could not be parsed.";
            }
            else
            {
                outputTextBlock.Text = (result.StreetLine + ", " + result.City + ", " + result.State + "  " + result.Zip);
            }
        }

        private void actionButton_Click(object sender, RoutedEventArgs e)
        {
            Execute();
        }
    }
}

Parser ported into the main program

public void ExecuteAddressParse()
{
    AddressParser.AddressParser parser = new AddressParser.AddressParser();
    var input = inputTextBox.Text;

    var result = parser.ParseAddress(input);

    if (result == null)
    {
        outputTextBlock.Text = "ERROR. Input could not be parsed.";
    }
    else
    {
        outputTextBlock.Text = (result.StreetLine + ", " + result.City + ", " + result.State + "  " + result.Zip);
    }
}       

private void actionButton_Click(object sender, RoutedEventArgs e)
{
    ExecuteAddressParse();
}

ParseAddress method

public AddressParseResult ParseAddress(string input)
{
    if (!string.IsNullOrWhiteSpace(input))
    {
        var match = addressRegex.Match(input.ToUpperInvariant());
        if (match.Success)
        {
            var extracted = GetApplicableFields(match);
            return new AddressParseResult(Normalize(extracted));
        }
    }

    return null;
}

RegEx matching method

private static void InitializeRegex()
{
    var suffixPattern = new Regex(
        string.Join(
            "|",
            new [] {
                string.Join("|", suffixes.Keys), 
                string.Join("|", suffixes.Values.Distinct())
            }),
        RegexOptions.Compiled);

    var statePattern = 
        @"\b(?:" + 
        string.Join(
            "|",
            new [] {
                string.Join("|", states.Keys.Select(x => Regex.Escape(x))),
                string.Join("|", states.Values)
            }) +
        @")\b";

    var directionalPattern =
        string.Join(
            "|",
            new [] {
                string.Join("|", directionals.Keys),
                string.Join("|", directionals.Values),
                string.Join("|", directionals.Values.Select(x => Regex.Replace(x, @"(\w)", @"$1\.")))
            });

    var zipPattern = @"\d{5}(?:-?\d{4})?";

    var numberPattern =
        @"(
            ((?<NUMBER>\d+)(?<SECONDARYNUMBER>(-[0-9])|(\-?[A-Z]))(?=\b))    # Unit-attached
            |(?<NUMBER>\d+[\-\ ]?\d+\/\d+)                                   # Fractional
            |(?<NUMBER>\d+-?\d*)                                             # Normal Number
            |(?<NUMBER>[NSWE]\ ?\d+\ ?[NSWE]\ ?\d+)                          # Wisconsin/Illinois
          )";

    var streetPattern =
        string.Format(
            CultureInfo.InvariantCulture,
            @"
                (?:
                  # special case for addresses like 100 South Street
                  (?:(?<STREET>{0})\W+
                     (?<SUFFIX>{1})\b)
                  |
                  (?:(?<PREDIRECTIONAL>{0})\W+)?
                  (?:
                    (?<STREET>[^,]*\d)
                    (?:[^\w,]*(?<POSTDIRECTIONAL>{0})\b)
                   |
                    (?<STREET>[^,]+)
                    (?:[^\w,]+(?<SUFFIX>{1})\b)
                    (?:[^\w,]+(?<POSTDIRECTIONAL>{0})\b)?
                   |
                    (?<STREET>[^,]+?)
                    (?:[^\w,]+(?<SUFFIX>{1})\b)?
                    (?:[^\w,]+(?<POSTDIRECTIONAL>{0})\b)?
                  )
                )
            ",
            directionalPattern,
            suffixPattern);

    var rangedSecondaryUnitPattern =
        @"(?<SECONDARYUNIT>" +
        string.Join("|", rangedSecondaryUnits.Keys) +
        @")(?![a-z])";
    var rangelessSecondaryUnitPattern =
        @"(?<SECONDARYUNIT>" +
        string.Join(
            "|",
            string.Join("|", rangelessSecondaryUnits.Keys)) +
        @")\b";
    var allSecondaryUnitPattern = string.Format(
        CultureInfo.InvariantCulture,
        @"
            (
                (:?
                    (?: (?:{0} \W*)
                        | (?<SECONDARYUNIT>\#)\W*
                    )
                    (?<SECONDARYNUMBER>[\w-]+)
                )
                |{1}
            ),?
        ",
         rangedSecondaryUnitPattern,
         rangelessSecondaryUnitPattern);

    var cityAndStatePattern = string.Format(
        CultureInfo.InvariantCulture,
        @"
            (?:
                (?<CITY>[^\d,]+?)\W+
                (?<STATE>{0})
            )
        ",
        statePattern);
    var placePattern = string.Format(
        CultureInfo.InvariantCulture,
        @"
            (?:{0}\W*)?
            (?:(?<ZIP>{1}))?
        ",
        cityAndStatePattern,
        zipPattern);

    var addressPattern = string.Format(
        CultureInfo.InvariantCulture,
        @"
            ^
            # Special case for APO/FPO/DPO addresses
            (
                [^\w\#]*
                (?<STREETLINE>.+?)
                (?<CITY>[AFD]PO)\W+
                (?<STATE>A[AEP])\W+
                (?<ZIP>{4})
                \W*
            )
            |
            # Special case for PO boxes
            (
                \W*
                (?<STREETLINE>(P[\.\ ]?O[\.\ ]?\ )?BOX\ [0-9]+)\W+
                {3}
                \W*
            )
            |
            (
                [^\w\#]*    # skip non-word chars except # (eg unit)
                (  {0} )\W*
                   {1}\W+
                (?:{2}\W+)?
                   {3}
                \W*         # require on non-word chars at end
            )
            $           # right up to end of string
        ",
        numberPattern,
        streetPattern,
        allSecondaryUnitPattern,
        placePattern,
        zipPattern);
    addressRegex = new Regex(
        addressPattern,
        RegexOptions.Compiled | 
        RegexOptions.Singleline | 
        RegexOptions.IgnorePatternWhitespace);
}

3 Answers:

Answer 0: (score: 5)

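Does the regex work when the RegexOptions.Compiled flag is omitted?

The answer is yes.

Why?

It seems the Regex compiler is slow for some (certain?) large patterns.

This is a trade-off you have to make.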
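
One way to check this trade-off on your own machine is to time construction plus the first match with and without the flag. The sketch below is only an illustration: the Pattern constant is a small hypothetical stand-in, not the generated address pattern, so substitute the real addressPattern string to get meaningful numbers.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class CompiledFlagDemo
{
    // Hypothetical stand-in for the large generated address pattern.
    public const string Pattern =
        @"^\s*(?<NUMBER>\d+)\W+(?<STREET>[^,]+),\s*(?<ZIP>\d{5})\s*$";

    // Builds the regex with the given options, runs one match,
    // and returns the total elapsed time for both steps.
    public static TimeSpan TimeFirstMatch(RegexOptions options, string input, out bool success)
    {
        var watch = Stopwatch.StartNew();
        var regex = new Regex(Pattern, options);
        success = regex.IsMatch(input);
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        const string input = "100 SOUTH STREET, 12345";
        foreach (var options in new[] { RegexOptions.None, RegexOptions.Compiled })
        {
            bool ok;
            TimeSpan elapsed = TimeFirstMatch(options, input, out ok);
            Console.WriteLine("{0,-10} matched={1} elapsed={2} ms",
                options, ok, elapsed.TotalMilliseconds);
        }
    }
}
```

Both variants should report the same match result; only the elapsed time is expected to differ, and by how much depends on the pattern and the runtime.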

Answer 1: (score: 1)

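Some of your regex's sub-expressions are off (as @Justin Morgan mentioned). This is usually the result of joining reusable regex fragments together, which makes me cringe.

But if you are going to use this approach, it is always a good idea to print out the actual regex after it has been constructed. Then, after formatting it, test it on samples, independently of your main program. It is much easier to fix that way.
If you see a suspect sub-expression, try to make the match fail at that point, or, more generally, try inserting a failure near the end of the sample. If it takes more than the blink of an eye to fail, then it is backtracking severely.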
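
A minimal sketch of that workflow follows: print the assembled pattern, then feed it a sample that matches almost to the end and see how long the failure takes. The pattern and samples here are hypothetical fragments, not the full address regex. The match-timeout constructor overload is an assumption about your environment, since it requires .NET 4.5 or later.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class FailProbe
{
    // Hypothetical fragment in the style of the generated pattern.
    public const string Pattern =
        @"^ (?<CITY>[^\d,]+?) \W+ (?<STATE>\b(?:AL|AK)\b) $";

    // Returns true if the match attempt finished within the limit,
    // false if the engine gave up with a timeout.
    public static bool FinishesWithin(string pattern, string input, TimeSpan limit, out bool matched)
    {
        matched = false;
        var regex = new Regex(
            pattern,
            RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline,
            limit);
        try
        {
            matched = regex.IsMatch(input);
            return true;
        }
        catch (RegexMatchTimeoutException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Step 1: look at the regex that was actually built.
        Console.WriteLine(Pattern);

        // Step 2: a good sample, and the same sample with a failure
        // inserted near the end.
        foreach (var sample in new[] { "MOBILE AL", "MOBILE AL !" })
        {
            bool matched;
            var watch = Stopwatch.StartNew();
            bool finished = FinishesWithin(Pattern, sample, TimeSpan.FromSeconds(1), out matched);
            Console.WriteLine("\"{0}\": finished={1} matched={2} elapsed={3} ms",
                sample, finished, matched, watch.ElapsedMilliseconds);
        }
    }
}
```

If the failing sample reports finished=False, or takes noticeably longer than the matching one, that fragment is a good candidate for closer inspection.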

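Backtracking is not a bad thing. It has a huge upside: without it, some things simply would not match. The trick is to isolate the sub-expressions that do not affect the outcome relative to what is around them, and then restrict them from backtracking.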
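
In .NET, one way to restrict a sub-expression from backtracking is an atomic group, written (?>...). The sketch below uses a deliberately tiny, hypothetical pattern (not taken from the address regex) in which making the repeated group atomic does not change which inputs match, but does stop the engine from retrying every possible split of a long run of word characters. The timeout overload again assumes .NET 4.5 or later.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class AtomicGroupDemo
{
    // Nested quantifiers: on a near-miss input the engine may try
    // every way of splitting the word characters between iterations.
    public const string Backtracking = @"^(\w+\s?)*$";

    // Same language, but each iteration is atomic: once \w+\s? has
    // matched, the engine may not give characters back from it.
    public const string Atomic = @"^(?>\w+\s?)*$";

    // Runs one match attempt with a time limit and describes the outcome.
    public static string Probe(string pattern, string input, TimeSpan limit)
    {
        var regex = new Regex(pattern, RegexOptions.None, limit);
        var watch = Stopwatch.StartNew();
        try
        {
            bool ok = regex.IsMatch(input);
            return (ok ? "match" : "no match") + " in " + watch.ElapsedMilliseconds + " ms";
        }
        catch (RegexMatchTimeoutException)
        {
            return "timed out after " + watch.ElapsedMilliseconds + " ms";
        }
    }

    public static void Main()
    {
        string good = "123 MAIN ST";
        string nearMiss = new string('A', 40) + "!"; // fails only at the last character

        foreach (var pattern in new[] { Backtracking, Atomic })
        {
            Console.WriteLine(pattern);
            Console.WriteLine("  good:      " + Probe(pattern, good, TimeSpan.FromMilliseconds(500)));
            Console.WriteLine("  near miss: " + Probe(pattern, nearMiss, TimeSpan.FromMilliseconds(500)));
        }
    }
}
```

Whether the first pattern actually times out depends on the runtime version and its optimizations, so treat the timings as something to observe rather than a guaranteed result.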

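I went to the USPS site and grabbed a few sample states, suffixes, directionals and secondary units, enough to generate the address regex. Below is a cleaned-up version of the regex generated from your code.

Good luck!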

 ^
   # Special case for APO/FPO/DPO addresses
   (
      [^\w\#]*
      (?<STREETLINE> .+? )
      (?<CITY> [AFD] PO )
      \W+
      (?<STATE> A [AEP] )
      \W+
      (?<ZIP> \d{5} (?: -? \d{4} )? )
      \W*
   )
 |         
   # Special case for PO boxes
   (
      \W*
      (?<STREETLINE> ( P [\.\ ]? O [\.\ ]? \  )? BOX \  [0-9]+ )
      \W+
      (?:
          (?:
              (?<CITY> [^\d,]+? )
              \W+
              (?<STATE>
                 \b
                 (?:AL|AK|AS|AZ|AR|Alabama|Alaska|American Samoa|Arizona|Arkansas)
                 \b
              )
          )
          \W*
      )?
      (?:
          (?<ZIP> \d{5} (?: -? \d{4} )? )
      )?
      \W*
   )
 |          
   (
       [^\w\#]*    # skip non-word chars except # (eg unit)
       (
         (
              (
                (?<NUMBER> \d+ )
                (?<SECONDARYNUMBER> (-[0-9]) | (\-?[A-Z]) )
                (?=\b)
              )                                                  # Unit-attached
           |          
             (?<NUMBER> \d+ [\-\ ]? \d+ \/ \d+ )                 # Fractional
           |
             (?<NUMBER> \d+ -? \d* )                             # Normal Number
           |
             (?<NUMBER>[NSWE]\ ?\d+\ ?[NSWE]\ ?\d+)              # Wisconsin/Illinois
         )
       )
       \W*

       (?:
           # special case for addresses like 100 South Street
           (?:
               (?<STREET>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N\.|E\.|S\.|W\.|N\.E\.|S\.E\.|N\.W\.|S\.W\.)
               \W+
               (?<SUFFIX>ALLEY|ALY|ALLY|ALLEE|ALLEY|ALY)
               \b
           )
         |
           (?:
               (?<PREDIRECTIONAL>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N\.|E\.|S\.|W\.|N\.E\.|S\.E\.|N\.W\.|S\.W\.)
               \W+
           )?
           (?:
                (?<STREET> [^,]* \d )
                (?:
                   [^\w,]*
                   (?<POSTDIRECTIONAL>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N\.|E\.|S\.|W\.|N\.E\.|S\.E\.|N\.W\.|S\.W\.)
                   \b
                )
             |
                (?<STREET> [^,]+ )
                (?:
                    [^\w,]+
                    (?<SUFFIX>ALLEY|ALY|ALLY|ALLEE|ALLEY|ALY)
                    \b
                )
                (?:
                    [^\w,]+
                    (?<POSTDIRECTIONAL>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N\.|E\.|S\.|W\.|N\.E\.|S\.E\.|N\.W\.|S\.W\.)
                    \b
                )?
             |
                (?<STREET> [^,]+? )
                (?:
                    [^\w,]+
                    (?<SUFFIX>ALLEY|ALY|ALLY|ALLEE|ALLEY|ALY)
                    \b
                )?
                (?:
                    [^\w,]+
                    (?<POSTDIRECTIONAL>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N\.|E\.|S\.|W\.|N\.E\.|S\.E\.|N\.W\.|S\.W\.)
                    \b
                )?
           )
       )           

       \W+        

       (?:      
           (
               (
                  :?
                  (?:
                      (?:
                         (?<SECONDARYUNIT>APT|BLDG|DEPT|FL|HNGR|LOT|PIER|RM|SLIP|SPC|STOP|STE|TRLR|UNIT)
                         (?! [a-z] )
                         \W*
                       )
                    |
                       (?<SECONDARYUNIT> \# )
                       \W*
                  )
                  (?<SECONDARYNUMBER> [\w-]+ )
               )
             |
               (?<SECONDARYUNIT>BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR)
               \b
           )
           ,?
           \W+
       )?

       (?:
           (?:
               (?<CITY> [^\d,]+? )
               \W+
               (?<STATE>
                  \b
                  (?:AL|AK|AS|AZ|AR|Alabama|Alaska|American Samoa|Arizona|Arkansas)
                  \b
               )
           )
           \W*
       )?

       (?:
           (?<ZIP> \d{5} (?: -? \d{4} )? )
       )?

       \W*         # require on non-word chars at end
   )
 $           # right up to end of string

C# code

   public static void InitializeRegex()
    {
        Dictionary<string, string> suffixes = new Dictionary<string, string>()
        {
          {"ALLEY",  "ALLEE"},
          {"ALY",  "ALLEY"},
          {"ALLY",  "ALY"},
        };

        var suffixPattern = new Regex(
            string.Join(
                "|",
                new[] {
            string.Join("|", suffixes.Keys.ToArray()), 
            string.Join("|", suffixes.Values.Distinct().ToArray())
        }),
            RegexOptions.Compiled);

        //Console.WriteLine("\n"+suffixPattern);

        Dictionary<string, string> states = new Dictionary<string, string>()
        {
           {"AL", "Alabama"},
           {"AK", "Alaska"},
           {"AS",  "American Samoa"},
           {"AZ",  "Arizona"},
           {"AR", "Arkansas"}
        };

        var statePattern =
            @"\b(?:" +
            string.Join(
                "|",
                new[] {
            string.Join("|", states.Keys.Select(x => Regex.Escape(x)).ToArray()),
            string.Join("|", states.Values.ToArray())
        }) +
            @")\b";

        //Console.WriteLine("\n" + statePattern);

        Dictionary<string, string> directionals = new Dictionary<string, string>()
        {
           {"North", "N" },
           {"East", "E" },
           {"South", "S" },
           {"West", "W" },
           {"Northeast", "NE" },
           {"Southeast", "SE" },
           {"Northwest", "NW" },
           {"Southwest", "SW" }
        };

        var directionalPattern =
            string.Join(
                "|",
                new[] {
            string.Join("|", directionals.Keys.ToArray()),
            string.Join("|", directionals.Values.ToArray()),
            string.Join("|", directionals.Values.Select(x => Regex.Replace(x, @"(\w)", @"$1\.")).ToArray())
        });

        //Console.WriteLine("\n" + directionalPattern);

        var zipPattern = @"\d{5}(?:-?\d{4})?";

        //Console.WriteLine("\n" + zipPattern);

        var numberPattern =
            @"(
                ((?<NUMBER>\d+)(?<SECONDARYNUMBER>(-[0-9])|(\-?[A-Z]))(?=\b))    # Unit-attached
                |(?<NUMBER>\d+[\-\ ]?\d+\/\d+)                                   # Fractional
                |(?<NUMBER>\d+-?\d*)                                             # Normal Number
                |(?<NUMBER>[NSWE]\ ?\d+\ ?[NSWE]\ ?\d+)                          # Wisconsin/Illinois
             )";

        //Console.WriteLine("\n" + numberPattern);

        var streetPattern =
            string.Format(
                CultureInfo.InvariantCulture,
                @"
                    (?:
                      # special case for addresses like 100 South Street
                      (?:(?<STREET>{0})\W+
                         (?<SUFFIX>{1})\b)
                      |
                      (?:(?<PREDIRECTIONAL>{0})\W+)?
                      (?:
                        (?<STREET>[^,]*\d)
                        (?:[^\w,]*(?<POSTDIRECTIONAL>{0})\b)
                       |
                        (?<STREET>[^,]+)
                        (?:[^\w,]+(?<SUFFIX>{1})\b)
                        (?:[^\w,]+(?<POSTDIRECTIONAL>{0})\b)?
                       |
                        (?<STREET>[^,]+?)
                        (?:[^\w,]+(?<SUFFIX>{1})\b)?
                        (?:[^\w,]+(?<POSTDIRECTIONAL>{0})\b)?
                      )
                    )
                ",
                directionalPattern,
                suffixPattern);

        //Console.WriteLine("\n" + streetPattern);


        Dictionary<string, string> rangedSecondaryUnits = new Dictionary<string, string>()
        {
            {"APT",  "APARTMENT"},
            {"BLDG", "BUILDING"}, 
            {"DEPT", "DEPARTMENT"}, 
            {"FL",   "FLOOR"}, 
            {"HNGR", "HANGAR"}, 
            {"LOT",  "LOT"}, 
            {"PIER", "PIER"}, 
            {"RM",   "ROOM"}, 
            {"SLIP", "SLIP"}, 
            {"SPC",  "SPACE"}, 
            {"STOP", "STOP"}, 
            {"STE",  "SUITE"}, 
            {"TRLR", "TRAILER"}, 
            {"UNIT", "UNIT"} 
        };
        var rangedSecondaryUnitPattern =
            @"(?<SECONDARYUNIT>" +
            string.Join("|", rangedSecondaryUnits.Keys.ToArray()) +
            @")(?![a-z])";

        //Console.WriteLine("\n" + rangedSecondaryUnitPattern);


        Dictionary<string, string> rangelessSecondaryUnits = new Dictionary<string, string>()
        {
            {"BSMT", "BASEMENT"},
            {"FRNT", "FRONT"},
            {"LBBY", "LOBBY"},
            {"LOWR", "LOWER"},
            {"OFC",  "OFFICE"},
            {"PH",   "PENTHOUSE"},
            {"REAR", "REAR"},
            {"SIDE", "SIDE"},
            {"UPPR", "UPPER"}
        };

        var rangelessSecondaryUnitPattern =
            @"(?<SECONDARYUNIT>" +
            string.Join("|", rangelessSecondaryUnits.Keys.ToArray()) +
            @")\b";

        //Console.WriteLine("\n" + rangelessSecondaryUnitPattern);

        var allSecondaryUnitPattern = string.Format(
            CultureInfo.InvariantCulture,
            @"
                (
                    (:?
                        (?: (?:{0} \W*)
                            | (?<SECONDARYUNIT>\#)\W*
                        )
                        (?<SECONDARYNUMBER>[\w-]+)
                    )
                    |{1}
                ),?
            ",
             rangedSecondaryUnitPattern,
             rangelessSecondaryUnitPattern);

        //Console.WriteLine("\n" + allSecondaryUnitPattern);

        var cityAndStatePattern = string.Format(
            CultureInfo.InvariantCulture,
            @"
                (?:
                    (?<CITY>[^\d,]+?)\W+
                    (?<STATE>{0})
                )
            ",
            statePattern);

        //Console.WriteLine("\n" + cityAndStatePattern);

        var placePattern = string.Format(
            CultureInfo.InvariantCulture,
            @"
                (?:{0}\W*)?
                (?:(?<ZIP>{1}))?
            ",
            cityAndStatePattern,
            zipPattern);

        //Console.WriteLine("\n" + placePattern);

        var addressPattern = string.Format(
            CultureInfo.InvariantCulture,
            @"
                ^
                # Special case for APO/FPO/DPO addresses
                (
                    [^\w\#]*
                    (?<STREETLINE>.+?)
                    (?<CITY>[AFD]PO)\W+
                    (?<STATE>A[AEP])\W+
                    (?<ZIP>{4})
                    \W*
                )
                |
                # Special case for PO boxes
                (
                    \W*
                    (?<STREETLINE>(P[\.\ ]?O[\.\ ]?\ )?BOX\ [0-9]+)\W+
                    {3}
                    \W*
                )
                |
                (
                    [^\w\#]*    # skip non-word chars except # (eg unit)
                    (  {0} )\W*
                       {1}\W+
                    (?:{2}\W+)?
                       {3}
                    \W*         # require on non-word chars at end
                )
                $           # right up to end of string
            ",
            numberPattern,
            streetPattern,
            allSecondaryUnitPattern,
            placePattern,
            zipPattern);

        Console.WriteLine("\n-----------------------------\n\n" + addressPattern);

        var addressRegex = new Regex(
            addressPattern,
            RegexOptions.Compiled |
            RegexOptions.Singleline |
            RegexOptions.IgnorePatternWhitespace);

    }

Answer 2: (score: 0)

A gradual increase in resource usage like this is the smoking gun of catastrophic backtracking. Basically, if you have something like this part:

(?<CITY>[^\d,]+?)\W+

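...then there will be ambiguity about which piece of the input matches which part of the pattern. Almost everything that matches \W can also match [^\d,]. If the input fails to match on the first pass, the engine will go back and try different permutations of those two groups, chewing up resources.

For example, say the "city" part of the input is followed by a whole bunch of spaces. A long string of spaces will match [^\d,]+?\W+, so it is unclear whether the spaces belong to the CITY group. Given the lazy/greedy behavior of those quantifiers, the engine will first try putting only the city name into [^\d,]+? and all of the spaces into \W+. Then it will move on and try to match the rest of the input.

If the rest of the input matches on the first try, great. But if the match fails, it has to go back and try again, this time with one of the spaces matched by [^\d,]+? and captured as part of the CITY group. If that fails, it will try again with two spaces, and so on.

You usually see this become a problem with nested quantifiers, e.g. something like ([ABC]+)*. I do not see anything like that in your pattern, but I may have missed something among all the string.Format calls. My guess is simply that it is such a long pattern, with so many quantifiers and alternations to backtrack over (and so many groups to store), that even a single level of iteration is killing you. I would bet you take the biggest performance hit on input strings that match most of the pattern but fail to match all of it.

Compiling the regex might help in this situation, and you should do so. But when your application gets a thousand (or more) hits at once, I doubt that will cut it. There will also be certain input strings that cause a great deal of backtracking and hit you even harder in terms of performance. My biggest suggestion is to find and fix the ambiguities in the pattern.

Look for places where many quantifiers such as * and + sit close to one another, and make sure there are clear, non-optional delimiters between them (e.g. the \d+-?\d* from your NUMBER group would work better as \d+(-\d*)?, or better yet \d+(-\d+)?\b). Finally, make sure the delimiters cannot match the tokens next to them. For a simple example, something like \W+\ \W+ will drag on forever if you feed it a long string of spaces.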
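
To see what the suggested NUMBER rewrite changes in practice, the sketch below runs the original and the tightened sub-pattern over a few hypothetical inputs. It only compares what each pattern captures; it is not a benchmark, and the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NumberPatternDemo
{
    // Sub-pattern from the NUMBER group in the question.
    public const string Original = @"\d+-?\d*";

    // Tightened form suggested above: the hyphen is only consumed
    // when digits follow it, and the number must end at a word boundary.
    public const string Tightened = @"\d+(-\d+)?\b";

    // Returns the text of the first match, or null when nothing matches.
    public static string FirstMatch(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        foreach (var input in new[] { "123 MAIN", "123-45 MAIN", "123- MAIN", "123A MAIN" })
        {
            Console.WriteLine("{0,-12} original=[{1}] tightened=[{2}]",
                input,
                FirstMatch(Original, input) ?? "null",
                FirstMatch(Tightened, input) ?? "null");
        }
    }
}
```

Note that the tightened form is also stricter: it no longer swallows a dangling hyphen, and it refuses a number that runs straight into a letter, so check that this is acceptable for unit-attached addresses before adopting it.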
