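Regular expressions for strings with multiple lines and a special structure

Time: 2013-05-31 12:37:00

Tags: java regex

I'm using Java and want to build two regular expressions for two different scenarios: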

1:

STARTText blah, blah
\    next line with more text, but the leading backslash
\    next line with more text, but the leading backslash
\    next line with more text, but the leading backslash

The block continues until the first line that no longer starts with a backslash.

2:

Now you will see the following links for the items:
1111 leading 4 digits and then some text
2565 leading 4 digits and then some text
8978 leading 4 digits and then some text

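This block ends with an additional empty line after the 8978 line. I also know that the block of lines with leading digits repeats 10 times and then ends.

Filtering individual lines is possible, but how do I match across the line breaks between them? For the first block I don't even know when or how to end it, and I also have to search for the backslash. My preferred approach is a single self-contained expression that I can also pass to replaceAll().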

4 Answers:

Answer 0: (score: 1)

First regex:

Pattern regex = Pattern.compile(
    "^          # Start of line\n" +
    "STARTText  # Match this text\n" +
    ".*\\r?\\n  # Match whatever follows on the line plus (CR)LF\n" +
    "(?:        # Match...\n" +
    " ^\\\\     # Start of line, then a backslash\n" +
    " .*\\r?\\n # Match whatever follows on the line plus (CR)LF\n" +
    ")*         # Repeat as needed", 
    Pattern.MULTILINE | Pattern.COMMENTS);
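
A minimal sketch of how the first pattern might be used, both to extract a block and to remove it with replaceAll() as the question asks. The regex is the same one written without COMMENTS mode; the class name StartBlockDemo, the helper names, and the sample input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartBlockDemo {
    // The first regex from this answer, compacted onto one line.
    static final Pattern START_BLOCK =
        Pattern.compile("^STARTText.*\\r?\\n(?:^\\\\.*\\r?\\n)*", Pattern.MULTILINE);

    // Returns the first STARTText block including its backslash lines, or null.
    static String firstBlock(String input) {
        Matcher m = START_BLOCK.matcher(input);
        return m.find() ? m.group() : null;
    }

    // Removes every STARTText block from the input.
    static String removeBlocks(String input) {
        return START_BLOCK.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input =
            "intro line\n" +
            "STARTText blah, blah\n" +
            "\\    next line with more text\n" +
            "\\    another continuation line\n" +
            "regular line\n";
        System.out.print(firstBlock(input));
        System.out.print(removeBlocks(input));
    }
}
```

Every line has to end with a line break to be matched, so a final backslash line without a trailing newline would be left out of the block.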

Second regex:

Pattern regex = Pattern.compile(
    "(?:        # Match...\n" +
    " ^         # Start of line\n" +
    " \\d{4}\\b # Match exactly four digits\n" +
    " .*\\r?\\n # Match whatever follows on the line plus (CR)LF\n" +
    ")+         # Repeat as needed (at least once)", 
    Pattern.MULTILINE | Pattern.COMMENTS);
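
As a rough sketch, the second pattern could be run in a find() loop to collect each run of four-digit lines separately. The class name DigitBlockDemo, the helper findBlocks, and the sample text are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitBlockDemo {
    // The second regex from this answer, compacted onto one line.
    static final Pattern DIGIT_BLOCK =
        Pattern.compile("(?:^\\d{4}\\b.*\\r?\\n)+", Pattern.MULTILINE);

    // Collects every run of consecutive lines that start with four digits.
    static List<String> findBlocks(String input) {
        List<String> blocks = new ArrayList<>();
        Matcher m = DIGIT_BLOCK.matcher(input);
        while (m.find()) {
            blocks.add(m.group());
        }
        return blocks;
    }

    public static void main(String[] args) {
        String input =
            "Now you will see the following links for the items:\n" +
            "1111 leading 4 digits and then some text\n" +
            "2565 leading 4 digits and then some text\n" +
            "\n" +
            "8978 leading 4 digits and then some text\n" +
            "\n";
        for (String block : findBlocks(input)) {
            System.out.println("--- block ---");
            System.out.print(block);
        }
    }
}
```

An empty line ends a run because it does not start with four digits, so each block separated by blank lines comes back as its own match.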

Answer 1: (score: 1)

Regex 1:

/^STARTText.*?(\r?\n)(?:^\\.*?\1)+/m

Live demo: http://www.rubular.com/r/G35kIn3hQ4

Regex 2:

/^.*?(\r?\n)(?:^\d{4}\s.*?\1)+/m

Live demo: http://www.rubular.com/r/TxFbBP1jLJ

Edit:

Java Demo 1: http://ideone.com/BPNrm6

Regex 1 in Java:

(?m)^STARTText.*?(\\r?\\n)(?:^\\\\.*?\\1)+

Java Demo 2: http://ideone.com/TQB8Gs

Regex 2 in Java:

(?m)^.*?(\\r?\\n)(?:^\\d{4}\\s.*?\\1)+
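
The two Java patterns above capture the first line's terminator in group 1 and reuse it through the backreference \1, so every following line must end with the same terminator. A small sketch of how they might be exercised follows; the class name BackrefDemo, the helper firstMatch, and the sample strings are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefDemo {
    // Both patterns are copied from this answer as Java string literals.
    static final Pattern REGEX_1 =
        Pattern.compile("(?m)^STARTText.*?(\\r?\\n)(?:^\\\\.*?\\1)+");
    static final Pattern REGEX_2 =
        Pattern.compile("(?m)^.*?(\\r?\\n)(?:^\\d{4}\\s.*?\\1)+");

    // Returns the first match of the pattern in the input, or null.
    static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Windows-style line endings: group 1 captures "\r\n".
        String startBlock = "STARTText blah\r\n\\ one\r\n\\ two\r\nrest\r\n";
        System.out.print(firstMatch(REGEX_1, startBlock));

        // Unix-style line endings: group 1 captures "\n".
        String linkBlock = "Now you will see:\n1111 a\n2565 b\n\nend\n";
        System.out.print(firstMatch(REGEX_2, linkBlock));
    }
}
```

Because of the trailing +, both patterns need at least one continuation line, and regex 2 includes the header line before the digit lines in its match. A block with mixed line endings stops at the first line whose terminator differs from the captured one.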

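Answer 2: (score: 1)

In both cases I use a zero-width lookahead assertion such as (?=^[^\\]) to find where the block ends: the match stops just before the next line that no longer continues the block.

  • (?= opens the zero-width lookahead, which requires the enclosed pattern to match at this position but does not consume any characters
  • ^[^\\] matches the start of a line followed by any character other than \
  • ) closes the assertion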
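
To illustrate that the lookahead matches a position rather than text, here is a small sketch that runs the assertion on its own and records where it succeeds. The class name LookaheadDemo, the helper recordStarts, and the sample input are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Only the assertion: a line start followed by a character other than a backslash.
    static final Pattern NON_BACKSLASH_LINE =
        Pattern.compile("(?=^[^\\\\])", Pattern.MULTILINE);

    // Returns the offsets of all lines that start with something other than a backslash.
    static List<Integer> recordStarts(String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = NON_BACKSLASH_LINE.matcher(input);
        while (m.find()) {
            // m.start() equals m.end() here: nothing was consumed.
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String input = "A1\n\\ a\nB2\n\\ b\n";
        System.out.println(recordStarts(input));
    }
}
```

Since the assertion consumes nothing, the character it inspected is still available to start the next match, which is what lets consecutive blocks be found one after another.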

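Part 1

This matches all of the text for part 1: each match is a first line followed by any number of \ lines. It relies on the MULTILINE and DOTALL flags being set.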

^([^\\].*?)(?=^[^\\])

    Java Code Example:
    import java.util.regex.Pattern;
    import java.util.regex.Matcher;
    class Module1{
      public static void main(String[] asd){
        // A Java string literal cannot span lines, so the lines are joined with "\n";
        // each literal backslash in the text is written as "\\".
        String sourcestring = "STARTFirstText blah, blah\n"
          + "\\    1next line with more text, but the leading backslash\n"
          + "\\    2next line with more text, but the leading backslash\n"
          + "\\    3next line with more text, but the leading backslash\n"
          + "STARTsecondText blah, blah\n"
          + "\\    4next line with more text, but the leading backslash\n"
          + "\\    5next line with more text, but the leading backslash\n"
          + "\\    6next line with more text, but the leading backslash\n"
          + "foo";
        // MULTILINE lets ^ match at every line start; DOTALL lets . cross line breaks.
        Pattern re = Pattern.compile("^([^\\\\].*?)(?=^[^\\\\])",Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
        Matcher m = re.matcher(sourcestring);
        int mIdx = 0;
        while (m.find()){
          // Group 0 is the whole match; group 1 is the captured block.
          for( int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++ ){
            System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
          }
          mIdx++;
        }
      }
    }

    $matches Array:
    (
        [0] => Array
            (
                [0] => STARTFirstText blah, blah
    \    1next line with more text, but the leading backslash
    \    2next line with more text, but the leading backslash
    \    3next line with more text, but the leading backslash

                [1] => STARTsecondText blah, blah
    \    4next line with more text, but the leading backslash
    \    5next line with more text, but the leading backslash
    \    6next line with more text, but the leading backslash

            )

        [1] => Array
            (
                [0] => STARTFirstText blah, blah
    \    1next line with more text, but the leading backslash
    \    2next line with more text, but the leading backslash
    \    3next line with more text, but the leading backslash

                [1] => STARTsecondText blah, blah
    \    4next line with more text, but the leading backslash
    \    5next line with more text, but the leading backslash
    \    6next line with more text, but the leading backslash

            )

    )

Part 2

This matches a first line followed by any number of lines that begin with a digit.

^([^\d].*?)(?=^[^\d])

Example

import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Module1{
  public static void main(String[] asd){
    // A Java string literal cannot span lines, so the lines are joined with "\n".
    String sourcestring = "First you will see the following links for the items:\n"
      + "1111 leading 4 digits and then some text\n"
      + "2565 leading 4 digits and then some text\n"
      + "8978 leading 4 digits and then some text\n"
      + "\n"
      + "Second you will see the following links for the items:\n"
      + "2222 leading 4 digits and then some text\n"
      + "3333 leading 4 digits and then some text\n"
      + "4444 leading 4 digits and then some text";
    // MULTILINE lets ^ match at every line start; DOTALL lets . cross line breaks.
    Pattern re = Pattern.compile("^([^\\d].*?)(?=^[^\\d])",Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()){
      // Group 0 is the whole match; group 1 is the captured block.
      for( int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++ ){
        System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}

$matches Array:
(
    [0] => Array
        (
            [0] => First you will see the following links for the items:
1111 leading 4 digits and then some text
2565 leading 4 digits and then some text
8978 leading 4 digits and then some text

            [1] => 

        )

    [1] => Array
        (
            [0] => First you will see the following links for the items:
1111 leading 4 digits and then some text
2565 leading 4 digits and then some text
8978 leading 4 digits and then some text

            [1] => 

        )

)

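Answer 3: (score: 0)

Use '\\' for a backslash, '\r|\r\n' for a line break, and '\d{4}' for four digits:

.*(\r|\r\n)

(your first blahblah line)

\\.*(\r|\r\n)

(your backslash lines)

((\d{4}.*(\r|\r\n))+(\r|\r\n))+

(your block of 4-digit lines ending with an empty line, the whole thing repeated with +)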
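
The last fragment above could be tried out in Java roughly as follows. This sketch uses \r?\n for the line break instead of (\r|\r\n) so that Unix-style \n endings also match; that substitution, the added ^ anchor, the class name BlockOfBlocksDemo, and the sample input are assumptions and not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlockOfBlocksDemo {
    // One or more groups, each made of 4-digit lines followed by an empty line.
    static final Pattern BLOCKS =
        Pattern.compile("(?:(?:^\\d{4}.*\\r?\\n)+\\r?\\n)+", Pattern.MULTILINE);

    // Returns the whole run of digit blocks, or null if none is found.
    static String firstRun(String input) {
        Matcher m = BLOCKS.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input =
            "header\n" +
            "1111 leading 4 digits and then some text\n" +
            "2565 leading 4 digits and then some text\n" +
            "\n" +
            "3333 leading 4 digits and then some text\n" +
            "\n" +
            "tail\n";
        System.out.print(firstRun(input));
    }
}
```

If the number of repetitions is known in advance, as the question suggests, the final + could be replaced by a fixed quantifier such as {10}.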