Extracting information from text

Date: 2018-03-13 12:19:36

Tags: java regex nlp

I have the following text:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.              

Name                                 Group                       12345678        
ALEX A ALEX                                                                   
ID#                                  PUBLIC NETWORK                  
XYZ123456789                                                                  


Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

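I want to extract the ID value that appears under the ID# keyword in the text.

The problem is that in different text files the ID can be in a different position, for example in the middle of other text, as shown below: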

Lorem Ipsum is simply dummy text of                                          ID#             the printing and typesetting industry. Lorem Ipsum has been the industry's          
standard dummy text ever since the 1500s, when an unknown printer took a     XYZ123456789    galley of type and scrambled it to make a type specimen book.       

In addition, there can be extra lines between ID# and the value:

Lorem Ipsum is simply dummy text of                                          ID#             the printing and typesetting industry. Lorem Ipsum has been the industry's      
printing and typesetting industry. Lorem Ipsum has been the                                  printing and typesetting industry. Lorem Ipsum has been the 
standard dummy text ever since the 1500s, when an unknown printer took a     XYZ123456789    galley of type and scrambled it to make a type specimen book.

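Could you suggest a way to extract the ID# value in the cases above? Is there a standard technique for extracting this kind of information, for example regex or some approach built on top of regex? Can NLP be applied here?

2 Answers:

Answer 0: (score: 1)

Here's a suggestion off the top of my head. The general idea is to convert the source text into an array (or list) of lines, then iterate over them until you find the "ID#" token. Once you know the position of ID# within that line, iterate over the remaining lines to find some text at that same position. This example should work with the samples you've given, although any different layout could cause it to return the wrong value.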

String s = null; //replace with your source text (must not be null when split is called)
String idValue = null; //what we'll assign the ID value to

//format the string into lines
String[] lines = s.split("\\r?\\n"); //this handles both Windows and Unix-style line termination

//go through the lines looking for the ID# token and storing its horizontal position in the line
for (int i=0; i<lines.length; i++) {
    String line = lines[i];
    int startIndex = line.indexOf("ID#");

    //if we found the ID token, then go through the remaining lines starting from the next one down
    if (startIndex > -1) {
        for (int j=i+1; j<lines.length; j++) {
            line = lines[j];

            //check if this line is long enough
            if (line.length() > startIndex) {

                //remove everything prior to the index where the ID# token was
                line = line.substring(startIndex);

                //if the line starts with a space then it's not an ID
                if (!line.startsWith(" ")) {

                    //look for the first whitespace after the ID value we've found
                    int endIndex = line.indexOf(" ");

                    //if there's no end index, then the ID is at the end of the line
                    if (endIndex == -1) {
                        idValue = line;
                    } else {
                        //if there is an end index, then remove everything to just leave the ID value
                        idValue = line.substring(0, endIndex);
                    }

                    break;
                }
            }
        }

        break;
    }

}
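
For reference, the same column-scan idea can be wrapped in a method so it can be run against sample input. This is only a minimal sketch: the class name ColumnScanExtractor, the method extractId, and the shortened sample rows are made up for illustration, and it assumes the value starts exactly at the column where ID# starts.

```java
import java.util.Optional;

final class ColumnScanExtractor {

    // Hypothetical helper: returns the token that starts in the ID# column on a later line.
    static Optional<String> extractId(String text) {
        String[] lines = text.split("\\r?\\n");
        for (int i = 0; i < lines.length; i++) {
            int column = lines[i].indexOf("ID#");
            if (column < 0) {
                continue; // no ID# keyword on this line
            }
            for (int j = i + 1; j < lines.length; j++) {
                String line = lines[j];
                // Skip lines that are too short or blank at the ID# column.
                if (line.length() <= column || Character.isWhitespace(line.charAt(column))) {
                    continue;
                }
                // Read up to the next whitespace character or the end of the line.
                int end = column;
                while (end < line.length() && !Character.isWhitespace(line.charAt(end))) {
                    end++;
                }
                return Optional.of(line.substring(column, end));
            }
            return Optional.empty(); // ID# found, but nothing below it
        }
        return Optional.empty(); // no ID# keyword at all
    }

    public static void main(String[] args) {
        // Three fixed-width columns; the middle line is blank under ID#.
        String sample = String.join("\n",
                String.format("%-24s%-16s%s", "Lorem Ipsum is simply", "ID#", "the printing industry."),
                String.format("%-24s%-16s%s", "printing and typesetting", "", "has been the"),
                String.format("%-24s%-16s%s", "standard dummy text", "XYZ123456789", "galley of type."));
        System.out.println(extractId(sample).orElse("not found")); // prints XYZ123456789
    }
}
```

Returning an Optional instead of leaving idValue as null makes the "nothing found" case explicit to the caller. As with the original snippet, a line whose unrelated text happens to cross the ID# column would still be picked up by mistake.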

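Answer 1: (score: 1)

There seems to be no well-defined format for the ID value, so a single regular expression won't help, because there is almost nothing regular here.

You have to use two regular expressions to get the expected output. The first one is: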

(?m)^(.*)ID#.*([\s\S]*)

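It looks for ID# within a single line and captures two chunks of text. The first chunk is everything from the start of that line up to ID#; the second is everything that comes after the line on which ID# is found.

Then we take the length of the first capturing group. It gives us the column number at which we should start searching for the ID on the following lines: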

m.group(1).length();
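
To make the two capturing groups concrete, here is a small sketch that runs only the first pattern. The class FirstPassDemo and its method names are invented for the example; the pattern itself is the one given above. Because (.*) is greedy, a line containing ID# more than once would yield the column of the last occurrence.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstPassDemo {

    // Same first pattern as in the answer.
    static final Pattern ID_LINE = Pattern.compile("(?m)^(.*)ID#.*([\\s\\S]*)");

    // Column of ID# (length of group 1), or -1 when the keyword is absent.
    static int idColumn(String text) {
        Matcher m = ID_LINE.matcher(text);
        return m.find() ? m.group(1).length() : -1;
    }

    // Everything after the ID# line (group 2), or null when the keyword is absent.
    // The result starts with the line break that ends the ID# line.
    static String afterIdLine(String text) {
        Matcher m = ID_LINE.matcher(text);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // "some text" is padded to 20 characters, so ID# starts at column 20.
        String text = String.format("%-20sID#   more text\n%-20sXYZ123456789\n", "some text", "other text");
        System.out.println(idColumn(text)); // prints 20
        System.out.println(afterIdLine(text));
    }
}
```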

Then we build the second regex using this length:

(?m)^.{X}(?<!\S)\h{0,3}(\S+)

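Breakdown:

  • (?m) enables multiline mode
  • ^ matches the beginning of a line
  • .{X} matches the first X characters (where X is m.group(1).length())
  • (?<!\S) checks that the current position is not preceded by a non-whitespace character, i.e. it is preceded by whitespace or is at the very start of the input
  • \h{0,3} optionally matches up to 3 horizontal whitespace characters (in case the value is shifted to the right)
  • (\S+) captures one or more non-whitespace characters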
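
The breakdown above can be tried in isolation with a fixed column. In this sketch the class SecondPassDemo, its method names, and the sample lines are assumptions made for the demonstration; only the pattern comes from the answer. The value is deliberately shifted two spaces to the right of the column to show what \h{0,3} is for.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SecondPassDemo {

    // Builds the second pattern for a given column (the X in the breakdown).
    static Pattern valueAtColumn(int column) {
        return Pattern.compile("(?m)^.{" + column + "}(?<!\\S)\\h{0,3}(\\S+)");
    }

    // First token found at the column (or up to 3 spaces to its right), else null.
    static String firstValue(String remainder, int column) {
        Matcher m = valueAtColumn(column).matcher(remainder);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String remainder = "\n"
                + "filler    \n"                              // only whitespace at column 10: skipped
                + "left text   XYZ123456789 right text\n";    // value starts at column 12
        System.out.println(firstValue(remainder, 10)); // prints XYZ123456789
    }
}
```

The same tolerance is also a limit of the approach: an intermediate line whose unrelated text begins within three spaces to the right of the column would match first.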

Then we run this regex against the second capturing group of the previous regex:

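// requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
// string holds the source text
Matcher m = Pattern.compile("(?m)^(.*)ID#.*([\\s\\S]*)").matcher(string);
if (m.find()) {
    // group(1).length() is the column of ID#; group(2) is everything after the ID# line
    Matcher m1 = Pattern.compile("(?m)^.{" + m.group(1).length() + "}(?<!\\S)\\h{0,3}(\\S+)").matcher(m.group(2));
    if (m1.find())
        System.out.println(m1.group(1));
}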
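
For experimenting with different layouts, the two passes could be combined into one method that returns an empty result when either pattern fails. This is a hedged sketch rather than part of the original answer: TwoPassIdExtractor, extractId, and the row helper are hypothetical names, and the sample rows are shortened stand-ins for the texts in the question.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoPassIdExtractor {

    private static final Pattern ID_LINE = Pattern.compile("(?m)^(.*)ID#.*([\\s\\S]*)");

    static Optional<String> extractId(String text) {
        Matcher first = ID_LINE.matcher(text);
        if (!first.find()) {
            return Optional.empty(); // no ID# keyword
        }
        int column = first.group(1).length();
        Matcher second = Pattern
                .compile("(?m)^.{" + column + "}(?<!\\S)\\h{0,3}(\\S+)")
                .matcher(first.group(2));
        return second.find() ? Optional.of(second.group(1)) : Optional.empty();
    }

    // Hypothetical helper that lays out three fixed-width columns for the demo.
    static String row(String left, String middle, String right) {
        return String.format("%-30s%-16s%s", left, middle, right);
    }

    public static void main(String[] args) {
        String idOnOwnLine = "Name      Group\nALEX A ALEX\nID#       PUBLIC NETWORK\nXYZ123456789\n";
        String idInMiddle = String.join("\n",
                row("Lorem Ipsum is simply", "ID#", "the printing industry."),
                row("printing and typesetting", "", "has been the"),
                row("standard dummy text", "XYZ123456789", "galley of type."));
        System.out.println(extractId(idOnOwnLine).orElse("not found")); // prints XYZ123456789
        System.out.println(extractId(idInMiddle).orElse("not found"));  // prints XYZ123456789
    }
}
```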

