Regular expression does not match

Time: 2014-10-26 16:03:19

Tags: java regex web-scraping jsoup

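I'm trying to write a small program that extracts information from a website. I only want the text between two strings, "ORIGIN" and "//". The code runs without errors, but for some reason nothing is printed to the screen. Can anyone point out what I'm doing wrong?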

import java.io.IOException;
import java.io.PrintStream; 
import java.io.FileOutputStream;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.regex.*;


class main {
    public static void main(String[] args) throws IOException {

        Document doc = Jsoup.connect("http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=293762&db=nuccore&dopt=genbank&extrafeat=976&fmt_mask=0&retmode=html&withmarkup=on&log$=seqview&maxplex=3&maxdownloadsize=1000000").get();

        String text = doc.text();
        String pattern1 = "ORIGIN";  
        String pattern2 = "//";
        String regexString = Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2);

        Pattern pattern = Pattern.compile(regexString, Pattern.MULTILINE); 
        Matcher matcher = pattern.matcher(text);


        while (matcher.find()) {
            String textInBetween = matcher.group(1); 
        }

        Pattern p = Pattern.compile(Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2));
        Matcher m = p.matcher(text);
        while (m.find()) {
            System.out.println(m.group(1));
        }

    }
}

2 answers:

Answer 0 (score: 1)

You need to use the DOTALL flag so that the dot also matches any line breaks between the two markers:

Pattern pattern = Pattern.compile(Pattern.quote(pattern1) + "(.*?)" + 
                            Pattern.quote(pattern2), Pattern.DOTALL);
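
To try the fix without a network call, here is a minimal sketch that applies the same DOTALL pattern to a hard-coded string. The class name `OriginExtractor`, the helper `between`, and the sample text are assumptions made for illustration; they are not part of the original program, and the sample only stands in for the downloaded page text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OriginExtractor {
    // Collects every span found between startMarker and endMarker.
    // DOTALL lets "." cross line breaks, so multi-line spans are matched.
    static List<String> between(String text, String startMarker, String endMarker) {
        Pattern p = Pattern.compile(
                Pattern.quote(startMarker) + "(.*?)" + Pattern.quote(endMarker),
                Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> results = new ArrayList<>();
        while (m.find()) {
            // trim() drops the line breaks adjacent to the markers
            results.add(m.group(1).trim());
        }
        return results;
    }

    public static void main(String[] args) {
        // Hypothetical sample, not real data from the page in the question
        String sample = "LOCUS demo\nORIGIN\n 1 acgtacgtac\n 11 ttgacc\n//\n";
        for (String s : between(sample, "ORIGIN", "//")) {
            System.out.println(s);
        }
    }
}
```

In the original program, `text` from `doc.text()` would be passed in place of `sample`.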

Answer 1 (score: 0)

You have to compile both patterns with the DOTALL modifier:

Pattern pattern = Pattern.compile(regexString, Pattern.MULTILINE | Pattern.DOTALL); 
Pattern p = Pattern.compile(Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2), Pattern.DOTALL);

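This modifier lets the dot (.) match every character, including line terminators. Without it, the dot matches every character except line terminators. MULTILINE on its own does not change this; it only affects where ^ and $ match.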
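
The difference between the flags can be checked with a small sketch like the one below. The class `DotFlagDemo`, the helper `matches`, and the test pattern `A.*B` are illustrative assumptions, not code from the question. The last line uses the embedded flag `(?s)`, which is the inline form of DOTALL in `java.util.regex`.

```java
import java.util.regex.Pattern;

class DotFlagDemo {
    // Reports whether "A.*B" is found in the input under the given flags.
    static boolean matches(String input, int flags) {
        return Pattern.compile("A.*B", flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "A\nB"; // the markers are separated by a line break

        System.out.println(matches(input, 0));                 // false: "." stops at the line break
        System.out.println(matches(input, Pattern.MULTILINE)); // false: only ^ and $ are affected
        System.out.println(matches(input, Pattern.DOTALL));    // true: "." matches the line break
        // Embedded flag form, equivalent to passing Pattern.DOTALL
        System.out.println(Pattern.compile("(?s)A.*B").matcher(input).find()); // true
    }
}
```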