Question

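I want to extract text from an HTML file, specifically the text placed between paragraph (p) and link (a href) tags. I want to do it without Java regex and without an HTML parser. I thought of the following: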

while ((word = reader.readLine()) !=null) { //iterate to the end of the file
    if(word.contains("<p>")) { //catching p tag
        while(!word.contains("</p>") { //iterate to the end of that tag
            try { //start writing
                out.write(word);
            } catch (IOException e) {
            }
        }
    }
}

But it doesn't work. The code looks valid to me. How can the reader catch the "p" and "a href" tags?

Answer 1

The problem starts when you have something like <p>blah</p> on a single line. A simple solution is to change every < to \n<, something like this:

boolean insidePar = false;
String line;
while ((line = reader.readLine()) != null) {
    // Break the line before every tag, so <p>blah</p> on one line becomes separate pieces
    for (String word : line.replaceAll("<", "\n<").split("\n")) {
        if (word.contains("<p>")) {
            insidePar = true;
        } else if (word.contains("</p>")) {
            insidePar = false;
        }
        if (insidePar) {
            // write the word
        }
    }
}

I would also suggest using a parser library, as @HovercraftFullOfEels said.

Edit: I have updated the code so it is a bit closer to a working version, but you will probably run into more problems along the way.
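
For reference, here is a minimal sketch of another way to do it without regex or a parser, using only String.indexOf. It is not the code from the question: the class name TagTextExtractor and the method extract are made up for illustration. It assumes the whole file has already been read into one String, that tag names are lowercase, and that the same tag is not nested inside itself. Any inner tags (for example <b> inside a <p>) are left in the extracted text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, named here only for illustration.
class TagTextExtractor {

    // Returns the text between each <tag ...> and the next </tag>.
    // Assumes lowercase tag names and no nesting of the same tag.
    static List<String> extract(String html, String tag) {
        List<String> result = new ArrayList<>();
        String open = "<" + tag;
        String close = "</" + tag + ">";
        int pos = 0;
        while (true) {
            int start = html.indexOf(open, pos);
            if (start < 0) break;
            int afterName = start + open.length();
            if (afterName >= html.length()) break;
            // Skip look-alike tags such as <pre> when searching for <p>.
            char c = html.charAt(afterName);
            if (c != '>' && !Character.isWhitespace(c)) {
                pos = afterName;
                continue;
            }
            // The content begins right after the '>' that ends the opening tag.
            int contentStart = html.indexOf('>', afterName);
            if (contentStart < 0) break;
            contentStart++;
            int end = html.indexOf(close, contentStart);
            if (end < 0) break; // unclosed tag: stop instead of looping forever
            result.add(html.substring(contentStart, end).trim());
            pos = end + close.length();
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p>blah</p>\n<pre>skip</pre>\n<a href=\"http://jsoup.org/\">jsoup</a>";
        System.out.println(extract(html, "p"));
        System.out.println(extract(html, "a"));
    }
}
```

Because the search runs over the whole text instead of line by line, it does not matter whether <p> and </p> sit on the same line or several lines apart. Real-world HTML (comments, uppercase tags, a '>' inside an attribute value) will still break it, which is why a parser library is the safer choice.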

Answer 2

I think it would be easier to use a library. Use this one: http://jsoup.org/. It can also parse a String.
