Parsing special characters in Java

Time: 2016-07-30 18:35:58

Tags: java html

I have a program that collects some HTML data.

// Assumed import: escapeHtml is provided by Commons Lang 2.x
import org.apache.commons.lang.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileWriter;
import java.util.Iterator;

public class Uni_Extract {
    public static void main(String[] args) throws Exception {
        System.out.println("Started");

        String csvFile = "C://Users/Kennedy/Desktop/university.csv";
        FileWriter writer = new FileWriter(csvFile);

        for (int i=2; i<=2; i++){
            String url = "http://www.4icu.org/reviews/index"+i+".htm";
            Document doc = Jsoup.connect(url).userAgent("Mozilla").get();

            // Each td.i cell holds a university link; the next cell holds the country flag
            Elements cells = doc.select("td.i");

            Iterator<Element> iterator = cells.iterator();
            while (iterator.hasNext()) {
                Element cell = iterator.next();

                String university = Jsoup.parse((cell.select("a").text())).text();
                university = StringEscapeUtils.escapeHtml(university);
                String country = cell.nextElementSibling().select("img").attr("alt");
                System.out.printf("country : %s, university : %s %n", country, university);
            }
        }
        writer.flush();
        writer.close();
    }
}

However, when my program encounters certain special characters, it returns the raw HTML code for them. How can I parse them?

For example, it returns "Azerbaycan D&ouml;vlet Pedaqoji Universiteti", with "&ouml;" in place of the special character "ö". How can I fix this and other similar cases?

1 answer:

Answer 0: (score: 1)

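After simplifying the code slightly and removing the call to escapeHtml, everything seems to work. Jsoup's text() already returns decoded text, so escapeHtml was presumably turning "ö" back into the entity "&ouml;". Here is my code and the relevant output line: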

import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

import java.io.*;
import java.util.*;

public class Test
{
    public static void main(String[] args) throws IOException {
        System.out.println("Started");

        String url = "http://www.4icu.org/reviews/index2.htm";
        Document doc = Jsoup.connect(url).userAgent("Mozilla").get();

        Elements cells = doc.select("td.i");

        Iterator<Element> iterator = cells.iterator();  
        while (iterator.hasNext()) {
            Element cell = iterator.next();

            String university = Jsoup.parse((cell.select("a").text())).text();
            String country = cell.nextElementSibling().select("img").attr("alt");
            System.out.printf("country : %s, university : %s %n", country, university);
        }
    }
}

Output:

...
country : Azerbaijan, university : Azerbaycan Dövlet Aqrar Universiteti
...
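
The question's code opens a FileWriter for university.csv but never writes anything to it. If the goal is to save the decoded names, one more thing may be worth checking: FileWriter(String) uses the platform's default charset, which is not necessarily UTF-8 on older Java versions, so "ö" could still be mangled on the way to disk. Below is a minimal sketch of writing the rows with an explicit UTF-8 encoding, assuming that is what the CSV file is for. The class name CsvUtf8Sketch and the helpers quote and writeRows are hypothetical names made up for this illustration, and the sample row is taken from the output above.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original question or answer.
public class CsvUtf8Sketch {

    // Wrap a field in double quotes and double any embedded quotes,
    // so commas or quotes inside a university name do not break the row.
    public static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Write one "country,university" line per row, always encoded as UTF-8.
    // Each row is assumed to be a two-element array: {country, university}.
    public static void writeRows(Path csvFile, List<String[]> rows) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(csvFile, StandardCharsets.UTF_8)) {
            for (String[] row : rows) {
                writer.write(quote(row[0]) + "," + quote(row[1]));
                writer.newLine();
            }
        } // try-with-resources flushes and closes the writer
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for the real CSV path (assumption for the demo).
        Path csvFile = Files.createTempFile("university", ".csv");

        List<String[]> rows = new ArrayList<>();
        // \u00f6 is the character "ö"
        rows.add(new String[] {"Azerbaijan", "Azerbaycan D\u00f6vlet Aqrar Universiteti"});

        writeRows(csvFile, rows);
        System.out.println("Wrote " + csvFile);
    }
}
```

In the scraping loop, the printf call could be paired with adding {country, university} to the list, with writeRows called once after the loop. If the file is then opened in a spreadsheet program, that program also has to be told the file is UTF-8.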