How should I modify my selector to parse Google News search article titles, previews, and URLs?

Time: 2016-09-22 02:57:16

Tags: java parsing jsoup google-search-api

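I want to parse three things from a Google News search: 1) the article title, 2) the preview text, 3) the URL.

To do this, I need to adapt the selector to the structure of the results page.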

Elements links = Jsoup.connect(google + URLEncoder.encode(search , charset) + news).userAgent(userAgent).get().select( ".g>.r>.a");

The part I mainly need to change is the selector:

(".g>.r>.a")

How should I modify it?

Full code:

  public static void main(String[] args) throws UnsupportedEncodingException, IOException {

    String google = "http://www.google.com/search?q=";

    String search = "stackoverflow";

    String charset = "UTF-8";

    String news="&tbm=nws";


    String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!

    Elements links = Jsoup.connect(google + URLEncoder.encode(search , charset) + news).userAgent(userAgent).get().select( ".g>.r>.a");

    for (Element link : links) {
        String title = link.text();
        String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
        url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");

        if (!url.startsWith("http")) {
            continue; // Ads/news/etc.
        }
        System.out.println("Title: " + title);
        System.out.println("URL: " + url);
    }
}


1 Answer:

Answer 0 (score: 1)

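How to select the right elements (using Chrome)

First step: disable JavaScript in your browser (for convenience, use an add-on such as uMatrix) so that you see the same page jsoup receives.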

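Now right-click an element and choose Inspect, or press Ctrl + Shift + I to open the developer tools. When you hover over the source in the Elements tab, the corresponding element is highlighted in the rendered page. Right-clicking an element in the source offers Copy -> Copy selector. This is a good starting point, but the result is sometimes too strict. Here it yields the selector #rso > div:nth-child(3), i.e. the third direct child div of the element with id rso. That is too specific, so we generalize:

We select all direct child divs of the element with id rso: #rso > div.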

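Then we grab the title anchor with h3 > a; its text node and its href attribute give us the title and the URL.

Next we get the inner div with class st (div.st), whose text node contains the preview. If that div is missing, we skip the element.

By passing parameters with .data("key","value") on the request, we do not need to encode them manually.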
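
To make the last point concrete, here is a minimal sketch of the manual encoding that .data(...) saves us from, using only the JDK. The class QueryStringSketch and the helpers buildQuery and enc are hypothetical names introduced for illustration; they are not part of jsoup.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryStringSketch {

    // Hypothetical helper: URL-encodes a single key or value as UTF-8.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: joins the pairs into "k1=v1&k2=v2", encoding each part.
    static String buildQuery(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "stack overflow & jsoup"); // contains spaces and '&'
        params.put("tbm", "nws");
        // prints https://www.google.com/search?q=stack+overflow+%26+jsoup&tbm=nws
        System.out.println("https://www.google.com/search?" + buildQuery(params));
    }
}
```

Without the encoding step, the '&' inside the search term would be read as a parameter separator, which is the kind of mistake .data(...) avoids.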

Example code

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// The statements below belong inside a method, e.g. main.
String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36";
String searchTerm = "stackoverflow";
int numberOfResultpages = 2; // grabs first two pages of search results
String searchUrl = "https://www.google.com/search?";

for (int i = 0; i < numberOfResultpages; i++) {

    try {
        Document doc = Jsoup.connect(searchUrl)
                .userAgent(userAgent)
                .data("q", searchTerm)   // .data() encodes the values for us
                .data("tbm", "nws")      // restrict the search to news
                // "start" is a result offset, not a page index (assuming 10 results per page)
                .data("start", String.valueOf(i * 10))
                .referrer("https://www.google.com/")
                .get();

        for (Element result : doc.select("#rso > div")) {

            Element previewDiv = result.select("div.st").first();
            Element h3a = result.select("h3 > a").first();

            // skip blocks that lack a preview or a title link
            if (previewDiv == null || h3a == null) continue;

            String title = h3a.text();
            String url = h3a.attr("href");
            String preview = previewDiv.text();

            // just printing out title and link to demonstrate the approach
            System.out.println(title + " -> " + url + "\n\t" + preview);
        }

    } catch (IOException e) {
        // a failed request for one page should not abort the remaining pages
        e.printStackTrace();
    }
}

Output

Stack Overflow: Movie Magic -> https://geekdad.com/2016/09/stack-overflow-movie-magic-2/
    I got to visit the set of Kubo and the Two Strings and see some of the amazing work that went into creating the film. But well before the ...
Will StackOverflow Documentation Realize Its Lofty Goal? -> https://dzone.com/articles/will-stackoverflow-documentation-realize-its-lofty
    With the StackOverflow Documentation project now in beta, how close is it to realizing the lofty goals it has set forth for itself? Can it ever ...
Stack Overflow: Progress Report -> https://geekdad.com/2016/09/stack-overflow-progress-report/
    Of the books on my list, the only one I totally finished so far is Kidding Ourselves, which I included in this Stack Overflow. And that perhaps is an ...
....
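
The hrefs in the output above are already direct links. If Google instead returns the redirect form mentioned in the question ("http://www.google.com/url?q=<url>&sa=U&ei=<someKey>"), the target can be pulled out of the q parameter. The following is a sketch under that assumption, using only the JDK; the class RedirectUrlSketch and the helper extractTarget are hypothetical names, and the sample href is made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class RedirectUrlSketch {

    // Hypothetical helper: if href looks like ".../url?q=<target>&...",
    // returns the decoded target; otherwise returns href unchanged.
    static String extractTarget(String href) {
        String marker = "/url?";
        int markerPos = href.indexOf(marker);
        if (markerPos < 0) {
            return href; // assumed to be a direct link already
        }
        String query = href.substring(markerPos + marker.length());
        for (String pair : query.split("&")) {
            if (pair.startsWith("q=")) {
                try {
                    return URLDecoder.decode(pair.substring(2), "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    // UTF-8 support is mandatory on every JVM, so this should not happen.
                    throw new IllegalStateException(e);
                }
            }
        }
        return href; // no q parameter found
    }

    public static void main(String[] args) {
        String href = "http://www.google.com/url?q=https%3A%2F%2Fexample.com%2Fa%3Fb%3D1&sa=U";
        // prints https://example.com/a?b=1
        System.out.println(extractTarget(href));
    }
}
```

Unlike the substring between the first '=' and the first '&' used in the question, this looks the parameter up by name, so it does not depend on q being the first parameter.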