Question

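I am trying to write a Java service that fetches links from a database, visits each link, and downloads one specific image from each page, identified by a CSS class. (I have a list of WikiData links and must download each entity's picture; the link to that picture sits under a div with the class "fullImageLink".)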

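I tried the imagecrawler example* from Crawler4j, but it takes the link, visits the page, and starts downloading every image on that page and on every link it finds there, instead of just the image I need.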

* https://github.com/yasserg/crawler4j/tree/master/crawler4j-examples/crawler4j-examples-base/src/test/java/edu/uci/ics/crawler4j/examples/imagecrawler

Answer 1

import java.util.*;
import java.net.*;
import java.io.*;

import Torello.HTML.*;
import Torello.HTML.NodeSearch.*;
import Torello.HTML.Tools.*;
// See:  http://developer.torello.directory/JavaHTML/index.html

public class test
{
    public static void main(String[] argv) throws IOException
    {
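        // "wiki_url" is a placeholder; substitute the URL of the page to scrape.
        URL               mainURL  = new URL("wiki_url");

        // Fetch the page and parse it into a vector of HTML nodes.
        Vector<HTMLNode>  page     = HTMLPage.getPageTokens(mainURL, false);

        // Find <a> tags whose class attribute contains "fullImageLink" and resolve their href values.
        Vector<TagNode>   anchors  = InnerTagGet.all(page, "a", "class", TextComparitor.CONTAINS, "fullImageLink");
        Vector<URL>       aURLs    = Links.resolveHREFs(anchors, mainURL);

        // Find <img> tags whose class attribute contains "fullImageLink" and resolve their src values.
        Vector<TagNode>   images   = InnerTagGet.all(page, "img", "class", TextComparitor.CONTAINS, "fullImageLink");
        Vector<URL>       iURLs    = Links.resolveSRCs(images, mainURL);

        // Download the resolved image URLs into the given directory.
        ImageScraper.Results r = new ImageScraper(iURLs, "imgDownloadDir/").download();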
    }
}
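
For comparison, here is a minimal sketch of the same idea using only the Java standard library. It assumes the markup matches what the question describes: a div with the class "fullImageLink" wrapping an anchor whose href points at the full-size image. The class name FullImageLinkExtractor, its method names, and the sample HTML and URLs are hypothetical and chosen for illustration. Matching HTML with a regular expression is fragile, so a real service would probably want a proper HTML parser.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts only the image linked inside div.fullImageLink.
final class FullImageLinkExtractor {

    // A <div> whose class attribute contains "fullImageLink", immediately
    // followed by an <a> tag; group 1 captures that anchor's href value.
    private static final Pattern FULL_IMAGE_LINK = Pattern.compile(
        "<div\\b[^>]*\\bclass\\s*=\\s*\"[^\"]*\\bfullImageLink\\b[^\"]*\"[^>]*>"
            + "\\s*<a\\b[^>]*\\bhref\\s*=\\s*\"([^\"]+)\"",
        Pattern.CASE_INSENSITIVE);

    // Returns the absolute URI of the full-size image, if the page has one.
    public static Optional<URI> extract(String html, URI pageUri) {
        Matcher m = FULL_IMAGE_LINK.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        // Undo the most common HTML escape, then resolve relative and
        // protocol-relative ("//host/path") links against the page URI.
        String href = m.group(1).replace("&amp;", "&");
        return Optional.of(pageUri.resolve(href));
    }

    // Downloads a single image to the given file, replacing any existing file.
    public static void download(URI imageUri, Path target) throws IOException {
        try (InputStream in = imageUri.toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) {
        // Sample page URL and markup, made up for this demonstration.
        URI page = URI.create("https://commons.wikimedia.org/wiki/File:Example.jpg");
        String html = "<div class=\"fullImageLink\" id=\"file\">"
            + "<a href=\"//upload.wikimedia.org/a/b/Example.jpg\">"
            + "<img src=\"thumb.jpg\"></a></div>";
        extract(html, page).ifPresent(System.out::println);
    }
}
```

In a service, the page HTML would come from fetching each database link, and download would be called once per page with the URI that extract returns, so no other images or links on the page are touched.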

Downloading images from WikiData links by CSS class