Download a file and get its original filename with htmlunit?

Time: 2015-04-25 13:18:34

Tags: java web-scraping htmlunit

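I want to create a small application that downloads and installs/upgrades all of my Windows software, but more and more download sites rely on annoying JavaScript systems.

I tried PhantomJS, but it cannot download files. I just tried htmlunit, which can either download the file or get the original filename, but I cannot do both at once. My code below does not work.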

package com.example.simpledownloader;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.logging.Level;
import org.apache.commons.io.FilenameUtils;

public class Main {

    public static void main(String[] args) throws Exception {
        testDownload();
    }

    public static void testDownload() throws IOException {

        // Turn htmlunit warnings off.
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);

        // Init web client and navigate to the first page.
        final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_31);
        final HtmlPage page1 = webClient.getPage("http://www.videohelp.com/software/AV-Splitter");

        // Get the anchor element.
        String xpath1 = "//*[@id=\"Main\"]/div/div/div[11]/table[1]/tbody/tr[3]/td[2]/a[6]";
        HtmlElement element = (HtmlElement) page1.getByXPath(xpath1).get(0);

        // Extract the original filename from the filepath.
        String filepath = element.click().getUrl().getFile();
        String filename = FilenameUtils.getName(filepath);
        System.out.println(filename);

        // Download the file.
        InputStream inputStream = element.click().getWebResponse().getContentAsStream();
        FileOutputStream outputStream = new FileOutputStream(filename);
        int read;
        byte[] bytes = new byte[1024];
        while ((read = inputStream.read(bytes)) != -1) {
            outputStream.write(bytes, 0, read);
        }

        // Close the webclient.
        webClient.close();
    }
}

Getting the filename works, but the download does not.

I get this error:

Exception in thread "main" java.lang.RuntimeException: java.io.FileNotFoundException: C:\Users\Admin\AppData\Local\Temp\htmlunit46883917986334906.tmp (The system cannot find the file specified)

Could it be because I have already clicked once to get the filename?

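1 answer:

Answer 0: (score: 1)

Indeed, you are clicking twice: once to get the filename and once more to get the content.

How about clicking once and reusing the resulting page: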

// Click once and reuse the resulting page for both steps.
// Requires: import com.gargoylesoftware.htmlunit.Page;
Page page2 = element.click();

// Extract the original filename from the filepath.
String filepath = page2.getUrl().getFile();
String filename = FilenameUtils.getName(filepath);
System.out.println(filename);

// Download the file; try-with-resources closes both streams.
try (InputStream inputStream = page2.getWebResponse().getContentAsStream();
     FileOutputStream outputStream = new FileOutputStream(filename)) {
    int read;
    byte[] bytes = new byte[1024];
    while ((read = inputStream.read(bytes)) != -1) {
        outputStream.write(bytes, 0, read);
    }
}
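
The two steps (derive the name from the URL, then write the stream to disk) can also be sketched with the standard library alone. This is a minimal sketch, not part of the original answer: the class `DownloadSketch` and the helpers `filenameFromUrl` and `save` are hypothetical names, and the example URL and bytes are made up so that it runs without htmlunit. Note that `URL.getFile()` includes the query string, so the sketch uses `getPath()` instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class DownloadSketch {

    // Hypothetical helper: last path segment of the URL, without any "?query" part.
    public static String filenameFromUrl(URL url) {
        String path = url.getPath(); // getPath() excludes the query string
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Hypothetical helper: copies the whole stream to target, replacing an
    // existing file, and closes the stream afterwards. Returns bytes written.
    public static long save(InputStream in, Path target) throws IOException {
        try (InputStream stream = in) {
            return Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up URL and content standing in for the clicked page's response.
        URL url = URI.create("http://example.com/files/setup.exe?mirror=1").toURL();
        String filename = filenameFromUrl(url);

        Path target = Files.createTempDirectory("download-sketch").resolve(filename);
        long written = save(new ByteArrayInputStream(new byte[] {1, 2, 3}), target);

        System.out.println(filename + " (" + written + " bytes)"); // setup.exe (3 bytes)
    }
}
```

With htmlunit, the same helpers would presumably be fed from the single click result, for example `filenameFromUrl(page2.getUrl())` and `save(page2.getWebResponse().getContentAsStream(), Paths.get(filename))`, so the link is still clicked only once.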