How to run crawler4j.jar with the MyCrawler.java and Controller.java files

Asked: 2013-01-19 10:38:57

Tags: web-crawler crawler4j

I am new to crawlers and I want to run my first crawl program. I have three files:

  1. crawler4j-3.1.jar
  2. MyCrawler.java
  3. Controller.java

When I type `javac -cp crawler4j-3.1.jar MyCrawler.java Controller.java` in the terminal, I get the following errors:

    MyCrawler.java:32: cannot find symbol
    symbol  : method getText()
    location: class edu.uci.ics.crawler4j.crawler.Page
    String text = page.getText();
                      ^
    MyCrawler.java:33: cannot find symbol
    symbol  : method getURLs()
    location: class edu.uci.ics.crawler4j.crawler.Page
    ArrayList links = page.getURLs();
                          ^
    Controller.java:5: cannot find symbol
    symbol  : constructor CrawlController(java.lang.String)
    location: class edu.uci.ics.crawler4j.crawler.CrawlController
    CrawlController controller = new CrawlController("/data/crawl/root");
                                 ^
    3 errors

Where am I going wrong? Thanks.

1 answer:

Answer 0 (score: 1)

You should write a Controller and a Crawler. Your compiler errors suggest you are following an example written for an older crawler4j API: in crawler4j 3.x, `CrawlController` no longer has a `String` constructor (it takes a `CrawlConfig`, a `PageFetcher`, and a `RobotstxtServer`), and page content is reached through `getParseData()` rather than `getText()`.

Here is the Controller.java file:

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class Controller {
        public static void main(String[] args) throws Exception {
            // Folder where intermediate crawl data is stored
            String crawlStorageFolder = "/crawler/testdata";
            int numberOfCrawlers = 4;

            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder(crawlStorageFolder);

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

            // The 3.x constructor takes the config, fetcher and robots.txt server --
            // there is no CrawlController(String) constructor anymore.
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

            controller.addSeed("http://cyesilkaya.wordpress.com/");
            controller.start(Crawler.class, numberOfCrawlers);
        }
    }

Here is the Crawler.java file:

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class Crawler extends WebCrawler {

        @Override
        public boolean shouldVisit(WebURL url) {
            // You can write your own filter here to decide whether
            // or not to crawl the incoming URL.
            return true;
        }

        @Override
        public void visit(Page page) {
            String url = page.getWebURL().getURL();
            System.out.println("Visited: " + url);

            // In crawler4j 3.x the page text is reached through
            // getParseData(), not the old getText() method.
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
                String text = htmlParseData.getText();
                // Do whatever you want with the crawled page
            }
        }
    }
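The `shouldVisit` filter above accepts every URL. A common refinement is to skip binary resources and stay on the seed's domain; the matching logic itself is plain string and regex work, so it can be sketched independently of crawler4j. The class and method names below are illustrative, not part of the library — in a real crawler you would run the same checks inside `shouldVisit(WebURL url)` using `url.getURL()`:

```java
import java.util.regex.Pattern;

// Stand-alone sketch of a URL filter (hypothetical helper class).
public class UrlFilter {

    // File extensions the crawler should not download
    private static final Pattern BINARY_EXTENSIONS =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|pdf)$");

    // Restrict the crawl to the seed's domain
    private static final String DOMAIN = "http://cyesilkaya.wordpress.com/";

    public static boolean shouldVisitUrl(String href) {
        String lower = href.toLowerCase();
        return !BINARY_EXTENSIONS.matcher(lower).matches()
                && lower.startsWith(DOMAIN);
    }

    public static void main(String[] args) {
        System.out.println(shouldVisitUrl("http://cyesilkaya.wordpress.com/about/"));  // true
        System.out.println(shouldVisitUrl("http://cyesilkaya.wordpress.com/logo.png")); // false
        System.out.println(shouldVisitUrl("http://example.com/"));                      // false
    }
}
```

Returning `false` from `shouldVisit` keeps the URL out of the frontier entirely, so a filter like this also keeps the crawl storage folder from filling up with image and archive downloads.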

Just run the Controller class (not the crawler class). Remember to compile both files with the crawler4j jar on the classpath, e.g. `javac -cp crawler4j-3.1.jar Controller.java Crawler.java`, and to include the jar (plus crawler4j's dependency jars, if your download does not bundle them) on the classpath when running as well.