How do I scrape a website that has a terms-acceptance page?

Time: 2013-09-25 20:41:17

Tags: java scrape

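I'm new to writing code, and I'm trying to write a program that scrapes a particular website. The problem is that the site first shows a page where you have to accept its terms of use and privacy policy. You can see it here: http://cpdocket.cp.cuyahogacounty.us/

I need to get past this page somehow, and I don't know how. I'm writing my code in Java, and so far I have working code that retrieves the source of any website. That code is: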

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Scraper takes a URL string and returns the HTML source of that page.
public class Scraper {

  private final String url; // the address of the page to scrape (per instance, not static)

  // constructor
  public Scraper(String url) {
    this.url = url;
  }

  // scrapeWebsite fetches the page at url and returns its source as a String.
  // Ideally the result is saved so that another method can parse it.
  public String scrapeWebsite() throws IOException {
    URL urlconnect = new URL(url); // builds the URL from the field
    URLConnection connection = urlconnect.openConnection(); // opens a connection to that URL
    StringBuilder a = new StringBuilder(); // accumulates the page source
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
      String inputLine; // holds the current line
      // append each line, with its line break, until the stream ends
      while ((inputLine = in.readLine()) != null) {
        a.append(inputLine).append('\n');
      }
    }
    return a.toString();
  }
}

Any advice on how to do this would be greatly appreciated.

I'm rewriting this based on some Ruby code. That code is:

def initializeSession()
    ## SETUP # POST headers
    post_header = Hash.new()
    post_header['Host'] = 'cpdocket.cp.cuyahogacounty.us'
    post_header['User-Agent'] = 'Mozilla/5.0 (Windows NT 5.1; rv:20.0) Gecko/20100101 Firefox/20.0'
    post_header['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
    post_header['Accept-Language'] = 'en-US,en;q=0.5'
    post_header['Accept-Encoding'] = 'gzip, deflate'
    post_header['X-Requested-With'] = 'XMLHttpRequest'
    post_header['X-MicrosoftAjax'] = 'Delta=true'
    post_header['Cache-Control'] = 'no-cache'
    post_header['Content-Type'] = 'application/x-www-form-urlencoded; charset=utf-8'
    post_header['Referer'] = 'http://cpdocket.cp.cuyahogacounty.us/Search.aspx' # may have to alter this per request
    # post_header['Content-Length'] = '12197'
    post_header['Connection'] = 'keep-alive'
    post_header['Pragma'] = 'no-cache'



    # STEP  # set up simulated browser and make first request
    #browser = SimBrowser.new()
    #logname = 'log.txt'
    #s = Scribe.new(logname)
    session_cookie = 'ASP.NET_SessionId'
    url = 'http://cpdocket.cp.cuyahogacounty.us/'
    @browser.http_get(url)
    #puts browser.get_body() # debug
    puts 'DEBUG: session cookie: ' + @browser.get_cookie_var(session_cookie)
    @log.slog('DEBUG: home page response code: expected 200, actual ' + @browser.get_response().code)
    # s.flog('### HOME PAGE RESPONSE')
    # s.flog(browser.get_body()) # debug

    # STEP # send our acceptance of the terms of service
    data = {
      'ctl00$SheetContentPlaceHolder$btnYes' => 'Yes',
      '__EVENTARGUMENT'=>'',
      '__EVENTTARGET'=>'',
      '__EVENTVALIDATION'=>'/wEWBwKc78CQCQLn3/HqCQLZw/fZCgLipuudAQK42duKDQL33NjnAwKn6+K4CIM3TSmrbrsn2xBRJf2DRwg01Vsbdk+oJV9lhG/in+xD',
      '__VIEWSTATE'=>'/wEPDwUKLTI4MzA1ODM0OA9kFgJmD2QWAgIDD2QWDgIDD2QWAgIBD2QWCAIBDxYCHgRUZXh0BQ9BbmRyZWEgRi4gUm9jY29kAgMPFgIfAAUfQ3V5YWhvZ2EgQ291bnR5IENsZXJrIG9mIENvdXJ0c2QCBQ8PFgIeB1Zpc2libGVoZGQCBw8PFgIfAWhkZAIHDw9kFgIeB29uY2xpY2sFGmphdmFzY3JpcHQ6d2luZG93LnByaW50KCk7ZAILDw9kFgIfAgUiamF2YXNjcmlwdDpvbkNsaWNrPXdpbmRvdy5jbG9zZSgpO2QCDw8PZBYCHwIFRmRpc3BsYXlQb3B1cCgnaF9EaXNjbGFpbWVyLmFzcHgnLCdteVdpbmRvdycsMzcwLDIyMCwnbm8nKTtyZXR1cm4gZmFsc2VkAhMPZBYCZg8PFgIeC05hdmlnYXRlVXJsBRMvVE9TLmFzcHg/aXNwcmludD1ZZGQCFQ8PZBYCHwIFRWRpc3BsYXlQb3B1cCgnaF9RdWVzdGlvbnMuYXNweCcsJ215V2luZG93JywzNzAsMzcwLCdubycpO3JldHVybiBmYWxzZWQCFw8WAh8ABQYxLjAuNTRkZEnXSWiVLEPsDmlc7dX4lH/53vU1P1SLMCBNASGt4T3B'
    }
    #post_header['Referer'] = url
    @browser.http_post(url, data, post_header)
    @log.slog('DEBUG: accept terms response code:  expected 200, actual ' + @browser.get_response().code)
    @log.flog('### TOS ACCEPTANCE RESPONSE')
    # @log.flog(@browser.get_body()) # debug    
  end

Can this be done in Java?

1 answer:

Answer 0: (score: 0)

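If you don't understand how to do this, the best way to learn is to do it by hand while watching the traffic in FireBug (on Firefox) or the equivalent tool for IE, Chrome, or Safari.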

Your code must replicate whatever happens at the protocol level when a user accepts the terms and conditions manually.
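As a rough illustration of replicating that exchange in Java, here is a minimal sketch using only the standard library. It assumes the flow matches the Ruby code in the question: a GET that starts the session, followed by a POST of the form fields to the same URL. The class and method names (`TermsAcceptanceSketch`, `extractHiddenField`, `encodeForm`, `acceptTerms`) are made up for this example. Reading `__VIEWSTATE` and `__EVENTVALIDATION` from the fetched page instead of hard-coding them is also an assumption, on the theory that those values can change. The extra request headers from the Ruby code are left out and may turn out to be needed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TermsAcceptanceSketch {

    // Returns the value of a hidden <input> such as __VIEWSTATE, or "" if absent.
    // Assumes the name attribute is written before the value attribute.
    static String extractHiddenField(String html, String name) {
        Pattern p = Pattern.compile(
            "<input[^>]*name=\"" + Pattern.quote(name) + "\"[^>]*value=\"([^\"]*)\"");
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    // URL-encodes one key or value; UTF-8 is always available, so failure is unexpected.
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds an application/x-www-form-urlencoded body from the given fields.
    static String encodeForm(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (body.length() > 0) body.append('&');
            body.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return body.toString();
    }

    // Reads the whole response body as text.
    static String readBody(HttpURLConnection conn) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) sb.append(line).append('\n');
        }
        return sb.toString();
    }

    // GETs the terms page, then POSTs the acceptance form to the same URL.
    static String acceptTerms(String url) throws IOException {
        // A default CookieManager carries the session cookie from the GET to the POST.
        CookieHandler.setDefault(new CookieManager());
        String home = readBody((HttpURLConnection) new URL(url).openConnection());

        // Field names are taken from the Ruby code in the question.
        Map<String, String> form = new LinkedHashMap<>();
        form.put("ctl00$SheetContentPlaceHolder$btnYes", "Yes");
        form.put("__EVENTTARGET", "");
        form.put("__EVENTARGUMENT", "");
        form.put("__VIEWSTATE", extractHiddenField(home, "__VIEWSTATE"));
        form.put("__EVENTVALIDATION", extractHiddenField(home, "__EVENTVALIDATION"));

        HttpURLConnection post = (HttpURLConnection) new URL(url).openConnection();
        post.setRequestMethod("POST");
        post.setDoOutput(true);
        post.setRequestProperty("Content-Type",
            "application/x-www-form-urlencoded; charset=utf-8");
        try (OutputStream out = post.getOutputStream()) {
            out.write(encodeForm(form).getBytes(StandardCharsets.UTF_8));
        }
        return readBody(post);
    }
}
```

Calling `TermsAcceptanceSketch.acceptTerms("http://cpdocket.cp.cuyahogacounty.us/")` would return whatever page the server sends back after the POST. Whether the site accepts this exact request has not been checked here, so compare it against the request your browser sends.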

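You must also be aware that the UI presented to the user may not be sent directly as HTML; it may be built dynamically by JavaScript that would normally run in the browser. If you are not prepared to emulate a browser fully, to the point of maintaining a DOM and executing JavaScript, then this may not be possible.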
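One quick way to tell which case applies is to look for the form controls in the raw HTML your scraper already downloads. The sketch below is illustrative only: `StaticFormCheck` and `missingControls` are hypothetical names, and the check is a plain substring search for `name="..."`, so it assumes double-quoted attributes. If every control is present in the static source, a plain HTTP client is probably enough. If they are missing, the page is likely assembled by script.

```java
import java.util.ArrayList;
import java.util.List;

final class StaticFormCheck {

    // Returns the control names that do NOT appear as name="..." in the raw HTML.
    static List<String> missingControls(String html, String... names) {
        List<String> missing = new ArrayList<>();
        for (String n : names) {
            if (!html.contains("name=\"" + n + "\"")) {
                missing.add(n);
            }
        }
        return missing;
    }
}
```

For the site in the question, the names to look for would be the ones the Ruby code posts, such as `ctl00$SheetContentPlaceHolder$btnYes`, `__VIEWSTATE`, and `__EVENTVALIDATION`.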