How do I read someone else's forum?

Time: 2010-01-13 21:03:18

Tags: ruby httpwebrequest mechanize screen-scraping

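A friend of mine runs a forum with many information-rich posts. Sometimes she wants to review the posts on her forum and draw conclusions from them. At the moment she does this by clicking through the forum and building a not necessarily accurate picture of the data in her head, from which she draws her conclusions. My idea today was to knock out a quick Ruby script that parses the necessary HTML and gives her a real picture of what the data says.

I am using Ruby's net/http library for the first time today, and I have run into a problem. While my browser has no trouble viewing my friend's forum, the method Net::HTTP.new("forumname.net") seems to produce the following error:

No connection could be made because the target machine actively refused it. - connect(2)

Googling this error, I have learned that it has something to do with MySQL (or something similar) not wanting nosy guys like me poking around remotely, for security reasons. That makes sense to me, but it makes me wonder: how is it that my browser gets to poke around on my friend's forum while my little Ruby script gets no poking rights? Is there some way for my script to tell the server that it is not a threat, and that I only want read access, not write access?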

Thanks guys,

Ž.

2 answers:

Answer 0 (score: 6)

Scraping a web site? Use mechanize:

#!/usr/bin/ruby1.8

require 'rubygems'
require 'mechanize'

agent = Mechanize.new   # named WWW::Mechanize in old mechanize releases
page = agent.get("http://xkcd.com")
page = page.link_with(:text=>'Forums').click
page = page.link_with(:text=>'Mathematics').click
page = page.link_with(:text=>'Math Books').click
#puts page.parser.to_html    # If you want to see the html you just got
posts = page.parser.xpath("//div[@class='postbody']")
for post in posts
  title = post.at_xpath('h3//text()').to_s
  author = post.at_xpath("p[@class='author']//a//text()").to_s
  body = post.xpath("div[@class='content']//text()").collect do |div|
    div.to_s
  end.join("\n")
  puts '-' * 40
  puts "title: #{title}"
  puts "author: #{author}"
  puts "body:", body
end

The first part of the output:

----------------------------------------
title: Math Books
author: Cleverbeans
body:
This is now the official thread for questions about math books at any level, fr\
om high school through advanced college courses.
I'm looking for a good vector calculus text to brush up on what I've forgotten.\
 We used Stewart's Multivariable Calculus as a baseline but I was unable to pur\
chase the text for financial reasons at the time. I figured some things may hav\
e changed in the last 12 years, so if anyone can suggest some good texts on thi\
s subject I'd appreciate it.
----------------------------------------
title: Re: Multivariable Calculus Text?
author: ThomasS
body:
The textbooks go up in price and new pretty pictures appear. However, Calculus \
really hasn't changed all that much.
If you don't mind a certain lack of pretty pictures, you might try something li\
ke Widder's Advanced Calculus from Dover. it is much easier to carry around tha\
n Stewart. It is also written in a style that a mathematician might consider no\
rmal. If you think that you might want to move on to real math at some point, i\
t might serve as an introduction to the associated style of writing.
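
Once the fields are extracted, they can be collected and summarized instead of printed. The following is only a sketch under assumptions: it presumes the scraping loop stores each post as a hash with :title and :author keys, the helper name posts_per_author is made up for illustration, and the third sample post is invented so that the counts are not all 1.

```ruby
# Hypothetical sample data. In practice each hash would be built inside the
# scraping loop; the last entry is invented for illustration.
posts = [
  { title: 'Math Books', author: 'Cleverbeans' },
  { title: 'Re: Multivariable Calculus Text?', author: 'ThomasS' },
  { title: 'Re: Multivariable Calculus Text?', author: 'Cleverbeans' }
]

# Counts posts per author and returns [author, count] pairs,
# most active author first, ties broken alphabetically.
def posts_per_author(posts)
  counts = Hash.new(0)
  posts.each { |post| counts[post[:author]] += 1 }
  counts.sort_by { |author, count| [-count, author] }
end

posts_per_author(posts).each do |author, count|
  puts "#{author}: #{count}"
end
```

The same pattern would work for any other field the loop extracts, such as counting posts per thread title.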

Answer 1 (score: 1)

Some sites can only be reached through the "www" subdomain, so that may be what is causing the problem.
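
One way to check this is to try both the bare host and its "www" variant and see which one accepts a connection. A refused connection is reported when the TCP connection is attempted, before any HTTP headers are sent, so it shows up as an exception rather than as an HTTP response. The sketch below is an assumption-laden illustration: the helper names candidate_hosts and first_reachable are hypothetical, and the 5-second timeouts are arbitrary.

```ruby
require 'net/http'

# Returns the given host followed by its alternative spelling
# (with "www." added, or removed if it was already there).
def candidate_hosts(host)
  alt = host.start_with?('www.') ? host.sub(/\Awww\./, '') : "www.#{host}"
  [host, alt]
end

# Tries each candidate in turn and returns [host, response] for the first
# one that answers, or nil if none could be reached.
def first_reachable(host, port = 80)
  candidate_hosts(host).each do |h|
    begin
      res = Net::HTTP.start(h, port, open_timeout: 5, read_timeout: 5) do |http|
        http.get('/')
      end
      return [h, res]
    rescue SystemCallError, SocketError, Net::OpenTimeout, Net::ReadTimeout => e
      # Errno::ECONNREFUSED is a SystemCallError; DNS failures are SocketErrors.
      warn "#{h}: #{e.class}: #{e.message}"
    end
  end
  nil
end

# Usage (performs real network requests):
#   host, res = first_reachable('forumname.net')
#   puts "#{host} answered with HTTP #{res.code}" if host
```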

To create a GET request, you would want to use the Get class:

require 'net/http'

url = URI.parse('http://www.forum.site/')
req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) {|http|
  http.request(req)
}
puts res.body

You may also need to set the user agent as an option at some point:

req = Net::HTTP::Get.new(url.path, {
  'User-Agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) ' \
                  'Gecko/20060111 Firefox/1.5.0.1'
})
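
Putting the two pieces together, a request with a custom user agent might look like the sketch below. The helper names build_get and fetch and the USER_AGENT string are placeholders chosen for illustration, not part of net/http. The sketch uses request_uri rather than path so that a URL without a trailing slash still yields "/" and any query string is kept.

```ruby
require 'net/http'
require 'uri'

# Placeholder value; substitute whatever browser string you want to present.
USER_AGENT = 'Mozilla/5.0 (compatible; forum-reader-sketch)'

# Builds a GET request for the URL without sending it.
# Returns the parsed URI and the request object.
def build_get(url_string, user_agent = USER_AGENT)
  uri = URI.parse(url_string)
  # request_uri returns "/" for an empty path and keeps any query string.
  req = Net::HTTP::Get.new(uri.request_uri)
  req['User-Agent'] = user_agent
  [uri, req]
end

# Sends the request and returns the response object.
def fetch(url_string)
  uri, req = build_get(url_string)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
end

# Usage (performs a real network request):
#   res = fetch('http://www.forum.site/')
#   puts res.code
#   puts res.body
```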