Question

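I am currently writing a parser in Python to automatically extract some information from a website. I am using mechanize to browse the site. I get the following HTML code: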

<html>
 <head>
  <title>
   XXXXX
  </title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8; no-cache;" />
  <link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
  <link rel="stylesheet" href="/rr/style_other.css" type="text/css" />
 </head>
 <frameset cols="*,370" border="1">
  <frame src="affiche_cp.php?uid=yyyyyyy&amp;type=entree" name="cdrg" />
  <frame src="affiche_bp.php?uid=yyyyyyy&amp;type=entree" name="cdrd" />
 </frameset>
</html>

I want to access both frames:

cdrd

In cdrg I will get the result of the submission.

How can I do this?

Answer 1

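Personally, I don't use BeautifulSoup to parse HTML. I use PyQuery instead, which is similar, because I prefer its CSS selector syntax over XPath. I also use Requests to make HTTP requests.

Those two libraries are enough to scrape data and submit requests, so they can do what you want. I understand this may not be the answer you are looking for, but it may still be useful to you.

Scraping the frames with PyQuery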

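import requests
import pyquery

# Fetch the page that contains the frameset
response = requests.get('http://example.com')
dom = pyquery.PyQuery(response.text)

# Select every <frame> element with a CSS selector
frames = dom('frame')

# Indexing yields the underlying lxml elements
frame_one = frames[0]
frame_two = frames[1]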
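
The `src` attributes in the question's frameset are relative, so each one has to be resolved against the page URL before the frame itself can be requested. The following is a minimal sketch of that step using only the standard library (swapped in here for PyQuery so it runs without extra packages). `http://example.com/` is an assumed base URL, and `FrameCollector` and `frame_urls` are hypothetical names, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameCollector(HTMLParser):
    """Collect the name and src of every <frame> tag."""

    def __init__(self):
        super().__init__()
        self.frames = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <frame ... />
        if tag == "frame":
            attributes = dict(attrs)
            self.frames[attributes.get("name")] = attributes.get("src")


def frame_urls(html, base_url):
    """Return {frame name: absolute URL} for the frames found in html."""
    collector = FrameCollector()
    collector.feed(html)
    return {name: urljoin(base_url, src) for name, src in collector.frames.items()}


# Frameset copied from the question
FRAMESET = """
<frameset cols="*,370" border="1">
 <frame src="affiche_cp.php?uid=yyyyyyy&amp;type=entree" name="cdrg" />
 <frame src="affiche_bp.php?uid=yyyyyyy&amp;type=entree" name="cdrd" />
</frameset>
"""

# Assumed base URL; use the real address of the frameset page instead
urls = frame_urls(FRAMESET, "http://example.com/")
print(urls["cdrg"])
print(urls["cdrd"])
```

Each resulting URL can then be passed to `requests.get` like any other page.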

Making HTTP requests

import requests

response = requests.post('http://example.com/signup', data={
    'username': 'someuser',
    'password': 'secret'
})

response_text = response.text

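data is a dictionary containing the POST data to submit to the form. You should use the Network panel in Chrome's developer tools, Fiddler, or Burp Suite to monitor the requests. While monitoring, submit both forms manually, then inspect the HTTP requests and recreate them with Requests.

Hope that helps. I work in this field, so if you need more information, feel free to reach out.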

Answer 2

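The solution to my problem was to load the first frame and fill in the form on that page. Then I load the second frame, where I can read the results associated with the form in the first frame.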
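
The workflow described above might be sketched as follows. This is only an illustration under stated assumptions: `submit_then_read` is a hypothetical helper, the form is assumed to be the first form on its page, and the field names have to be taken from the real form. The `browser` argument is expected to behave like a `mechanize.Browser` (`open`, `select_form`, item assignment, `submit`); reusing one browser object for both frames keeps any session cookies.

```python
from urllib.parse import urljoin


def submit_then_read(browser, base_url, form_frame_src, result_frame_src, fields):
    """Submit the form in one frame, then return the body of the other frame.

    browser: an object with the mechanize.Browser interface
    fields:  {form field name: value}; the names are placeholders here
    """
    # Open the frame page that holds the form (src is relative to base_url)
    browser.open(urljoin(base_url, form_frame_src))

    # Assumption: the form of interest is the first one on the page
    browser.select_form(nr=0)
    for name, value in fields.items():
        browser[name] = value
    browser.submit()

    # Load the other frame with the same browser and read its content
    return browser.open(urljoin(base_url, result_frame_src)).read()


# Possible usage with mechanize (not executed here; URL and field are made up):
#
#   import mechanize
#   br = mechanize.Browser()
#   html = submit_then_read(
#       br,
#       "http://example.com/",
#       "affiche_bp.php?uid=yyyyyyy&type=entree",
#       "affiche_cp.php?uid=yyyyyyy&type=entree",
#       {"some_field": "some value"},
#   )
```

Which frame holds the form and which shows the result is passed in as arguments, so the same helper works whichever way round the two frames are.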

Dynamic browsing with Python (Mechanize, BeautifulSoup ...)
