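Using lxml with Python

Date: 2016-03-05 12:04:08

Tags: python dom web-crawler lxml scraper

I am trying to scrape Google News using Python and lxml. Everything was going well, but when I tried to print each div's data in a for loop, it all went wrong. Here is my code: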

# -*- coding: utf-8 -*-

from stem import Signal
from stem.control import Controller
from lxml import html
from lxml import cssselect
from lxml import etree
import requests

proxies = {
    'http' : 'http://127.0.0.1:8123'
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}

url = "https://www.google.it/search?hl=en&tbm=nws&as_occt=any&tbs=cdr:1,cd_min:9/1/2014,cd_max:9/1/2014,sbd:1&as_nsrc=Daily%20Mail&start=0"

page = requests.get(url,proxies=proxies,headers=headers)
tree = html.fromstring(page.content)
results = tree.xpath('//div[@class="_cnc"]')

for div in results:
    print(div)

I get this output:

<Element div at 0x7f4154df9470>
<Element div at 0x7f4154df94c8>
<Element div at 0x7f4154df9520>
<Element div at 0x7f4154df9578>
<Element div at 0x7f4154df95d0>
<Element div at 0x7f4154df9628>
<Element div at 0x7f4154df9680>
<Element div at 0x7f4154df96d8>
<Element div at 0x7f4154df9730>
<Element div at 0x7f4154df9788>

I want to extract the title, href, and snippet from each div, with something like this:

....

for div in results:
    title = div.xpath('//a[@class="l _HId"]/text()')
    href = div.xpath('//a[@class="l _HId"]/@href')
    snippet = div.xpath('//div[@class="st"]/text()')
    #for example
    print(title)
....

When I try to print, I get the same full list of titles repeated once per div:

['Pro-Russian rebels lower demands in peace talks', "'If I want to I can take Kiev in a fortnight': Putin's threat to Europe ", 'Showing a yen for business: PM Modi and Japanese premier Abe ', 'Modi visit draws pledges of support from Japan', 'The spectre of the Army rises again over Pakistan', 'Protesters briefly storm Pakistan state TV station', 'Anti-government protesters storm Pakistan state television station ', 'Austerity debate flares as Europe recovery fades', 'Inquiries begin into nude celebrity photo leaks', 'He tried to sell intimate pictures of Jennifer Lawrence in return for ']
['Pro-Russian rebels lower demands in peace talks', "'If I want to I can take Kiev in a fortnight': Putin's threat to Europe ", 'Showing a yen for business: PM Modi and Japanese premier Abe ', 'Modi visit draws pledges of support from Japan', 'The spectre of the Army rises again over Pakistan', 'Protesters briefly storm Pakistan state TV station', 'Anti-government protesters storm Pakistan state television station ', 'Austerity debate flares as Europe recovery fades', 'Inquiries begin into nude celebrity photo leaks', 'He tried to sell intimate pictures of Jennifer Lawrence in return for ']
['Pro-Russian rebels lower demands in peace talks', "'If I want to I can take Kiev in a fortnight': Putin's threat to Europe ", 'Showing a yen for business: PM Modi and Japanese premier Abe ', 'Modi visit draws pledges of support from Japan', 'The spectre of the Army rises again over Pakistan', 'Protesters briefly storm Pakistan state TV station', 'Anti-government protesters storm Pakistan state television station ', 'Austerity debate flares as Europe recovery fades', 'Inquiries begin into nude celebrity photo leaks', 'He tried to sell intimate pictures of Jennifer Lawrence in return for ']
['Pro-Russian rebels lower demands in peace talks', "'If I want to I can take Kiev in a fortnight': Putin's threat to Europe ", 'Showing a yen for business: PM Modi and Japanese premier Abe ', 'Modi visit draws pledges of support from Japan', 'The spectre of the Army rises again over Pakistan', 'Protesters briefly storm Pakistan state TV station', 'Anti-government protesters storm Pakistan state television station ', 'Austerity debate flares as Europe recovery fades', 'Inquiries begin into nude celebrity photo leaks', 'He tried to sell intimate pictures of Jennifer Lawrence in return for ']
['Pro-Russian rebels lower demands in peace talks', "'If I want to I can take Kiev in a fortnight': Putin's threat to Europe ", 'Showing a yen for business: PM Modi and Japanese premier Abe ', 'Modi visit draws pledges of support from Japan', 'The spectre of the Army rises again over Pakistan', 'Protesters briefly storm Pakistan state TV station', 'Anti-government protesters storm Pakistan state television station ', 'Austerity debate flares as Europe recovery fades', 'Inquiries begin into nude celebrity photo leaks', 'He tried to sell intimate pictures of Jennifer Lawrence in return for ']
['Pro-Russian rebels lower demands in peace talks', "'If I want to I can take Kiev in a fortnight': Putin's threat to Europe ", 'Showing a yen for business: PM Modi and Japanese premier Abe ', 'Modi visit draws pledges of support from Japan', 'The spectre of the Army rises again over Pakistan', 'Protesters briefly storm Pakistan state TV station', 'Anti-government protesters storm Pakistan state television station ', 'Austerity debate flares as Europe recovery fades', 'Inquiries begin into nude celebrity photo leaks', 'He tried to sell intimate pictures of Jennifer Lawrence in return for ']
['Pro-Russian rebels lower demands in peace talks', "'If I want to I can take Kiev in a fortnight': Putin's threat to Europe ", 'Showing a yen for business: PM Modi and Japanese premier Abe ', 'Modi visit draws pledges of support from Japan', 'The spectre of the Army rises again over Pakistan', 'Protesters briefly storm Pakistan state TV station', 'Anti-government protesters storm Pakistan state television station ', 'Austerity debate flares as Europe recovery fades', 'Inquiries begin into nude celebrity photo leaks', 'He tried to sell intimate pictures of Jennifer Lawrence in return for ']
['Pro-Russian rebels lower demands in peace talks', "'If I want to I can take Kiev in a fortnight': Putin's threat to Europe ", 'Showing a yen for business: PM Modi and Japanese premier Abe ', 'Modi visit draws pledges of support from Japan', 'The spectre of the Army rises again over Pakistan', 'Protesters briefly storm Pakistan state TV station', 'Anti-government protesters storm Pakistan state television station ', 'Austerity debate flares as Europe recovery fades', 'Inquiries begin into nude celebrity photo leaks', 'He tried to sell intimate pictures of Jennifer Lawrence in return for ']
['Pro-Russian rebels lower demands in peace talks', "'If I want to I can take Kiev in a fortnight': Putin's threat to Europe ", 'Showing a yen for business: PM Modi and Japanese premier Abe ', 'Modi visit draws pledges of support from Japan', 'The spectre of the Army rises again over Pakistan', 'Protesters briefly storm Pakistan state TV station', 'Anti-government protesters storm Pakistan state television station ', 'Austerity debate flares as Europe recovery fades', 'Inquiries begin into nude celebrity photo leaks', 'He tried to sell intimate pictures of Jennifer Lawrence in return for ']
['Pro-Russian rebels lower demands in peace talks', "'If I want to I can take Kiev in a fortnight': Putin's threat to Europe ", 'Showing a yen for business: PM Modi and Japanese premier Abe ', 'Modi visit draws pledges of support from Japan', 'The spectre of the Army rises again over Pakistan', 'Protesters briefly storm Pakistan state TV station', 'Anti-government protesters storm Pakistan state television station ', 'Austerity debate flares as Europe recovery fades', 'Inquiries begin into nude celebrity photo leaks', 'He tried to sell intimate pictures of Jennifer Lawrence in return for ']

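Does anyone know what is wrong with my code?

1 answer:

Answer 0 (score: 0)

You're almost there. Just prefix the inner XPath expressions with a dot so that they are evaluated relative to the current node. Without the dot, an expression starting with // searches the whole document, no matter which element it is called on, which is why every iteration printed the same list: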

for div in results:
    title = div.xpath('.//a[@class="l _HId"]/text()')
    href = div.xpath('.//a[@class="l _HId"]/@href')
    snippet = div.xpath('.//div[@class="st"]/text()')
    #for example
    print(title)
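
To see the difference without hitting Google, here is a minimal offline sketch. The HTML in SAMPLE is made up for illustration and only borrows the class names from the question; the real page structure may differ.

```python
from lxml import html

# Hypothetical markup: two result blocks reusing the class names from the question.
SAMPLE = """
<html><body>
  <div class="_cnc"><a class="l _HId" href="/a">First</a><div class="st">Snippet one</div></div>
  <div class="_cnc"><a class="l _HId" href="/b">Second</a><div class="st">Snippet two</div></div>
</body></html>
"""

tree = html.fromstring(SAMPLE)
results = tree.xpath('//div[@class="_cnc"]')

# '//' starts at the document root, so each div returns every title on the page.
absolute_titles = [div.xpath('//a[@class="l _HId"]/text()') for div in results]

# './/' starts at the current div, so each div returns only its own title.
relative_titles = [div.xpath('.//a[@class="l _HId"]/text()') for div in results]

# Collect one record per result block using the relative expressions.
rows = []
for div in results:
    rows.append({
        'title': div.xpath('.//a[@class="l _HId"]/text()'),
        'href': div.xpath('.//a[@class="l _HId"]/@href'),
        'snippet': div.xpath('.//div[@class="st"]/text()'),
    })

print(absolute_titles)
print(relative_titles)
print(rows)
```

With the absolute form, both divs yield the same two-item list; with the relative form, each div yields a single-item list of its own.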