All <a> tags under a span tag using scrapy

Time: 2016-11-18 00:33:51

Tags: python scrapy web-crawler

I am using scrapy to extract data from the web. I am trying to extract the text of the anchor tags under a span tag, as shown below:

<span>.....</span>
<span id = "size_selection_list">
    <a>....</a>
    <a>....</a>
    .
    .
    .
    <a>....</a>
</span>

I am using the following xpath logic:

t = sel.xpath('//div[starts-with(@id,"size_selection_container")]/span[2]')
for x in t.xpath('.//a'):
....

The span element is matched, but the loop over the <a> tags never runs. What is the mistake here? Also, each <a> has an href that contains JavaScript. Could that be the cause of the problem?

2 answers:

Answer 0: (score: 0)

If it were me, I would use requests and BeautifulSoup4.

Note that this code is untested, but it should work.

import requests
from bs4 import BeautifulSoup

r = requests.get(yoururlhere).text  # yoururlhere: the URL of the page to scrape
soup = BeautifulSoup(r, 'html.parser')  # You can use lxml or another parser; the standard parser is used here for compatibility
span = soup.find('div', {'class': 'theclass'})  # 'theclass' is a placeholder; adjust the tag and attributes to your page
tags = span.find_all('a', href=True)  # only anchors that have an href attribute
for i in tags:
    print(i.get_text())  # the link text
    print(i['href'])  # the link target

I hope this works; please let me know.
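
The same approach can be tried offline, without a network request. The following is a minimal sketch, assuming the markup from the question; the link texts and the javascript:void(0) hrefs are invented for illustration, and the span is looked up by the id shown in the question rather than by a class.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the link texts and hrefs are
# hypothetical and only serve to make the example runnable.
html = '''
<span>.....</span>
<span id="size_selection_list">
    <a href="javascript:void(0)">Small</a>
    <a href="javascript:void(0)">Medium</a>
    <a href="javascript:void(0)">Large</a>
</span>
'''

soup = BeautifulSoup(html, 'html.parser')

# Find the span by its id, then collect every anchor that has an href.
span = soup.find('span', id='size_selection_list')
anchors = span.find_all('a', href=True)

texts = [a.get_text(strip=True) for a in anchors]
hrefs = [a['href'] for a in anchors]

print(texts)
print(hrefs)
```

If find() returns None on the real page, the span is probably not present in the HTML that was downloaded.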

Answer 1: (score: 0)

This is all I can do; your HTML is incomplete.

import lxml.html
string = '''<span>.....</span>
<span id = "size_selection_list">
    <a>....</a>
    <a>....</a>
    .
    .
    .
    <a>....</a>
</span>'''

html = lxml.html.fromstring(string)
for a in html.xpath('//span[@id="size_selection_list"]//a'):
    print(a.tag)

Out:

a
a
a