Question

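I am using Scrapy to scrape an entire page. For some reason, my expression (the XPath in the code below) is wrong.

Here is the relevant part of my code: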

def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.xpath('//li')
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            item["title"] = titles.xpath("a/text()").extract()
            item["link"] = titles.xpath("a/@href").extract()
            items.append(item)
        return(items)

I want to parse all the links inside the <li> elements and get each URL and its anchor text.

Answer 1

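You do not need to wrap the response object in an HtmlXPathSelector, because the response supports selectors by default. The wrapper is only needed if you are doing something nasty, such as loading a file yourself and feeding it to the parse_items function.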

I would try:

for title in titles:
    item = CraigslistSampleItem()
    item["title"] = title.xpath("./a/text()").extract()
    item["link"] = title.xpath("./a/@href").extract()
    items.append(item)

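You are overusing the titles variable: it is both the list containing the li elements and the variable for each element. That is basically wrong. Use title as the loop variable.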
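A small plain-Python sketch of what reusing the name does. The strings stand in for the li selectors and are made up for illustration. The loop still visits every element, because Python creates the iterator before the name is rebound, but afterwards titles no longer refers to the list, which is easy to trip over in later code.

```python
def shadowing_demo():
    titles = ["first", "second", "third"]  # stand-in for the list of <li> selectors
    seen = []
    for titles in titles:  # the loop variable reuses the list's name
        seen.append(titles)
    # After the loop, `titles` is bound to the last element, not the list.
    return seen, titles


seen, leftover = shadowing_demo()
print(seen)
print(leftover)
```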

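However, if your HTML has multiple a tags under a single li, you should consider a different approach, because each item will then hold a list of URLs and a list of their titles.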
