Question

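I'm scraping this web page for personal use, https://asheville.craigslist.org/search/fua, and I'm having trouble extracting the thumbnail image for each item on the page. When I use "Inspect" to view the HTML DOM, I can see the img tags that contain the .jpg I need, but when I use "View Page Source" the img tags don't show up. At first I thought this might be an asynchronous JavaScript loading issue, but a reliable source told me that I should be able to scrape the thumbnails directly with BeautifulSoup.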

import lxml
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()

r = requests.get("https://asheville.craigslist.org/search/fua", params=dict(postal=28804), headers={"user-agent":ua.chrome})
soup = BeautifulSoup(r.content, "lxml")
for post in soup.find_all('li', "result-row"):
    for post_content in post.findAll("a", "result-image gallery"):
        print(post_content['href'])
        for pic in post_content.findAll("img", {'alt class': 'thumb'}):
            print(pic['src'])

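Can someone clear up my misunderstanding here? The value of the href attribute from the "a" tag prints, but I can't seem to get the src attribute of the "img" tag to print. Thanks in advance!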

Answer 1

I was able to read the img tags using the following code:

for post in soup.find_all('li', "result-row"):
    for post_content in post.find_all("a", "result-image gallery"):
        print(post_content['href'])
        for pic in post_content.find_all("img"):
            print(pic['src'])
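
One likely reason the loop in the question printed nothing for the img tags is the filter {'alt class': 'thumb'}: BeautifulSoup reads that dictionary key as a single attribute literally named "alt class", and no tag has such an attribute. Below is a minimal sketch of the difference. The markup in SAMPLE_HTML and the example.org URLs are made up for illustration (the real page may be structured differently), and it uses the built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating one search result; the real page may differ.
SAMPLE_HTML = """
<ul>
  <li class="result-row">
    <a href="https://example.org/post/1.html" class="result-image gallery">
      <img alt="" class="thumb" src="https://example.org/images/1_300x300.jpg">
    </a>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
link = soup.find("a", class_="result-image")

# The filter from the question: looks for one attribute called "alt class".
# "alt" and "class" are two separate attributes, so nothing matches.
no_match = link.find_all("img", {"alt class": "thumb"})

# Filtering on the class attribute alone finds the thumbnail.
by_class = link.find_all("img", class_="thumb")
thumb_urls = [img["src"] for img in by_class]

print(len(no_match))
print(thumb_urls)
```

Dropping the filter entirely, as in the code above, also works as long as every img inside the link is a thumbnail you want.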

A few thoughts on scraping from craigslist:

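Limit your number of requests per second. I have heard that craigslist will put a temporary block on your IP address if you exceed a certain request frequency.
Each post appears to load only one or two images. On closer inspection, the carousel images are not loaded unless you click the arrow. If you need every photo from every post, you should find a different way to write the script, possibly by visiting the link of each post that contains multiple images.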
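
The rate-limiting advice above can be sketched as a small helper that spaces out calls. This is only an illustration: throttled is a hypothetical name, and the one-second default is a guess, since I don't know what request rate craigslist actually tolerates.

```python
import time


def throttled(urls, fetch, min_interval=1.0):
    """Yield (url, fetch(url)) pairs, starting calls at least
    min_interval seconds apart.

    fetch is any callable that takes a URL; the 1.0 second default is an
    assumed value, not a documented craigslist limit.
    """
    last_start = None
    for url in urls:
        if last_start is not None:
            # Sleep only for whatever remains of the interval.
            remaining = min_interval - (time.monotonic() - last_start)
            if remaining > 0:
                time.sleep(remaining)
        last_start = time.monotonic()
        yield url, fetch(url)
```

With requests, fetch could be something like lambda u: requests.get(u, headers={"user-agent": ua.chrome}), and urls could be the href values collected from the search page if you decide to visit each post for its full set of images.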

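Also, I think Selenium is great for web scraping. You may not need it for this project, but it lets you do much more, such as clicking buttons and entering form data. Here is a quick script I used to scrape the data with Selenium: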

from bs4 import BeautifulSoup
from selenium import webdriver

def test():
    url = "https://asheville.craigslist.org/search/fua"
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Grab the HTML as the browser currently has it.
        html = driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
    # The "lxml" parser requires the lxml package to be installed.
    soup = BeautifulSoup(html, "lxml")
    for post in soup.find_all('li', "result-row"):
        for post_content in post.find_all("a", "result-image gallery"):
            print(post_content['href'])
            for pic in post_content.find_all("img"):
                print(pic['src'])
