Question

I am using BeautifulSoup to scrape https://readwrite.com/category/fintech/

Here is the code, and it works fine:

SELECT T.currency_id, T.mindate
FROM (
    SELECT * , MIN( DATE ) AS mindate
    FROM investments
    GROUP BY investments.currency_id
    ORDER BY mindate ASC
) AS T
JOIN currencies ON T.currency_id = currencies.currency_id

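The problem is that it only gets 8 titles, because the page loads only 8 articles by default. If you scroll down, the page updates and shows more news.

I want to scrape more articles.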

Answer 1

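When you scroll down, the page makes an XHR request to load more articles. Once the request completes, JavaScript appends the new data after the data that is already loaded. (This technique is called infinite scrolling.)
If you look at the Network tab in your browser, you can see that it requests these URLs:
https://readwrite.com/category/fintech/?paged1=2
https://readwrite.com/category/fintech/?paged1=3 (and so on)

So you just need to scrape these URLs one after another, incrementing the page number each time.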
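
A minimal sketch of that incremental approach, using only the standard library. The `paged1` parameter comes from the URLs above; the helper names (`page_url`, `fetch_html`, `fetch_pages`), the browser-like User-Agent header, and the page range are assumptions for illustration, not something the site documents.

```python
from urllib.request import Request, urlopen

# URL pattern taken from the requests seen in the Network tab
BASE_URL = "https://readwrite.com/category/fintech/?paged1={}"


def page_url(page_no):
    """Build the URL for one page of the category listing."""
    return BASE_URL.format(page_no)


def fetch_html(url, timeout=10):
    """Download one page and return its HTML as text."""
    # Assumption: a browser-like User-Agent is sent in case the site
    # rejects the default urllib one.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_pages(first=1, last=3):
    """Yield (page_no, html) for each page in the inclusive range."""
    for page_no in range(first, last + 1):
        yield page_no, fetch_html(page_url(page_no))
```

For example, `for page_no, html in fetch_pages(1, 3): ...` would download the first three pages, and each `html` string can then be handed to BeautifulSoup.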

Answer 2

You can do something like this to parse all the titles from that page:

from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

page_no = 0
page_link = "https://readwrite.com/category/fintech/?paged1={}"

while True:
    page_no += 1
    # Build the request for the current page and download its HTML
    req = Request(page_link.format(page_no))
    page = urlopen(req).read()
    soup = BeautifulSoup(page, 'lxml')
    container = soup.select('article')
    # Stop once a page returns at most one <article> (nothing more to load)
    if len(container) <= 1:
        break

    for content in container:
        # The headline is the link inside the element with class "title"
        title = content.select_one(".title a").text
        print(title)
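
The same loop can also be sketched with parsing separated from downloading, so it can be tried without hitting the site. This is an illustration under assumptions: `extract_titles` and `collect_titles` are hypothetical helper names, `fetch` is any callable you supply that returns the HTML for a given page number, `max_pages` is an arbitrary safety cap, and the loop stops on the first page with no titles rather than using the `<= 1` article check above. It uses the built-in `html.parser` so lxml is not required.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the text of every '.title a' link found inside an <article>."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for article in soup.select("article"):
        link = article.select_one(".title a")
        if link is not None:  # skip articles that have no title link
            titles.append(link.get_text(strip=True))
    return titles


def collect_titles(fetch, max_pages=50):
    """Call fetch(page_no) for pages 1, 2, ... until a page has no titles."""
    all_titles = []
    for page_no in range(1, max_pages + 1):
        titles = extract_titles(fetch(page_no))
        if not titles:  # an empty page is treated as the end of the listing
            break
        all_titles.extend(titles)
    return all_titles
```

Here `fetch` could be a small function that downloads `page_link.format(page_no)`; passing it in keeps the stop condition and the safety cap in one place and avoids an unbounded `while True` loop if the site never returns a short page.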

Web scraping a page that updates on scroll
