Question

I'm new to Python and I'm trying to scrape data from a website, but I need all of the pages. So far I have:

import requests
from bs4 import BeautifulSoup


r = requests.get("http://www.somesite.com/records/08-jan-2016/")
soup = BeautifulSoup(r.content, "html.parser")
full_info = soup.find_all("div", {"class": "col-sm-10"})

for item in full_info:
    print(item.text)

This code prints the data from the current page. How can I get the data from all pages and export it to a file?

Best regards

Answer 1

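To add to the question raised in the comments about how to iterate over multiple dates: I'm not the most skilled programmer, but I would create a dictionary of key: value pairs mapping each month to the number of days in that month. You can then use a nested loop to build the strings to append to the URL.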
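
A minimal sketch of that idea follows. It assumes the URL pattern from the question (day-month-year, such as 08-jan-2016) and that 2016 is the year of interest, so February gets 29 days. The names DAYS_IN_MONTH, BASE_URL and build_urls are made up for illustration, and only three months are listed.

```python
# Days per month for 2016 (a leap year, so February has 29 days).
# Only three months are listed here; extend the dictionary as needed.
DAYS_IN_MONTH = {"jan": 31, "feb": 29, "mar": 31}

BASE_URL = "http://www.somesite.com/records/"  # base URL from the question


def build_urls(year, days_in_month):
    """Return one URL per calendar day, formatted like .../08-jan-2016/."""
    urls = []
    for month, day_count in days_in_month.items():  # outer loop: months
        for day in range(1, day_count + 1):  # inner loop: days of that month
            urls.append("{}{:02d}-{}-{}/".format(BASE_URL, day, month, year))
    return urls


urls = build_urls(2016, DAYS_IN_MONTH)
print(urls[0])    # first generated URL
print(len(urls))  # number of URLs generated
```

Each URL in the list can then be passed to requests.get in place of the single hard-coded address. Note that the dictionary has to be adjusted by hand for non-leap years.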

Answer 2

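Personally, I would use the datetime library for the date arithmetic, since that is what it is designed for. However, because datetime's strftime is locale-dependent, it is safer to build the string by hand, unless you plan to run this under a known locale that matches the website.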

import datetime
MONTH_NAMES = {1: 'jan', 2: 'feb', 3: 'mar'}  # and so on
ONE_DAY = datetime.timedelta(1)

def date_strings(first_date, last_date):
    current_date = first_date
    while current_date <= last_date:
        yield '{0.day:02}-{1}-{0.year:04}'.format(
            current_date, MONTH_NAMES[current_date.month])
        # If running on a US locale, you can just use:
        # yield current_date.strftime('%d-%b-%Y').lower()
        current_date += ONE_DAY

first_date = datetime.date(2016, 1, 8)
last_date = datetime.date(2016, 3, 29)

for date_string in date_strings(first_date, last_date):
    print(date_string)
    # Do whatever scraping you need using date_string
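
To cover the "export to a file" part of the question, here is a rough sketch of how the generated dates could drive the scraping loop. It assumes the URL pattern from the question. The names page_urls, export_all and fetch_texts are hypothetical: fetch_texts stands for whatever function wraps the requests + BeautifulSoup code and returns a list of strings for one URL. It is passed in as a parameter so the sketch runs without network access.

```python
import datetime

BASE_URL = "http://www.somesite.com/records/"  # base URL from the question
MONTH_NAMES = {1: "jan", 2: "feb", 3: "mar"}  # and so on


def page_urls(first_date, last_date):
    """Yield one URL per day from first_date to last_date inclusive."""
    current = first_date
    while current <= last_date:
        yield "{}{:02d}-{}-{:04d}/".format(
            BASE_URL, current.day, MONTH_NAMES[current.month], current.year)
        current += datetime.timedelta(days=1)


def export_all(urls, fetch_texts, out_file):
    """Write every text item from every page to out_file, one per line.

    fetch_texts is a hypothetical callable that takes a URL and returns a
    list of strings, for example the item.text values from the question.
    Returns the number of lines written.
    """
    count = 0
    for url in urls:
        for text in fetch_texts(url):
            out_file.write(text.strip() + "\n")
            count += 1
    return count
```

In practice you would open the output with something like open("records.txt", "w", encoding="utf-8") and pass that file object as out_file, where records.txt is just a placeholder name.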

Scraping all pages of a website with Python + Beautiful Soup

2 answers: