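How to use BeautifulSoup

Time: 2018-03-22 16:50:52

Tags: python html beautifulsoup

I want to retrieve the different categories from a news site. I'm using BeautifulSoup to get the article titles on the right-hand side. How can I loop over the various categories on the left side of the site? I've only just started learning this kind of code and don't really understand how it works yet. Any help would be appreciated. This is the site I'm working with: http://query.nytimes.com/search/sitesearch/#/*/ Below is my code, which returns the headlines of the articles on the right: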

import json
from urllib.request import urlopen  # Python 3; the original post used urllib2 (Python 2)

# Fetch the site-search JSON endpoint and decode the response body
resp = urlopen("https://query.nytimes.com/svc/add/v1/sitesearch.json")

content = resp.read()
j = json.loads(content)

# Each entry in response.docs describes one article
articles = j['response']['docs']
headlines = [article['headline']['main'] for article in articles]
for headline in headlines:
    print(headline)

2 answers:

Answer 0 (score: 2)

If I understand correctly, you can get these articles by changing the API query:

import requests

data_range = ['24hours', '7days', '30days', '365days']
news_feed = {}

# Reuse one session for all requests
with requests.Session() as s:
    for rng in data_range:
        # begin_date takes values such as "24hoursago" or "7daysago"
        news_feed[rng] = s.get('http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date={}ago&facet=true'.format(rng)).json()

Then access the values like this:

print(news_feed) #or print(news_feed['30days'])
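Printing the whole dictionary dumps the raw JSON for every range. A minimal sketch of pulling out only the headlines per range follows. It assumes every response has the response → docs → headline → main layout used in the question's code; the function name headlines_by_range and the sample data are hypothetical, made up for illustration and not real API output.

```python
def headlines_by_range(news_feed):
    """Map each date range to the list of headline strings in its response."""
    result = {}
    for rng, payload in news_feed.items():
        docs = payload.get('response', {}).get('docs', [])
        titles = []
        for doc in docs:
            # Skip documents that lack a headline instead of raising KeyError
            main = (doc.get('headline') or {}).get('main')
            if main:
                titles.append(main)
        result[rng] = titles
    return result


# Hand-made stand-in for two decoded responses (not real API output)
sample_feed = {
    '24hours': {'response': {'docs': [
        {'headline': {'main': 'First headline'}},
        {'headline': {}},  # no "main" key, so this one is skipped
        {'headline': {'main': 'Second headline'}},
    ]}},
    '7days': {'response': {'docs': []}},
}

for rng, titles in headlines_by_range(sample_feed).items():
    print(rng, titles)
```

With the real data you would pass the news_feed dictionary built above instead of sample_feed.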

Edit

To query additional pages, you can try:

import requests

data_range = ['7days']
news_feed = {}

with requests.Session() as s:
    for rng in data_range:
        # Start a fresh list and page counter for each date range
        news_list = []
        page = 1
        while page < 20:  # this is limited to 120
            news_list.append(s.get('http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date={}ago&page={}&facet=true'.format(rng, page)).json())
            page += 1
        news_feed[rng] = news_list

# Each element is the decoded JSON of one results page
for new in news_feed['7days']:
    print(new)
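The loop above stores one raw response per page. A possible variation, sketched here, flattens all pages into a single list of articles and stops early once a page comes back empty. Treating an empty docs list as the end of the results is an assumption, not something the answer confirms. The names fetch_docs and get_json are hypothetical; get_json is any callable that takes a URL and returns decoded JSON, which keeps the paging logic separate from the HTTP call.

```python
URL = ('http://query.nytimes.com/svc/add/v1/sitesearch.json'
       '?begin_date={}ago&page={}&facet=true')


def fetch_docs(get_json, rng, max_pages=19):
    """Collect the docs of successive pages into one flat list.

    get_json: callable taking a URL and returning the decoded JSON (a dict).
    rng: date range such as '7days'.
    max_pages: upper bound on pages requested (the loop above reads 1 to 19).
    """
    docs = []
    for page in range(1, max_pages + 1):
        payload = get_json(URL.format(rng, page))
        page_docs = payload.get('response', {}).get('docs', [])
        if not page_docs:
            # Assumption: an empty page means there are no further results
            break
        docs.extend(page_docs)
    return docs
```

With requests this could be called as fetch_docs(lambda url: s.get(url).json(), '7days') inside the with requests.Session() as s: block.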

Answer 1 (score: 1)

First, instead of using urllib + json to parse the JSON response, you can use the requests module and its built-in .json() method.

Example:

import requests

r = requests.get("https://query.nytimes.com/svc/add/v1/sitesearch.json")
json_data = r.json()
# rest of the code is same

Now, to scrape the Date Range tabs, first open Developer Tools > Network > XHR. Then click any of the tabs. For example, if you click the Past 24 Hours tab, you will see an AJAX request made to this URL:

http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date=24hoursago&facet=true

If you click Past 7 Days, you will see this URL:

http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date=7daysago&facet=true

In general, you can build these URLs with a format string:

url = "http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date={}&facet=true"
past_24_hours = url.format('24hoursago')

r = requests.get(past_24_hours)
data = r.json()
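As an alternative to string formatting, the query string can be built from a dictionary, which makes it easier to add parameters later. The sketch below uses urllib.parse.urlencode from the standard library; the helper name build_url is hypothetical, and the only parameters used are the two visible in the URLs above (begin_date and facet).

```python
from urllib.parse import urlencode

BASE_URL = "http://query.nytimes.com/svc/add/v1/sitesearch.json"


def build_url(begin_date, facet=True):
    """Return the search URL for a begin_date value such as '24hoursago'."""
    params = {
        'begin_date': begin_date,
        # The observed URLs spell the flag as the lowercase string "true"
        'facet': 'true' if facet else 'false',
    }
    return BASE_URL + '?' + urlencode(params)


print(build_url('24hoursago'))
print(build_url('7daysago'))
```

requests can also do the encoding itself if you pass the dictionary as params, for example requests.get(BASE_URL, params={'begin_date': '24hoursago', 'facet': 'true'}).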

This gives you all the news items in the JSON object data.

For example, you can get the news headlines like this:

for item in data['response']['docs']:
    print(item['headline']['main'])

Output:

Austrian Lawmakers Vote to Hinder Smoking Ban in Restaurants and Bars
Soccer-Argentine World Cup Winner Houseman Dies Aged 64
Response to UK Spy Attack Not Expected at EU Summit: French Source
Florida Man Reunites With Pet Cat Lost 14 Years Ago
Citigroup Puts Restrictions on Gun Sales
EU Exemptions From U.S. Steel Tariffs 'Possible but Not Certain': French Source
Trump Initiates Trade Action Against China
Trump’s Trade Threats Put China’s Leader on the Spot
Poland Plans Concessions in Judicial Reforms to Ease EU Concerns: Lawmaker
Florida Bridge Collapse Victim's Family Latest to Sue