如何在div另一个标记中提取span标记

时间:2018-01-17 14:33:59

标签: python html beautifulsoup

我使用Beautiful Soup在python中编写了一个代码,用于从IMDB中提取用户名和评级。但是有很多用户没有给他们的评论评分。很难准确地将评级与他们的评论进行对比。那我该怎么做呢?     http://www.imdb.com/title/tt2866360/reviews?ref_=tt_ov_rt 在此网址评论未指定评级。

url1 ="http://www.imdb.com/title/tt2866360/reviews?ref_=tt_ov_rt"

response = requests.get(url1, headers=headers)

page=response.content

soup=BeautifulSoup(page)

for k in soup.findAll('div',{"class":"load-more-data"}):

    if k.name == 'span' and m['class'] == "rating-other-user-rating":
        print blah()
    else:
        print blah 1()

这是检查评级部分是否存在于评论部分但是没有返回任何内容的代码?

2 个答案:

答案 0 :(得分:4)

您正在寻找的信息(用户名,评级)位于“div.review-container”标签中 关于没有评级的标签,您可以忽略它们。

for k in soup.find_all('div',{"class":"review-container"}):
    rating = k.find('span', class_='rating-other-user-rating')
    if rating:
        rating = ''.join(i.text for i in rating.find_all('span')[-2:])
    name = k.find('span', class_='display-name-link').text
    print name, rating

按下“加载更多”按钮时显示的信息是通过XHR请求加载的 您将找到所需的所有数据,以便在'div.load-more-data'标记中执行请求。

load_more = soup.find('div', class_='load-more-data')
url = 'http://www.imdb.com{}?paginationKey={}'.format(
    load_more['data-ajaxurl'], load_more['data-key']
    )
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

重复上述过程,直到获得所有信息。

import requests
from bs4 import BeautifulSoup

url = "http://www.imdb.com/title/tt2866360/reviews?ref_=tt_ov_rt"
ajax_url = url.split('?')[0] + "/_ajax?paginationKey={}"
reviews = []

while True:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')

    for k in soup.find_all('div',{"class":"review-container"}):
        rating = k.find('span', class_='rating-other-user-rating')
        if rating:
            rating = ''.join(i.text for i in rating.find_all('span')[-2:])
        name = k.find('span', class_='display-name-link').text
        reviews.append([name, rating])
        print name, rating

    load_more = soup.find('div', class_='load-more-data')
    if not load_more:
        break
    url = ajax_url.format(load_more['data-key'])

答案 1 :(得分:0)

我建议您尝试安排每次审核的<div class="review-container" ...内容。然后选择要检索的特定数据。