Question

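The following code extracts the Arbeitsatmosphäre and Stadt data from the reviews on the website below. However, the extraction is index-based, so if we want to extract Image (rating_tags[12]) instead of Arbeitsatmosphäre, it raises an error, because some reviews rate only 2 or 3 items.

I would like to update this code to get the following output. If a review has no Image rating, use 0 or N/A.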

         Arbeitsatmosphäre | Stadt     | Image |
   1.      4.00            | Berlin    | 4.00  |
   2.      5.00            | Frankfurt | 3.00  |
   3.      3.00            | Munich    | 3.00  |
   4.      5.00            | Berlin    | 2.00  |
   5.      4.00            | Berlin    | 5.00  |

My code is below:

import requests
from bs4 import BeautifulSoup
import pandas as pd

arbeit = []
stadt = []
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'https://www.kununu.com/de/volkswagenconsulting/kommentare/{page}'
        response = session.get(url)

        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article')
        print("Number of articles: " + str(len(articles)))
        for article in articles:

            rating_tags = article.find_all('span', {'class' : 'rating-badge'})

            arbeit.append(rating_tags[0].text.strip())


            detail_div = article.find_all('div', {'class' : 'review-details'})[0]
            nodes = detail_div.find_all('li')
            stadt_node = nodes[1]
            stadt_node_div = stadt_node.find_all('div')
            stadt_name = stadt_node_div[1].text.strip()
            stadt.append(stadt_name)

        page += 1

        pagination = soup.find_all('div', {'class' : 'paginationControl'})
        if not pagination:
            break

df = pd.DataFrame({'Arbeitsatmosphäre' : arbeit, 'Stadt' : stadt})
print(df)
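
One direction that might avoid the positional lookup is to key each score by its label and fall back to a default when a label is missing. The sketch below is only an illustration under assumptions: `ratings_by_label`, `WANTED`, and the sample data are hypothetical, and it assumes each review can first be reduced to (label, score) pairs. How to pair a label with its `rating-badge` span depends on the page markup, which is not shown here.

```python
# Hypothetical helper, not part of the original code.
# Columns we want in the final table, in order.
WANTED = ["Arbeitsatmosphäre", "Image"]


def ratings_by_label(pairs, wanted=WANTED, default="N/A"):
    """Map the (label, score) pairs of one review onto a fixed set of columns.

    Labels that the review does not contain are filled with `default`.
    """
    found = {label.strip(): score.strip() for label, score in pairs}
    return {name: found.get(name, default) for name in wanted}


# Made-up sample data standing in for two scraped reviews.
reviews = [
    [("Arbeitsatmosphäre", "4.00"), ("Image", "3.00")],
    [("Arbeitsatmosphäre", "5.00")],  # this review has no Image rating
]

rows = [ratings_by_label(review) for review in reviews]
print(rows)
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)`, so a review without an Image rating ends up with the default value instead of raising an IndexError.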

BeautifulSoup: extracting data by searching for a class

0 answers: