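The following code extracts the Arbeitsatmosphäre and Stadt values from the reviews on the website linked in the code below. However, the extraction is index-based, so if we want to extract Image (rating_tags[12]) instead of Arbeitsatmosphäre, it fails, because some reviews only rate 2 or 3 items.
I would like to update this code to produce the following output. If a review has no Image rating, use 0 or N/A.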
   Arbeitsatmosphäre | Stadt     | Image |
1. 4.00              | Berlin    | 4.00  |
2. 5.00              | Frankfurt | 3.00  |
3. 3.00              | Munich    | 3.00  |
4. 5.00              | Berlin    | 2.00  |
5. 4.00              | Berlin    | 5.00  |
My code is below:
import requests
from bs4 import BeautifulSoup
import pandas as pd

arbeit = []
stadt = []
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'https://www.kununu.com/de/volkswagenconsulting/kommentare/{page}'
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article')
        print("Number of articles: " + str(len(articles)))
        for article in articles:
            rating_tags = article.find_all('span', {'class' : 'rating-badge'})
            arbeit.append(rating_tags[0].text.strip())
            detail_div = article.find_all('div', {'class' : 'review-details'})[0]
            nodes = detail_div.find_all('li')
            stadt_node = nodes[1]
            stadt_node_div = stadt_node.find_all('div')
            stadt_name = stadt_node_div[1].text.strip()
            stadt.append(stadt_name)
        page += 1
        pagination = soup.find_all('div', {'class' : 'paginationControl'})
        if not pagination:
            break

df = pd.DataFrame({'Arbeitsatmosphäre' : arbeit, 'Stadt' : stadt})
print(df)
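
One direction that might avoid the index problem is to look each rating up by its label rather than by its position, and to fall back to a default when the label is missing. The sketch below only illustrates that idea on a small inline HTML sample. The `rating-group` and `rating-title` class names, the sample markup, and the helper names `ratings_by_label` and `extract_rows` are all assumptions made for illustration; only `rating-badge` comes from the code above, and the real page may nest these elements differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page structure is not confirmed here.
# Only the 'rating-badge' class is taken from the original code.
SAMPLE_HTML = """
<article>
  <div class="rating-group">
    <span class="rating-badge">4,00</span>
    <span class="rating-title">Arbeitsatmosphäre</span>
  </div>
  <div class="rating-group">
    <span class="rating-badge">3,00</span>
    <span class="rating-title">Image</span>
  </div>
</article>
<article>
  <div class="rating-group">
    <span class="rating-badge">5,00</span>
    <span class="rating-title">Arbeitsatmosphäre</span>
  </div>
</article>
"""


def ratings_by_label(article):
    """Map each rating label to its score for a single review."""
    ratings = {}
    for badge in article.find_all('span', {'class': 'rating-badge'}):
        # Assumption: the label sits next to the badge, under the same parent.
        parent = badge.parent
        title = parent.find('span', {'class': 'rating-title'}) if parent else None
        if title is not None:
            ratings[title.text.strip()] = badge.text.strip()
    return ratings


def extract_rows(html, wanted=('Arbeitsatmosphäre', 'Image'), missing='N/A'):
    """Return one dict per review, filling absent ratings with `missing`."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for article in soup.find_all('article'):
        ratings = ratings_by_label(article)
        rows.append({label: ratings.get(label, missing) for label in wanted})
    return rows


rows = extract_rows(SAMPLE_HTML)
```

Because every review yields a dict with the same keys, the list could be passed to `pd.DataFrame(rows)` directly, and a review without an Image rating would show `N/A` (or `0`, if `missing=0` is passed) instead of raising an `IndexError`.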