I want to fetch/select data from two different tables that share the same class.
I tried getting it with `soup.find_all`, but formatting the data is getting difficult.
There are many tables with the same class. I only need the values from the tables (without the labels).
URL: https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/
Table 1:
<div class="bh_collapsible-body" style="display: none;">
<table border="0" cellpadding="2" cellspacing="2" class="prop-list">
<tbody>
<tr>
<td class="item">
<table>
<tbody>
<tr>
<td class="label">Rim Material</td>
<td class="value">Alloy</td>
</tr>
</tbody>
</table>
</td>
<td class="item">
<table>
<tbody>
<tr>
<td class="label">Front Tyre Description</td>
<td class="value">215/55 R16</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td class="item">
<table>
<tbody>
<tr>
<td class="label">Front Rim Description</td>
<td class="value">16x7.0</td>
</tr>
</tbody>
</table>
</td>
<td class="item">
<table>
<tbody>
<tr>
<td class="label">Rear Tyre Description</td>
<td class="value">215/55 R16</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td class="item">
<table>
<tbody>
<tr>
<td class="label">Rear Rim Description</td>
<td class="value">16x7.0</td>
</tr>
</tbody>
</table>
</td>
<td></td>
</tr>
</tbody>
</table>
</div>
</div> <!-- I think this is an extra closing </div> -->
Table 2:
<div class="bh_collapsible-body" style="display: none;">
<table border="0" cellpadding="2" cellspacing="2" class="prop-list">
<tbody>
<tr>
<td class="item">
<table>
<tbody>
<tr>
<td class="label">Steering</td>
<td class="value">Rack and Pinion</td>
</tr>
</tbody>
</table>
</td>
<td></td>
</tr>
</tbody>
</table>
</div>
</div> <!-- I think this is an extra closing </div> -->
What I have tried:
I tried getting the contents of the first table via XPath, but it returned both the values and the labels:
table1 = driver.find_element_by_xpath("//*[@id='features']/div/div[5]/div[2]/div[1]/div[1]/div/div[2]/table/tbody/tr[1]/td[1]/table/tbody/tr/td[2]")
I tried splitting the data, but that didn't work. The page URL is provided above in case you want to take a look.
Answer 0 (score: 3)
This is not a perfect solution, but if you are willing to sift through the data a little, I would suggest using pandas' read_html function for this.
pandas' read_html extracts every HTML table on a web page and converts them into an array of pandas DataFrames.
This code seems to grab all 82 table elements on the page you linked:
import pandas as pd
import requests
url = "https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/"
#Need to add a fake header to avoid 403 forbidden error
header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}
resp = requests.get(url, headers=header)
table_dataframes = pd.read_html(resp.text)
for i, df in enumerate(table_dataframes):
    print(f"================Table {i}=================\n")
    print(df)
This will print out all 82 tables present on the web page. The limitation is that you will have to find the tables you are interested in manually and work from there; tables 71 and 74 appear to be the ones you want.
This method would need some additional intelligence to make automating it viable.
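One possible way to reduce the manual step (my own sketch, not part of the answer above) is to search the parsed DataFrames for a known label instead of remembering table indices; `find_table` below is a hypothetical helper, and the HTML literal is a tiny stand-in for `resp.text`:

```python
import io

import pandas as pd

# A small stand-in for the page HTML; with the real page you would pass
# io.StringIO(resp.text) instead of this literal snippet.
html = """
<table><tr><td>Engine Type</td><td>Piston</td></tr></table>
<table><tr><td>Rim Material</td><td>Alloy</td></tr>
       <tr><td>Front Tyre Description</td><td>215/55 R16</td></tr></table>
"""

table_dataframes = pd.read_html(io.StringIO(html))

def find_table(dfs, label):
    """Return the first DataFrame containing `label` in any cell, else None."""
    for df in dfs:
        if (df.astype(str) == label).any().any():
            return df
    return None

wheels = find_table(table_dataframes, "Rim Material")
print(wheels)
```

With this, a change in the page's table order would not silently break the lookup, at the cost of scanning every DataFrame.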
Answer 1 (score: 3)
Targeting these two tables is a little "tricky" because they contain other tables. I use the CSS selector `table:has(td:contains("Rim Material")):has(table) tr:not(:has(tr))` to target the first table, and the same selector with the string `"Steering"` to target the second one:
from bs4 import BeautifulSoup
import requests
url = 'https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/'
headers = {'User-Agent':'Mozilla/5.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')
rows = []
for tr in soup.select('table:has(td:contains("Rim Material")):has(table) tr:not(:has(tr)), table:has(td:contains("Steering")):has(table) tr:not(:has(tr))'):
    rows.append([td.get_text(strip=True) for td in tr.select('td')])

for label, text in rows:
    print('{: <30}: {}'.format(label, text))
Prints:
Steering : Rack and Pinion
Rim Material : Alloy
Front Tyre Description : 215/55 R16
Front Rim Description : 16x7.0
Rear Tyre Description : 215/55 R16
Rear Rim Description : 16x7.0
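As a side note, the role of the `tr:not(:has(tr))` part can be illustrated on a minimal nested-table snippet (my own reduction of the question's markup, not part of the original answer):

```python
from bs4 import BeautifulSoup

# A trimmed-down version of the question's nested-table HTML.
html = """
<table class="prop-list">
  <tr><td class="item">
    <table><tr>
      <td class="label">Rim Material</td><td class="value">Alloy</td>
    </tr></table>
  </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Without the filter, the outer <tr> (which wraps a whole inner table)
# is matched in addition to the inner one:
print(len(soup.select("table tr")))  # -> 2

# tr:not(:has(tr)) keeps only rows that contain no further rows,
# i.e. the innermost label/value pairs:
inner = soup.select("table tr:not(:has(tr))")
print([td.get_text(strip=True) for td in inner[0].select("td")])  # -> ['Rim Material', 'Alloy']
```

This is why the selector yields exactly one clean `[label, value]` pair per row instead of duplicated wrapper rows.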
EDIT: To get the data from multiple URLs:
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent':'Mozilla/5.0'}
urls = ['https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/',
        'https://www.redbook.com.au/cars/details/2019-genesis-g80-38-ultimate-auto-my19/SPOT-ITM-520697/']

for url in urls:
    soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

    rows = []
    for tr in soup.select('table:has(td:contains("Rim Material")):has(table) tr:not(:has(tr)), table:has(td:contains("Steering")):has(table) tr:not(:has(tr))'):
        rows.append([td.get_text(strip=True) for td in tr.select('td')])

    print('{: <30}: {}'.format('Title', soup.h1.text))
    print('-' * (len(soup.h1.text.strip())+32))
    for label, text in rows:
        print('{: <30}: {}'.format(label, text))
    print('*' * 80)
Prints:
Title : 2019 Honda Civic 50 Years Edition Auto MY19
---------------------------------------------------------------------------
Steering : Rack and Pinion
Rim Material : Alloy
Front Tyre Description : 215/55 R16
Front Rim Description : 16x7.0
Rear Tyre Description : 215/55 R16
Rear Rim Description : 16x7.0
********************************************************************************
Title : 2019 Genesis G80 3.8 Ultimate Auto MY19
-----------------------------------------------------------------------
Steering : Rack and Pinion
Rim Material : Alloy
Front Tyre Description : 245/40 R19
Front Rim Description : 19x8.5
Rear Tyre Description : 275/35 R19
Rear Rim Description : 19x9.0
********************************************************************************
Answer 2 (score: 0)
You don't have to do it one `xpath` at a time. You can use a single `xpath` to get all `<table class="prop-list">` elements, then pick a table from the list by index and use another xpath to get the values from that table.
I used BeautifulSoup for this, but it should be similar with xpath.
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/'
text = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
soup = BS(text, 'html.parser')
all_tables = soup.find_all('table', {'class': 'prop-list'}) # xpath('//table[@class="prop-list"]')
#print(len(all_tables))
print("\n--- Engine ---\n")
all_labels = all_tables[3].find_all('td', {'class': 'label'}) # xpath('.//td[@class="label"]')
all_values = all_tables[3].find_all('td', {'class': 'value'}) # xpath('.//td[@class="value"]')
for label, value in zip(all_labels, all_values):
    print('{}: {}'.format(label.text, value.text))
print("\n--- Fuel ---\n")
all_labels = all_tables[4].find_all('td', {'class': 'label'})
all_values = all_tables[4].find_all('td', {'class': 'value'})
for label, value in zip(all_labels, all_values):
    print('{}: {}'.format(label.text, value.text))
print("\n--- Steering ---\n")
all_labels = all_tables[7].find_all('td', {'class': 'label'})
all_values = all_tables[7].find_all('td', {'class': 'value'})
for label, value in zip(all_labels, all_values):
    print('{}: {}'.format(label.text, value.text))
print("\n--- Wheels ---\n")
all_labels = all_tables[8].find_all('td', {'class': 'label'})
all_values = all_tables[8].find_all('td', {'class': 'value'})
for label, value in zip(all_labels, all_values):
    print('{}: {}'.format(label.text, value.text))
Result:
--- Engine ---
Engine Type: Piston
Valves/Ports per Cylinder: 4
Engine Location: Front
Compression ratio: 10.6
Engine Size (cc) (cc): 1799
Engine Code: R18Z1
Induction: Aspirated
Power: 104kW @ 6500rpm
Engine Configuration: In-line
Torque: 174Nm @ 4300rpm
Cylinders: 4
Power to Weight Ratio (W/kg): 82.6
Camshaft: OHC with VVT & Lift
--- Fuel ---
Fuel Type: Petrol - Unleaded ULP
Fuel Average Distance (km): 734
Fuel Capacity (L): 47
Fuel Maximum Distance (km): 940
RON Rating: 91
Fuel Minimum Distance (km): 540
Fuel Delivery: Multi-Point Injection
CO2 Emission Combined (g/km): 148
Method of Delivery: Electronic Sequential
CO2 Extra Urban (g/km): 117
Fuel Consumption Combined (L/100km): 6.4
CO2 Urban (g/km): 202
Fuel Consumption Extra Urban (L/100km): 5
Emission Standard: Euro 5
Fuel Consumption Urban (L/100km): 8.7
--- Steering ---
Steering: Rack and Pinion
--- Wheels ---
Rim Material: Alloy
Front Tyre Description: 215/55 R16
Front Rim Description: 16x7.0
Rear Tyre Description: 215/55 R16
Rear Rim Description: 16x7.0
I assume that all pages have the same tables, numbered the same way.
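If that assumption ever fails (pages with a different table order), a more robust variant (my own sketch, building on the answer above) is to collect all label/value pairs into one dict and look properties up by name instead of by table index; the HTML literal here is a tiny stand-in for the real page:

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page HTML; on the real site you would
# build `soup` from requests.get(url, headers=...).text as above.
html = """
<table class="prop-list">
  <tr><td class="label">Steering</td><td class="value">Rack and Pinion</td></tr>
</table>
<table class="prop-list">
  <tr><td class="label">Rim Material</td><td class="value">Alloy</td></tr>
  <tr><td class="label">Front Rim Description</td><td class="value">16x7.0</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Merge every label/value pair from every prop-list table into one dict.
specs = {}
for table in soup.find_all('table', {'class': 'prop-list'}):
    labels = table.find_all('td', {'class': 'label'})
    values = table.find_all('td', {'class': 'value'})
    for label, value in zip(labels, values):
        specs[label.get_text(strip=True)] = value.get_text(strip=True)

# Look up by name instead of by table index:
print(specs["Steering"])      # -> Rack and Pinion
print(specs["Rim Material"])  # -> Alloy
```

The trade-off is that duplicate labels across tables would overwrite each other, so this only works when the labels are unique page-wide.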