HTML表格特定行刮痧

时间:2018-03-16 21:23:29

标签: python python-3.x selenium selenium-webdriver lxml

我想从特定的this table行中抓取数据。我只想要橙色/金色的行。 以前,我使用SIM提供的代码来抓取整个表信息,之后我操作它:

from selenium.webdriver import Chrome
from contextlib import closing
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

URL = "https://www.n2yo.com/passes/?s=39090&a=1"

chrome_options = Options()  
chrome_options.add_argument("--headless")

with closing(Chrome(chrome_options=chrome_options)) as driver:
    driver.get(URL)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for items in soup.select("#passestable tr"):
        data = [item.text for item in items.select("th,td")]
        print(data)

我不确定如何更改此代码以仅获取橙色/金色行。我在解析时尝试将颜色代码作为标记进行搜索,但它不起作用。任何和所有建议都赞赏。

感谢您的时间。

3 个答案:

答案 0 :(得分:3)

您可以使用正则表达式匹配颜色:

from selenium import webdriver
from bs4 import BeautifulSoup as soup
import re
d = driver.Chrome()
d.get("https://www.n2yo.com/passes/?s=39090&a=1")
s = soup(d.page_source, 'lxml')
data = [i.text for i in s.find_all('tr', {'bgcolor':re.compile('#FFFFFF|#FFFF33|#FFCC00')})]

输出:

[u'16-Mar 20:34N12\xb020:42W265\xb079\xb020:48SSW199\xb0-Map and details', u'17-Mar 07:51S178\xb007:58W260\xb052\xb008:05NNW341\xb0-Map and details', u'17-Mar 20:00NNE19\xb020:08E102\xb050\xb020:14S180\xb0-Map and details', u'18-Mar 07:17SSE160\xb007:24E83\xb077\xb007:31N349\xb0-Map and details', u'18-Mar 08:58SW217\xb009:04W269\xb013\xb009:09NW323\xb0-Map and details', u'18-Mar 21:06N6\xb021:13WNW295\xb041\xb021:19SW217\xb0-Map and details', u'19-Mar 06:43SE142\xb006:50ENE67\xb038\xb006:57N356\xb0-Map and details', u'19-Mar 08:23SSW196\xb008:30W268\xb027\xb008:36NNW333\xb0-Map and details', u'19-Mar 20:32N12\xb020:39WNW286\xb084\xb020:46SSW198\xb0-Map and details', u'20-Mar 07:48S177\xb007:55WSW254\xb055\xb008:02NNW342\xb0-Map and details', u'20-Mar 19:58NNE20\xb020:05E98\xb047\xb020:12S178\xb0-Map and details', u'21-Mar 07:14SSE159\xb007:22NE58\xb072\xb007:28N349\xb0-Map and details', u'21-Mar 08:55SW216\xb009:01W272\xb014\xb009:07NW325\xb0-Map and details', u'21-Mar 21:03N6\xb021:10WNW288\xb043\xb021:17SW215\xb0-Map and details', u'22-Mar 06:41SE141\xb006:48ENE70\xb036\xb006:54N356\xb0-Map and details', u'22-Mar 08:20S194\xb008:27W265\xb029\xb008:34NNW335\xb0-Map and details', u'22-Mar 20:29N13\xb020:36N348\xb086\xb020:43SSW196\xb0-Map and details', u'23-Mar 07:46S176\xb007:53W265\xb059\xb008:00NNW343\xb0-Map and details', u'23-Mar 19:55NNE20\xb020:02E94\xb045\xb020:09S177\xb0-Map and details', u'24-Mar 07:12SSE157\xb007:19ENE71\xb069\xb007:26N350\xb0-Map and details', u'24-Mar 08:53SW214\xb008:59W270\xb015\xb009:04NW325\xb0-Map and details', u'24-Mar 21:01N7\xb021:08WNW292\xb046\xb021:14SW214\xb0-Map and details', u'25-Mar 06:38SE139\xb006:45ENE65\xb034\xb006:52N357\xb0-Map and details', u'25-Mar 08:18S193\xb008:24W263\xb030\xb008:31NNW335\xb0-Map and details', u'25-Mar 18:49NE39\xb018:54E87\xb010\xb018:59SE134\xb0-Map and details', u'25-Mar 20:27N13\xb020:34SSE161\xb086\xb020:41S195\xb0-Map and details']

答案 1 :(得分:2)

尝试替换此行

for items in soup.select("#passestable tr"):

这一个

for items in soup.select("#passestable tr[bgcolor='#FFCC00'], #passestable tr[bgcolor='#FFFF33']"):

迭代仅需要颜色的tr个节点

请注意,这将返回所有橙色节点,然后才返回所有黄金节点

答案 2 :(得分:2)

您可以尝试使用selenium的另一种方法:

from lxml.html import fromstring
import requests

r = requests.get(URL)
html = fromstring((r.content).decode('utf-8'))
# only orange and yellow rows
rows = html.xpath('//tr[@bgcolor="#FFFF33" or @bgcolor="#FFCC00"]')
相关问题