Scraping with Beautiful Soup

Date: 2015-04-16 07:02:04

Tags: python beautifulsoup

I stumbled upon an excellent post on scraping with Beautiful Soup, and I decided to take on the task of scraping some data from the internet.

I am working with flight data from Flight Radar 24, and following what the blog post describes, I am trying to automatically walk through the flight data pages.

import requests
import bs4

root_url = 'http://www.flightradar24.com'
index_url = root_url + '/data/flights/tigerair-tgw/'


def get_flight_id_urls():
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    return [a.attrs.get('href') for a in soup.select('div.list-group a[href^="/data"]')]


flight_id_urls = get_flight_id_urls()
for flight_id_url in flight_id_urls:
    temp_url = root_url + flight_id_url
    response = requests.get(temp_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')

try:
    table = soup.find('table')
    rows = table.find_all('tr')
    for row in rows:
        flight_data = {}
        flight_data['title'] = soup.select('div#cntPagePreTitle h1')[0].get_text()
        flight_data['tr'] = row  # error here
        print(flight_data)

except AttributeError as e:
    raise ValueError("No valid table found")

A sample of the flight data page.

I stumbled my way to the table, and then realized I have no idea how to traverse the table's attributes to get the data embedded in each column.

Would any kind soul have any pointers, or even an introductory tutorial, so I can read up on how to extract the data?

P.S.: I have been working from Miguel Grinberg's excellent tutorial.

Added:

try:
    table = soup.find('table')
    rows = table.find_all('tr')
    heads = [i.text.strip() for i in table.select('thead th')]
    for tr in table.select('tbody tr'):
        flight_data = {}
        flight_data['title'] = soup.select('div#cntPagePreTitle h1')[0].get_text()
        flight_data['From'] = tr.select('td.From')
        flight_data['To'] = tr.select('td.To')

        print(flight_data)

except AttributeError as e:
    raise ValueError("No valid table found")

I changed the last part of the code to build a data object, but I still can't seem to get the data out.

Final edit:

import requests
import bs4

root_url = 'http://www.flightradar24.com'
index_url = root_url + '/data/flights/tigerair-tgw/'


def get_flight_id_urls():
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    return [a.attrs.get('href') for a in soup.select('div.list-group a[href^="/data"]')]


flight_id_urls = get_flight_id_urls()
for flight_id_url in flight_id_urls:
    temp_url = root_url + flight_id_url
    response = requests.get(temp_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')

try:
    table = soup.find('table')
    # Only the data rows (in tbody) carry the data-* attributes.
    for tr in table.select('tbody tr'):
        flight_data = {}
        flight_data['flight_number'] = tr['data-flight-number']
        flight_data['from'] = tr['data-name-from']
        print(flight_data)

except AttributeError as e:
    raise ValueError("No valid table found")

P.P.S.: Thanks to @amow for the great help :D

1 Answer:

Answer 0 (score: 4):

Starting from the table element in the HTML:

heads = [i.text.strip() for i in table.select('thead th')]
for tr in table.select('tbody tr'):
    datas = [i.text.strip() for i in tr.select('td')]
    print(dict(zip(heads, datas)))

Output:

{
    'STD': '06:30',
    'Status': 'Scheduled',
    'ATD': '-',
    'From': 'Singapore  (SIN)',
    'STA': '07:55',
    '\xa0': '',  # this is the last column, which has no meaning
    'To': 'Penang  (PEN)',
    'Aircraft': '-',
    'Date': '2015-04-19'
}

If you want to get the data stored in the tr tag's attributes, just use

tr['data-data'] tr['data-flight-number']

and so on.
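The answer's two techniques, zipping the thead texts against the td texts and reading data-* attributes straight off the tr tag, can be sketched against a small inline table. The markup and values below are invented for illustration and are not FlightRadar24's real page structure:

```python
import bs4

# Illustrative HTML only; the real FlightRadar24 markup may differ.
html = """
<table>
  <thead><tr><th>From</th><th>To</th><th>Status</th></tr></thead>
  <tbody>
    <tr data-flight-number="TR123" data-name-from="Singapore">
      <td>Singapore (SIN)</td><td>Penang (PEN)</td><td>Scheduled</td>
    </tr>
  </tbody>
</table>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')
table = soup.find('table')

# Column names from the header row.
heads = [th.text.strip() for th in table.select('thead th')]

for tr in table.select('tbody tr'):
    # Cell texts, zipped against the header names.
    cells = [td.text.strip() for td in tr.select('td')]
    row = dict(zip(heads, cells))
    # data-* attributes are read like dictionary keys on the tag.
    row['flight_number'] = tr['data-flight-number']
    print(row)
```

The key point is that `tr.select('td')` returns tag objects, so you still need `.text` to get the cell contents, whereas `tr['data-flight-number']` returns the attribute value as a string directly.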