How to extract data from tags using Beautiful Soup

Time: 2017-08-06 00:08:11

Tags: python beautifulsoup

I am trying to retrieve data from a website. My code is as follows:

import re
from urllib2 import urlopen
from bs4 import BeautifulSoup

# gets a file-like object using urllib2.urlopen
url = 'http://ecal.forexpros.com/e_cal.php?duration=weekly'
html = urlopen(url)

soup = BeautifulSoup(html)

# loops over all <tr> elements with class 'ec_bg1_tr' or 'ec_bg2_tr'
for tr in soup.find_all('tr', {'class': re.compile('ec_bg[12]_tr')}):
    # finds desired data by looking up <td> elements with class names

    event = tr.find('td', {'class': 'ec_td_event'}).text
    currency = tr.find('td', {'class': 'ec_td_currency'}).text
    actual = tr.find('td', {'class': 'ec_td_actual'}).text
    forecast = tr.find('td', {'class': 'ec_td_forecast'}).text
    previous = tr.find('td', {'class': 'ec_td_previous'}).text
    time = tr.find('td', {'class': 'ec_td_time'}).text
    importance = tr.find('td', {'class': 'ec_td_importance'}).img.get('alt')

    # the returned strings are unicode, so to print them we need to use a unicode string
    if importance == 'High':
        print(u'\t{:5}\t{}\t{:3}\t{:40}\t{:8}\t{:8}\t{:8}'.format(time, importance, currency, event, actual, forecast, previous))

The first few records of the result set look like this:

05:00   High    EUR CPI (YoY)                                   1.3%        1.3%        1.3%    
10:00   High    USD Pending Home Sales (MoM)                    1.5%        0.7%        -0.7%   
21:45   High    CNY Caixin Manufacturing PMI                    51.1        50.4        50.4    
00:30   High    AUD RBA Interest Rate Decision                  1.50%       1.50%       1.50%   
00:30   High    AUD RBA Rate Statement                                                          
03:55   High    EUR German Manufacturing PMI                    58.1        58.3        58.3    
03:55   High    EUR German Unemployment Change                  -9K         -5K         6K      

I am now trying to retrieve similar data from the following website:

https://www.fxstreet.com/economic-calendar

To do so, I modified the code above as follows:

import re
from urllib2 import urlopen
from bs4 import BeautifulSoup

# gets a file-like object using urllib2.urlopen
url = 'https://www.fxstreet.com/economic-calendar'
html = urlopen(url)

soup = BeautifulSoup(html)


for tr in soup.find_all('tr', {'class': re.compile('fxst-tr-event fxst-oddRow  fxit-eventrow fxst-evenRow ')}):
    # finds desired data by looking up <div> elements with class names

    event = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text
    currency = tr.find('div', {'class': 'fxit-event-name'}).text
    actual = tr.find('div', {'class': ' fxit-actual'}).text
    forecast = tr.find('div', {'class': 'fxit-consensus'}).text
    previous = tr.find('div', {'class': 'fxst-td-previous fxit-previous'}).text
    time = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text
#    importance = tr.find('td', {'class': 'ec_td_importance'}).img.get('alt')

    # the returned strings are unicode, so to print them we need to use a unicode string
    if importance == 'High':
        print(u'\t{:5}\t{:3}\t{:40}\t{:8}\t{:8}\t{:8}'.format(time, currency, event, actual, forecast, previous))

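This code does not return any results (probably because I am referencing the wrong tags and/or classes). Can anyone see where my mistake is?

Thanks!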

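1 answer:

Answer 0 (score: 1)

You should use selenium + Chromedriver / PhantomJS to parse content that is created dynamically by JavaScript; urllib2 cannot handle this. I don't think using a regex makes much sense here: you can use the lxml parser to allow multiple classes and pass them in a list. Here is an example using the tools mentioned above: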

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.fxstreet.com/economic-calendar'

driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source
driver.quit()  # close the browser once the rendered HTML has been captured
soup = BeautifulSoup(html, 'lxml')

for tr in soup.findAll('tr',{'class':['fxst-tr-event', 'fxst-oddRow', 'fxit-eventrow', 'fxst-evenRow', 'fxs_cal_nextEvent']}):
    event = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text
    currency = tr.find('div', {'class': 'fxit-event-name'}).text
    actual = tr.find('div', {'class': 'fxit-actual'}).text
    forecast = tr.find('div', {'class': 'fxit-consensus'}).text
    previous = tr.find('div', {'class': 'fxst-td-previous fxit-previous'}).text
    time = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text

    print(time, currency, event, actual, forecast, previous)

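Note that lxml is a separate library in its own right. You can also handle multiple classes with the standard html.parser, but in my opinion it is less intuitive. This code prints: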

14:00 
CAD                                     14:00 None 59.2 
61.6                                    
14:00 
CAD                                     14:00 52.9  
63.9                                    
17:00 
USD                                     17:00 765 
...
...
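
On the point about multiple classes, here is a minimal sketch of how Beautiful Soup compares a class filter against a tag, which may explain why the space-separated strings in the question found nothing. The markup below is a hypothetical, simplified stand-in for the real page (the actual structure may differ), the event names are placeholders, and `cell_text` is a helper name made up for this illustration. It uses html.parser only so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page's structure may differ.
SAMPLE_HTML = """
<table>
  <tr class="fxst-tr-event fxst-oddRow fxit-eventrow">
    <td><div class="fxit-eventInfo-time fxs_event_time">14:00</div></td>
    <td><div class="fxit-event-name">Event A</div></td>
  </tr>
  <tr class="fxst-tr-event fxst-evenRow fxit-eventrow">
    <td><div class="fxit-eventInfo-time fxs_event_time">17:00</div></td>
    <td><div class="fxit-event-name">Event B</div></td>
  </tr>
  <tr class="fxst-dateRow"><td>placeholder date row</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A list of class names matches a tag that carries ANY one of them.
any_match = soup.find_all("tr", {"class": ["fxst-oddRow", "fxst-evenRow"]})

# A CSS selector with chained classes matches only tags carrying ALL of them.
all_match = soup.select("tr.fxst-tr-event.fxit-eventrow")

# A space-separated string is compared with the exact attribute value,
# so the same classes in a different order (or a subset) find nothing.
reordered = soup.find_all("tr", {"class": "fxst-oddRow fxst-tr-event"})


def cell_text(row, css_class):
    """Return the stripped text of the first <div> with css_class, or '' if absent."""
    cell = row.find("div", class_=css_class)
    return cell.get_text(strip=True) if cell is not None else ""


rows = [(cell_text(tr, "fxs_event_time"), cell_text(tr, "fxit-event-name"))
        for tr in all_match]
print(len(any_match), len(all_match), len(reordered))
print(rows)
```

Searching on a single class such as `fxs_event_time` avoids depending on the order of the classes in the attribute, and the `None` check avoids an `AttributeError` when a row lacks the expected cell.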

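I did not change any of the variables because I am not sure what you want them to be, so you will probably want to adjust the code further and format the output.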
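
As a starting point for that formatting, here is a small sketch that collapses the stray whitespace and newlines seen in the output above and reuses the column layout from the question. `format_row` is a hypothetical helper name, and the sample values are only illustrative.

```python
def format_row(time, currency, event, actual, forecast, previous):
    """Collapse whitespace in each field and lay the fields out in fixed-width columns."""
    fields = (time, currency, event, actual, forecast, previous)
    # " ".join(s.split()) drops newlines and runs of spaces; None becomes ''.
    cleaned = [" ".join((value or "").split()) for value in fields]
    return u"{:5}\t{:3}\t{:40}\t{:8}\t{:8}\t{:8}".format(*cleaned)


# Illustrative values only, shaped like the messy strings printed above.
result = format_row(" 14:00\n", "CAD   ", "Event A", None, "59.2 ", "\n61.6")
print(result)
```

Calling this inside the loop in place of the bare `print(...)` should give one tab-separated line per event, assuming every field was found.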