使用BeautifulSoup4解析HTML表

时间:2018-03-19 16:52:50

标签: python web-scraping beautifulsoup html-parsing

我是BeautifulSoup的新手并试图提取表格。我已经按照文档来做一个嵌套的for循环来提取单元格数据,但它只返回前三行。这是我的代码:

from six.moves import urllib
from bs4 import BeautifulSoup
import pandas as pd

def get_url_content(url):
    try:
        html=urllib.request.urlopen(url)
    except urllib.error.HTTPError as e:
        return None
    try:
        soup=BeautifulSoup(html.read(),'html.parser')
    except AttributeError as e:
        return None
    return soup

URL="http://www.megamillions.com/winning-numbers/search?startDate=1/1/2017&endDate=3/31/2018"
soup=get_url_content(URL)

for tr in soup.find_all('tr'):
    for td in tr.find_all('td'):
        print td.text

我也尝试在第二个循环之前添加if语句:

if tr.parentGenerator=='tbody':

但它返回空列表。

3 个答案:

答案 0 :(得分:2)

该网站是动态的,这意味着您需要使用浏览器操作工具,例如selenium。然后,为每次搜索从多个类名中提取文本:

import urllib
import re
from bs4 import BeautifulSoup as soup
from selenium import webdriver
def get_table():
   d = webdriver.Chrome('path/to/driver') #or webdriver.Firefox(), depending on your browser
   d.get('http://www.megamillions.com/winning-numbers/search?startDate=1/1/2017&endDate=3/31/2018')
   table = [i.text for i in soup(d.page_source, 'lxml').find_all('td', {'class':re.compile('dates|number|mega|details')})]
   final_table = [table[i:i+9] for i in range(0, len(table), 9)]
   last_data = [dict(zip(['draw_date', 'balls', 'megaball', 'megaplier', 'details'], [a, b, c, d, e])) for a, *b, c, d, e in final_table]
   return last_data

print(get_table())

输出:

[{'draw_date': '12/29/2017', 'balls': ['4', '10', '18', '28', '62'], 'megaball': '7', 'megaplier': '2', 'details': 'Details'}, {'draw_date': '12/26/2017', 'balls': ['10', '12', '20', '38', '41'], 'megaball': '25', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '12/22/2017', 'balls': ['1', '20', '30', '33', '42'], 'megaball': '16', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '12/19/2017', 'balls': ['28', '37', '39', '42', '58'], 'megaball': '2', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '12/15/2017', 'balls': ['4', '12', '36', '44', '57'], 'megaball': '19', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '12/12/2017', 'balls': ['8', '23', '24', '25', '27'], 'megaball': '9', 'megaplier': '2', 'details': 'Details'}, {'draw_date': '12/8/2017', 'balls': ['6', '37', '46', '60', '70'], 'megaball': '24', 'megaplier': '2', 'details': 'Details'}, {'draw_date': '12/5/2017', 'balls': ['14', '15', '37', '42', '67'], 'megaball': '22', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '12/1/2017', 'balls': ['16', '22', '40', '41', '59'], 'megaball': '8', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '11/28/2017', 'balls': ['10', '17', '47', '51', '61'], 'megaball': '5', 'megaplier': '2', 'details': 'Details'}, {'draw_date': '11/24/2017', 'balls': ['16', '36', '54', '61', '64'], 'megaball': '22', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '11/21/2017', 'balls': ['3', '7', '22', '27', '50'], 'megaball': '3', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '11/17/2017', 'balls': ['3', '26', '55', '58', '70'], 'megaball': '15', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '11/14/2017', 'balls': ['1', '14', '21', '22', '28'], 'megaball': '19', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '11/10/2017', 'balls': ['6', '23', '38', '42', '58'], 'megaball': '24', 'megaplier': '2', 'details': 'Details'}, {'draw_date': '11/7/2017', 'balls': ['1', '54', '60', '68', '69'], 'megaball': '11', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '11/3/2017', 'balls': ['10', '22', '42', '61', '69'], 'megaball': '3', 'megaplier': '2', 'details': 'Details'}, {'draw_date': '10/31/2017', 'balls': ['6', '28', '31', '52', '53'], 'megaball': '12', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '10/27/2017', 'balls': ['17', '27', '41', '51', '52'], 'megaball': '13', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '10/24/2017', 'balls': ['20', '24', '34', '56', '64'], 'megaball': '6', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '10/20/2017', 'balls': ['6', '23', '63', '66', '73'], 'megaball': '9', 'megaplier': '2', 'details': 'Details'}, {'draw_date': '10/17/2017', 'balls': ['31', '45', '49', '56', '70'], 'megaball': '11', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '10/13/2017', 'balls': ['2', '7', '18', '26', '31'], 'megaball': '12', 'megaplier': '2', 'details': 'Details'}, {'draw_date': '10/10/2017', 'balls': ['7', '16', '24', '61', '62'], 'megaball': '2', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '10/6/2017', 'balls': ['21', '33', '36', '45', '56'], 'megaball': '12', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '10/3/2017', 'balls': ['12', '18', '19', '25', '67'], 'megaball': '7', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '9/29/2017', 'balls': ['25', '51', '62', '73', '74'], 'megaball': '7', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '9/26/2017', 'balls': ['1', '10', '57', '66', '75'], 'megaball': '4', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '9/22/2017', 'balls': ['5', '39', '54', '63', '66'], 'megaball': '15', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '9/19/2017', 'balls': ['9', '28', '31', '50', '61'], 'megaball': '10', 'megaplier': '2', 'details': 'Details'}, {'draw_date': '9/15/2017', 'balls': ['18', '24', '34', '38', '58'], 'megaball': '3', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '9/12/2017', 'balls': ['26', '37', '41', '54', '65'], 'megaball': '3', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '9/8/2017', 'balls': ['4', '5', '14', '26', '73'], 'megaball': '14', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '9/5/2017', 'balls': ['11', '17', '59', '70', '72'], 'megaball': '1', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '9/1/2017', 'balls': ['4', '13', '31', '50', '64'], 'megaball': '12', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '8/29/2017', 'balls': ['2', '13', '17', '35', '73'], 'megaball': '3', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '8/25/2017', 'balls': ['17', '38', '42', '51', '65'], 'megaball': '11', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '8/22/2017', 'balls': ['24', '35', '46', '50', '51'], 'megaball': '7', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '8/18/2017', 'balls': ['1', '31', '34', '40', '75'], 'megaball': '6', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '8/15/2017', 'balls': ['7', '16', '20', '66', '73'], 'megaball': '7', 'megaplier': '2', 'details': 'Details'}, {'draw_date': '8/11/2017', 'balls': ['23', '33', '53', '56', '58'], 'megaball': '6', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '8/8/2017', 'balls': ['11', '17', '50', '52', '74'], 'megaball': '14', 'megaplier': '2', 'details': 'Details'}, {'draw_date': '8/4/2017', 'balls': ['9', '17', '25', '63', '71'], 'megaball': '4', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '8/1/2017', 'balls': ['20', '22', '52', '57', '73'], 'megaball': '7', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '7/28/2017', 'balls': ['4', '6', '31', '49', '52'], 'megaball': '11', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '7/25/2017', 'balls': ['2', '5', '26', '58', '60'], 'megaball': '6', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '7/21/2017', 'balls': ['18', '31', '36', '50', '74'], 'megaball': '10', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '7/18/2017', 'balls': ['8', '12', '23', '51', '73'], 'megaball': '6', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '7/14/2017', 'balls': ['11', '12', '24', '32', '73'], 'megaball': '1', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '7/11/2017', 'balls': ['7', '18', '24', '55', '74'], 'megaball': '10', 'megaplier': '2', 'details': 'Details'}, {'draw_date': '7/7/2017', 'balls': ['2', '9', '11', '28', '60'], 'megaball': '10', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '7/4/2017', 'balls': ['16', '39', '47', '53', '71'], 'megaball': '15', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '6/30/2017', 'balls': ['10', '38', '51', '55', '64'], 'megaball': '6', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '6/27/2017', 'balls': ['4', '21', '45', '52', '57'], 'megaball': '14', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '6/23/2017', 'balls': ['12', '20', '53', '66', '74'], 'megaball': '11', 'megaplier': '2', 'details': 'Details'}, {'draw_date': '6/20/2017', 'balls': ['2', '15', '41', '49', '63'], 'megaball': '3', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '6/16/2017', 'balls': ['18', '22', '26', '30', '44'], 'megaball': '9', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '6/13/2017', 'balls': ['27', '51', '62', '68', '75'], 'megaball': '8', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '6/9/2017', 'balls': ['3', '16', '28', '33', '37'], 'megaball': '9', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '6/6/2017', 'balls': ['3', '5', '16', '49', '75'], 'megaball': '5', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '6/2/2017', 'balls': ['7', '42', '57', '69', '72'], 'megaball': '10', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '5/30/2017', 'balls': ['5', '20', '32', '37', '67'], 'megaball': '5', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '5/26/2017', 'balls': ['25', '26', '28', '37', '56'], 'megaball': '5', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '5/23/2017', 'balls': ['6', '13', '17', '33', '60'], 'megaball': '14', 'megaplier': '2', 'details': 'Details'}, {'draw_date': '5/19/2017', 'balls': ['1', '4', '5', '24', '30'], 'megaball': '1', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '5/16/2017', 'balls': ['4', '35', '39', '56', '72'], 'megaball': '11', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '5/12/2017', 'balls': ['28', '34', '41', '42', '47'], 'megaball': '13', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '5/9/2017', 'balls': ['6', '29', '45', '69', '73'], 'megaball': '11', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '5/5/2017', 'balls': ['4', '23', '33', '47', '53'], 'megaball': '7', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '5/2/2017', 'balls': ['5', '14', '42', '43', '58'], 'megaball': '1', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '4/28/2017', 'balls': ['6', '13', '18', '20', '31'], 'megaball': '13', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '4/25/2017', 'balls': ['3', '13', '33', '40', '50'], 'megaball': '2', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '4/21/2017', 'balls': ['1', '12', '13', '32', '34'], 'megaball': '10', 'megaplier': '2', 'details': 'Details'}, {'draw_date': '4/18/2017', 'balls': ['8', '29', '30', '43', '64'], 'megaball': '6', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '4/14/2017', 'balls': ['5', '10', '55', '60', '73'], 'megaball': '12', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '4/11/2017', 'balls': ['19', '34', '35', '38', '49'], 'megaball': '8', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '4/7/2017', 'balls': ['30', '33', '43', '47', '69'], 'megaball': '15', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '4/4/2017', 'balls': ['13', '24', '34', '35', '55'], 'megaball': '9', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '3/31/2017', 'balls': ['17', '24', '27', '32', '58'], 'megaball': '10', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '3/28/2017', 'balls': ['30', '33', '35', '37', '46'], 'megaball': '10', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '3/24/2017', 'balls': ['5', '28', '37', '61', '69'], 'megaball': '1', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '3/21/2017', 'balls': ['4', '45', '53', '73', '75'], 'megaball': '7', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '3/17/2017', 'balls': ['11', '27', '31', '58', '60'], 'megaball': '10', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '3/14/2017', 'balls': ['16', '23', '28', '33', '59'], 'megaball': '13', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '3/10/2017', 'balls': ['26', '38', '42', '58', '70'], 'megaball': '5', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '3/7/2017', 'balls': ['3', '30', '45', '53', '68'], 'megaball': '11', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '3/3/2017', 'balls': ['14', '26', '39', '48', '51'], 'megaball': '9', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '2/28/2017', 'balls': ['20', '33', '45', '58', '69'], 'megaball': '4', 'megaplier': '2', 'details': 'Details'}, {'draw_date': '2/24/2017', 'balls': ['12', '29', '33', '42', '68'], 'megaball': '14', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '2/21/2017', 'balls': ['9', '21', '30', '32', '75'], 'megaball': '9', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '2/17/2017', 'balls': ['4', '56', '58', '67', '75'], 'megaball': '8', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '2/14/2017', 'balls': ['7', '11', '33', '60', '68'], 'megaball': '15', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '2/10/2017', 'balls': ['32', '39', '51', '62', '75'], 'megaball': '14', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '2/7/2017', 'balls': ['23', '28', '37', '56', '71'], 'megaball': '12', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '2/3/2017', 'balls': ['3', '6', '29', '30', '64'], 'megaball': '3', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '1/31/2017', 'balls': ['3', '14', '27', '62', '72'], 'megaball': '4', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '1/27/2017', 'balls': ['17', '37', '53', '54', '61'], 'megaball': '8', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '1/24/2017', 'balls': ['8', '42', '54', '63', '67'], 'megaball': '11', 'megaplier': '4', 'details': 'Details'}, {'draw_date': '1/20/2017', 'balls': ['7', '9', '24', '41', '53'], 'megaball': '14', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '1/17/2017', 'balls': ['20', '31', '54', '56', '59'], 'megaball': '3', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '1/13/2017', 'balls': ['10', '44', '58', '74', '75'], 'megaball': '11', 'megaplier': '3', 'details': 'Details'}, {'draw_date': '1/10/2017', 'balls': ['11', '20', '40', '41', '59'], 'megaball': '15', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '1/6/2017', 'balls': ['6', '10', '44', '47', '54'], 'megaball': '6', 'megaplier': '5', 'details': 'Details'}, {'draw_date': '1/3/2017', 'balls': ['14', '16', '23', '49', '53'], 'megaball': '12', 'megaplier': '2', 'details': 'Details'}]

答案 1 :(得分:2)

如果您希望使用能够处理动态生成的项目的新发布的Requests_HTML库,则您不需要任何浏览器模拟器。鉴于此,您可以尝试如下:

import requests_html

URL = "http://www.megamillions.com/winning-numbers/search?startDate=1/1/2017&endDate=3/31/2018"

with requests_html.HTMLSession() as session:
    r = session.get(URL)
    r.html.render(sleep=5)
    table = r.html.find("table#table", first=True)
    for items in table.find("tr")[2:]:
        data = [item.text for item in items.find("th,td")[:-1]]
        print(data)

部分输出:

['Draw Date', 'Balls', 'Mega Ball', 'Megaplier']
['12/29/2017', '4', '10', '18', '28', '62', '7', '2']
['12/26/2017', '10', '12', '20', '38', '41', '25', '4']
['12/22/2017', '1', '20', '30', '33', '42', '16', '4']
['12/19/2017', '28', '37', '39', '42', '58', '2', '3']
['12/15/2017', '4', '12', '36', '44', '57', '19', '4']
['12/12/2017', '8', '23', '24', '25', '27', '9', '2']
['12/8/2017', '6', '37', '46', '60', '70', '24', '2']
['12/5/2017', '14', '15', '37', '42', '67', '22', '4']
['12/1/2017', '16', '22', '40', '41', '59', '8', '4']
['11/28/2017', '10', '17', '47', '51', '61', '5', '2']
['11/24/2017', '16', '36', '54', '61', '64', '22', '3']
['11/21/2017', '3', '7', '22', '27', '50', '3', '3']
['11/17/2017', '3', '26', '55', '58', '70', '15', '4']

答案 2 :(得分:0)

即使表是动态加载的,您也可以使用requests模块获取其内容。转到开发人员工具中XHR标签中的Network。发送到http://www.megamillions.com/Media/Static/winning-numbers/winning-numbers.json的AJAX请求以JSON的形式返回所需的所有数据。

要获取JSON,您可以使用:

import requests

r = requests.get('http://www.megamillions.com/Media/Static/winning-numbers/winning-numbers.json')
data = r.json()

正如您所看到的那样,JSON采用以下格式(从今天开始的日期):

{'nextDraw': {'IsPending': False,
              'JackpotAnnuityAmount': 345000000,
              'JackpotCashAmount': 206500000,
              'MegaBall': 11,
              'Megaplier': 3,
              'NextDrawDate': '2018-03-21T03:00:00',
              'NextJackpotAnnuityAmount': 377000000,
              'NextJackpotCashAmount': 225700000},
 'numbersList': [{'DrawDate': '2018-03-16T00:00:00',
                  'GameName': 'MegaMillions',
                  'MegaBall': 11,
                  'Megaplier': 3,
                  'WhiteBall1': 26,
                  'WhiteBall2': 52,
                  'WhiteBall3': 33,
                  'WhiteBall4': 1,
                  'WhiteBall5': 13},
                  ...
                  ...

您可以通过以下方式从data变量中获取所需内容:

for item in data['numbersList']:
    date = item['DrawDate']
    megaball = item['MegaBall']
    megaplier = item['Megaplier']
    # and similarly other items
    print(date, megaball, megaplier)

部分输出:

2018-03-16T00:00:00 11 3
2018-03-13T00:00:00 17 5
2018-03-09T00:00:00 22 4
2018-03-06T00:00:00 22 5
2018-03-02T00:00:00 8 4
2018-02-27T00:00:00 23 3
2018-02-23T00:00:00 9 4
2018-02-20T00:00:00 14 3
...
...
相关问题