RegEx to capture HTML text with Python

Time: 2019-05-25 07:14:52

Tags: python html regex

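I am trying to use RegEx to pull paragraphs of text from a website into a Python list, but for this particular site I am struggling to write a RegEx that captures every event. Can anyone help me collect all instances? Or at least tell me if this is not feasible, and I will find an alternative website.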

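from re import findall
from urllib.request import urlopen

## Create Empty List
EventInfoListBEC = []

## Assign Website to a Variable
WebsiteBEC = 'https://www.brisent.com.au/Event-Calendar'

## Download the Page Source (the search must run on the HTML, not on the URL string)
## Assumes the page is UTF-8 encoded
SourceBEC = urlopen(WebsiteBEC).read().decode('utf-8', errors='replace')

## Search for Event Info
EventInfoBEC = findall('<p class="event-description">(.+?)</p>', SourceBEC)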

## Add Event Info to Event Info List and Print Details
print('Event Info appears', len(EventInfoBEC), 'times (BEC).')
for EventInfo in EventInfoBEC:
    EventInfoListBEC.append(EventInfo)
print(EventInfoListBEC)

## There are Three Styles of Input from the HTML File
# One
<p class="event-description"><p>This is a sport where 8 seconds can cost you everything. Welcome to the world of the PBR.</p>

</p>

# Two
<p class="event-description"><p style="text-align: justify; color: rgb(0, 0, 0); font-family: sans-serif; font-size: 12px;">Fresh off the back of winning a Brit Award for &lsquo;British Artist Video of the Year&rsquo; for &lsquo;Woman Like Me&rsquo;, and two Global Awards for &lsquo;Best Group&rsquo; and &lsquo;Best Song&rsquo;; pop superstars Little Mix today announce that five new Australian shows have been added to &#39;LM5 - The Tour&#39; for 2019!</p>

</p>

# Three
<p class="event-description"><p style="font-family: sans-serif; font-size: 12px; color: rgb(0, 0, 0); text-align: center;"><strong>OPENING NIGHT PERFORMANCE ADDED!</strong></p>



<p style="font-family: sans-serif; font-size: 12px; color: #000000; text-align: justify;">The world&rsquo;s most beloved movie-musical comes to life on the arena stage&nbsp;like you&rsquo;ve never seen it before! From the producers of GREASE - THE ARENA EXPERIENCE comes this lavish new arena production of THE WIZARD OF OZ.</p>

1 Answer:

Answer 0 (score: 0)

As many have pointed out, there are better tools for this than regular expressions: I like to use lxml (lxml.html), but bs4 works as well.
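As an illustration of the parser-based route, here is a minimal sketch. It uses the standard library's html.parser rather than lxml or bs4, purely so that it runs without extra installs; the class name EventDescriptionParser, the helper extract_descriptions, and the sample string are made up for this example. It assumes that the closing </p> of each event-description block comes after its inner paragraphs, as in styles One and Two of the question.

```python
from html.parser import HTMLParser


class EventDescriptionParser(HTMLParser):
    """Collect the text inside every <p class="event-description"> block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &rsquo;, &nbsp;, ...
        self.depth = 0          # <p> nesting depth inside the current block
        self.chunks = []        # text pieces of the block being read
        self.descriptions = []  # finished descriptions

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return
        if self.depth:
            self.depth += 1     # a nested <p> inside the block
        elif "event-description" in (dict(attrs).get("class") or "").split():
            self.depth = 1      # a new block starts here
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1
            if self.depth == 0:  # the block is closed: store its text
                text = " ".join("".join(self.chunks).split())
                self.descriptions.append(text)

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_descriptions(source):
    """Return the plain text of each event-description block in source."""
    parser = EventDescriptionParser()
    parser.feed(source)
    parser.close()
    return parser.descriptions


# Hypothetical sample modelled on styles One and Three of the question.
sample = """<p class="event-description"><p>This is a sport where 8 seconds can cost you everything. Welcome to the world of the PBR.</p>

</p>
<p class="event-description"><p style="text-align: center;"><strong>OPENING NIGHT PERFORMANCE ADDED!</strong></p>

<p style="text-align: justify;">The world&rsquo;s most beloved movie-musical comes to life&nbsp;like you&rsquo;ve never seen it before!</p>
</p>"""

descriptions = extract_descriptions(sample)
for description in descriptions:
    print(description)
```

Because the parser only reacts to tag events, the <strong> markup and the HTML entities are already gone from the collected text, so no extra clean-up pass is needed with this approach.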

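Anyway, here is a solution that uses the regex module (unlike in re, lookbehinds in this module may have variable length). The solution relies on the regular expression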

(?<=class="event-description"><p[\w\s\#\;\(\)\"\=\:\-\,]*>).*(?=</p>)

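which captures the content of the paragraph that immediately follows each event-description opening tag. The custom character class [\w\s\#\;\(\)\"\=\:\-\,] contains every character used in the style attributes. Finally, the star * lets the pattern match a paragraph with no style attribute as well.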
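For readers who cannot install regex: the standard re module rejects this pattern because its lookbehind is not fixed-width, but a capturing group can stand in for the lookbehind. The following is only a sketch of that substitution, not the method used in this answer; the sample string is abbreviated from the question and the variable names are illustrative.

```python
import re

# Character class copied from the pattern above.
CHAR_CLASS = r'[\w\s\#\;\(\)\"\=\:\-\,]'

# The original lookbehind is variable-width, so re refuses to compile it.
try:
    re.compile(r'(?<=class="event-description"><p' + CHAR_CLASS + r'*>).*(?=</p>)')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Same idea with a capturing group: findall returns only the group's content.
pattern = re.compile(r'class="event-description"><p' + CHAR_CLASS + r'*>(.*)</p>')

# Abbreviated sample based on styles One and Three of the question.
sample = (
    '<p class="event-description"><p>This is a sport where 8 seconds can cost you everything.</p>\n'
    '\n'
    '</p>\n'
    '<p class="event-description"><p style="font-family: sans-serif; font-size: 12px; '
    'color: rgb(0, 0, 0); text-align: center;"><strong>OPENING NIGHT PERFORMANCE ADDED!</strong></p>\n'
    '\n'
    '<p style="font-family: sans-serif; font-size: 12px; color: #000000; '
    'text-align: justify;">From the producers of GREASE.</p>\n'
)

captured = pattern.findall(sample)
print(lookbehind_ok)
print(captured)
```

As with the lookbehind version, the paragraph on the last line is skipped because it is not directly preceded by the event-description opening tag.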

import regex     # third-party module: pip install regex
import requests  # third-party module: pip install requests

# Assign Website to a Variable
WebsiteBEC = 'https://www.brisent.com.au/Event-Calendar'

# Get source code
req = requests.get(WebsiteBEC, timeout=5)
source_code = req.text

# Extract data
EventInfoBEC = regex.findall(r'(?<=class="event-description"><p[\w\s\#\;\(\)\"\=\:\-\,]*>).*(?=</p>)', source_code)
# ['This is a sport where 8 seconds can cost you everything. Welcome to the world of the PBR.',
#  'See fearless Moana with demigod Maui, follow Dory through the Pacific Ocean, join the Toy Story pals on an exciting adventure and discover true love with Elsa and Anna. Buckle in for the emotional rollercoaster of Inside Out and &ldquo;Live Your Story&rdquo; alongside Disney Princesses as they celebrate their favourite Disney memories!',
#  'Fresh off the back of winning a Brit Award for &lsquo;British Artist Video of the Year&rsquo; for &lsquo;Woman Like Me&rsquo;, and two Global Awards for &lsquo;Best Group&rsquo; and &lsquo;Best Song&rsquo;; pop superstars Little Mix today announce that five new Australian shows have been added to &#39;LM5 - The Tour&#39; for 2019!',
#  '<strong>OPENING NIGHT PERFORMANCE ADDED!</strong>',
#  '<strong>THIRD SHOW ANNOUNCED - ON SALE FROM 2PM FRI 1 FEB!</strong>',
#  '<strong>COMING TO AUSTRALIA FOR THE VERY FIRST TIME.&nbsp;</strong>',
#  'WWE LIVE is returning to Australia!&nbsp;Fans will be able to see their favorite WWE Superstars for the first time since last year&rsquo;s incredible Super Show-Down',
#  '<strong>SHAWN MENDES ANNOUNCES RUEL AS SPECIAL GUEST + ADDITIONAL TICKETS AVAILABLE FOR ALL SHOWS!</strong>',
#  'Steve Martin and Martin Short will bring their critically acclaimed comedy tour Now You See Them, Soon You Won&rsquo;t for the first time to Australian audiences in November.&nbsp;',
#  'After an epic and storied 45-year career that launched an era of rock n roll legends, KISS announced that they will launch their final tour ever in 2019, appropriately named END OF THE ROAD.',
#  '<strong>ELTON JOHN ANNOUNCES 3RD BRISBANE SHOW!</strong>']

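The results still need to be processed to get rid of the <strong> tags. Also, the last line of the source code provided above does not belong to the event-description class, so the regex does not capture it.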
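One possible way to do that post-processing is sketched below with the standard library only. The helper name clean and the raw list (shortened items from the output above) are assumptions made for this example; the sketch also decodes HTML entities such as &nbsp; and &#39;, which the answer does not mention explicitly.

```python
import html
import re

TAG = re.compile(r'<[^>]+>')


def clean(fragment):
    """Strip tags, decode HTML entities, and normalise whitespace."""
    text = TAG.sub('', fragment)   # drop <strong>, </strong>, and any other tag
    text = html.unescape(text)     # decode &rsquo;, &#39;, &nbsp;, ...
    return ' '.join(text.split())  # collapse whitespace, including U+00A0


# Shortened items taken from the result list above.
raw = [
    '<strong>COMING TO AUSTRALIA FOR THE VERY FIRST TIME.&nbsp;</strong>',
    'WWE LIVE is returning to Australia!&nbsp;Fans will be able to see their favorite WWE Superstars',
    'added to &#39;LM5 - The Tour&#39; for 2019!',
]

cleaned = [clean(item) for item in raw]
print(cleaned)
```

Tags are removed before the entities are decoded so that an escaped angle bracket in the text (for example &lt;) is not mistaken for markup.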