Question

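OK, so I'm trying to make a script (for my own amusement) that looks at the results of a Kayak.co.uk query and outputs them using Python. I'm using urllib to fetch the content of the results page (example = https://www.kayak.co.uk/flights/DUB-LAX/2018-06-04/2018-06-25/2adults?sort=bestflight_a). However, I need a regular expression to find the prices in £. I haven't tried much (I'm not very good with regular expressions). Also, does urllib retrieve the JS as well as the HTML? I know some of the information I need is contained in JS. Any help would be appreciated.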

This is what I have so far:

import re
import urllib.request


def urlRead(url):
    """Gets and returns the content of the chosen URL"""
    webpage = urllib.request.urlopen(url)
    page_contents = webpage.read()
    return page_contents


def getPrices(content):
    # For now this only matches the literal string '£435'
    content = re.findall(r'£435', content.decode())
    print(content)


def main():
    url = input('Please enter in the kayak url!: ')
    content = urlRead(url)
    getPrices(content)


if __name__ == '__main__':
    main()

Answer 1

As @Mr Lister noted, you should not try to parse HTML with regular expressions if you can avoid it. Beautiful Soup is an HTML parsing library that can help you do what you need:

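import urllib.request
from bs4 import BeautifulSoup

# urllib2 is Python 2 only; urllib.request matches the Python 3 code in the question
response = urllib.request.urlopen('https://www.google.com/finance?q=NYSE%3AAAPL')
html = response.read()
soup = BeautifulSoup(html, "lxml")
# Both values are read from the element with id 'price-panel'
aaplPrice = soup.find(id='price-panel').div.span.span.text
aaplVar = soup.find(id='price-panel').div.div.span.find_all('span')[1].string.split('(')[1].split(')')[0]
aapl = aaplPrice + ' ' + aaplVar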
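
For the £ prices in the question, the same idea (parse first, then match on the extracted text) could look roughly like the sketch below. It uses the standard library's html.parser in place of Beautiful Soup so that it runs without extra installs. The price pattern, the names TextCollector and extract_prices, and the sample markup are all assumptions for illustration, not Kayak's real page structure. Note also that urllib only returns the raw response body: any JavaScript arrives as text and is never executed, so prices that a script fills in after page load will not appear in it.

```python
import re
from html.parser import HTMLParser

# Assumed price format: "£", digits with optional thousands
# separators, and an optional two-digit pence part.
PRICE_RE = re.compile(r'£\s?\d+(?:,\d{3})*(?:\.\d{2})?')


class TextCollector(HTMLParser):
    """Collects text nodes, skipping the bodies of script and style tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_prices(html):
    """Returns every price-like string found in the visible text of html."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return PRICE_RE.findall(' '.join(collector.chunks))


# Hypothetical markup, made up for this example.
SAMPLE_HTML = (
    '<div class="result"><span>£435</span><span>£1,204.50</span>'
    '<script>var p = "£999";</script></div>'
)

if __name__ == '__main__':
    print(extract_prices(SAMPLE_HTML))
```

In the question's script, getPrices could call extract_prices(content.decode()) instead of searching for the literal '£435'.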
