Scraping and parsing a <script> tag with BS4 (or is there a better way?)

Time: 2019-06-10 16:56:47

Tags: python web-scraping

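I am trying to scrape a list of breweries, including their latitude and longitude, from https://www.brewbound.com/breweries. This is what the part of the page source I am interested in looks like: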

    <script>
var locations = [['Wolf Pack Brewing Company', 44.6620529, -111.0994608, '/breweries/Wolf_Pack_Brewing_Co'],['Defiant Brewing Company', 41.0584046, -74.022847, '/breweries/Defiant_Brewing_Co'],

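and so on for the rest of the brewery list. Each brewery is listed between [] as name, lat, long and website, in that order. What I want to do is scrape var locations and build a DataFrame from it, with one row per brewery containing the listed information.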

I have been able to scrape everything inside the <script> tags (the page contains several of them). I am not sure where to go from there.

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.brewbound.com/breweries"
    r = requests.get(url)
    html_contents = r.text
    html_soup = BeautifulSoup(html_contents, 'html.parser')
    script = html_soup.find_all('script')

This is the code I wrote to get all the <script> tags.

1 Answer:

Answer 0: (score: 0)

BeautifulSoup won't help you parse the contents of the <script> tag. However, you can extract the information with re and ast.literal_eval:

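import re
import requests
from ast import literal_eval
from pprint import pprint

url = "https://www.brewbound.com/breweries"
r = requests.get(url)

# Capture the array assigned to `var locations`, up to the first "];".
match = re.search(r'var locations = (\[.*?\]);', r.text, flags=re.DOTALL)
# The JavaScript array literal is also a valid Python literal, so it can be evaluated safely.
locations = literal_eval(match.group(1))
pprint(locations)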

Prints:

[['Wolf Pack Brewing Company',
  44.6620529,
  -111.0994608,
  '/breweries/Wolf_Pack_Brewing_Co'],
 ['Defiant Brewing Company',
  41.0584046,
  -74.022847,
  '/breweries/Defiant_Brewing_Co'],
 ['El Toro Brewing Company',
  37.1465525,
  -121.6219873,
  '/breweries/El_Toro_Brewing_Co'],
 ['Sebago Brewing Company',
  43.679212,
  -70.396424,
  '/breweries/Sebago_Brewing_Co'],

...etc.
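
Since the question asks for one row per brewery, here is a minimal sketch of turning the parsed list into row records. It runs offline on a hypothetical `sample_html` string copied from the snippet in the question; the closing `];` in that sample is an assumption, as are the column names and the helper name `parse_locations`. Note that `literal_eval` only works as long as the JavaScript array is also a valid Python literal (it would fail on values such as `null` or `true`).

```python
import re
from ast import literal_eval

# Hypothetical stand-in for r.text, copied from the snippet in the question.
# The closing "];" is assumed; the question's excerpt is truncated.
sample_html = """
<script>
var locations = [['Wolf Pack Brewing Company', 44.6620529, -111.0994608, '/breweries/Wolf_Pack_Brewing_Co'],['Defiant Brewing Company', 41.0584046, -74.022847, '/breweries/Defiant_Brewing_Co']];
</script>
"""

# Column names are assumptions based on the order described in the question.
COLUMNS = ("name", "lat", "long", "website")


def parse_locations(html):
    """Return the `var locations` array as a list of dicts, one per brewery."""
    match = re.search(r"var locations = (\[.*?\]);", html, flags=re.DOTALL)
    if match is None:
        raise ValueError("no 'var locations' assignment found")
    # Evaluate the captured array as a Python literal (no arbitrary code is run).
    raw = literal_eval(match.group(1))
    # Pair each value with its column name.
    return [dict(zip(COLUMNS, entry)) for entry in raw]


rows = parse_locations(sample_html)
print(rows[0])
```

If pandas is installed, passing `rows` to `pandas.DataFrame(rows)` should give a DataFrame with one row per brewery and the four columns above. On the live page you would call `parse_locations(r.text)` instead of using the sample string.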