Scraping information from a library catalog

Time: 2018-05-20 04:25:12

Tags: python beautifulsoup screen-scraping

I'm working on a project to pull catalog information for books from a specific library. So far my script can scrape all of the cells from the table. However, I'm stuck on how to return only the cells for the New Britain library.

import requests
from bs4 import BeautifulSoup

mypage = 'http://lci-mt.iii.com/iii/encore/record/C__Rb1872125__S%28*%29%20f%3Aa%20c%3A47__P0%2C3__Orightresult__U__X6?lang=eng&suite=cobalt'
response = requests.get(mypage)

soup = BeautifulSoup(response.text, 'html.parser')

data = []
table = soup.find('table', attrs={'class':'itemTable'})


rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values

for index, libraryinfo in enumerate(data):
    print(index, libraryinfo)

Here is sample output from the script for the New Britain library:

["New Britain, Main Library - Children's Department", 'J FIC PALACIO', 'Check Shelf']

Instead of returning all of the cells, how can I return only the cells related to the New Britain library? I only want the library name and the checkout status.

The desired output is:

["New Britain, Main Library - Children's Department", 'Check Shelf']

There can be multiple cells, since a book can have multiple copies at the same library.

3 Answers:

Answer 0 (score: 2)

To simply filter the data on a specific field (the first one, in your example), you can build a comprehension:

[element for element in data if 'New Britain' in element[0]]
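
For example, applied to rows shaped like the one in the question, the comprehension keeps only entries whose first field mentions New Britain. A minimal sketch with made-up sample data (the first row is invented for contrast):

# Hypothetical rows shaped like the question's output; the first one is invented.
data = [
    ["Avon Free Public Library - Children's Department", 'J PALACIO', 'Check Shelf'],
    ["New Britain, Main Library - Children's Department", 'J FIC PALACIO', 'Check Shelf'],
]
print([element for element in data if 'New Britain' in element[0]])
# [["New Britain, Main Library - Children's Department", 'J FIC PALACIO', 'Check Shelf']]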

The example you provided drops empty values, which leaves the data elements with different lengths. That makes it harder to know which field corresponds to which component of each row. Using dicts instead, we can make the data easier to understand and easier to work with.
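
To make that concrete, a single row keyed by the table headers might look roughly like this (a sketch: 'Location' and 'Status' are the header names relied on below, while the middle header name is only a guess):

# Rough shape of one row as a dict; 'Call No.' is an assumed header name,
# only 'Location' and 'Status' are actually used by the code below.
row = {'Location': "New Britain, Main Library - Children's Department",
       'Call No.': 'J FIC PALACIO',
       'Status': 'Check Shelf'}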

Some of these fields seem to contain blocks of whitespace in the middle (made up only of whitespace-like characters such as '\n', '\r', '\t', and ' '), so stripping alone won't remove them. Combining it with a simple regular expression helps clean that up. I wrote a small function to do just that:

def squish(s):
    return re.sub(r'\s+', ' ', s)
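
For example, a quick illustration of what it does to embedded runs of whitespace:

# Runs of whitespace (newlines, tabs, repeated spaces) collapse to a single space.
print(squish("New Britain, Main Library \n\t - Children's Department"))
# New Britain, Main Library - Children's Department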

Putting it all together, I believe this will help you:

import re

import requests
from bs4 import BeautifulSoup


def squish(s):
    return re.sub(r'\s+', ' ', s)


def filter_by_location(data, location_name):
    return [x for x in data if location_name.lower() in x['Location'].lower()]


mypage = 'http://lci-mt.iii.com/iii/encore/record/C__Rb1872125__S%28*%29%20f%3Aa%20c%3A47__P0%2C3__Orightresult__U__X6?lang=eng&suite=cobalt'
response = requests.get(mypage)

soup = BeautifulSoup(response.text, 'html.parser')

data = []
table = soup.find('table', attrs={'class':'itemTable'})

# The first row contains the <th> header cells; use their text as dict keys.
headers = [squish(element.text.strip()) for element in table.find('tr').find_all('th')]

# The remaining rows are data rows; pair each cell with its header.
for row in table.find_all('tr')[1:]:
    cols = [squish(element.text.strip()) for element in row.find_all('td')]
    data.append({k: v for k, v in zip(headers, cols)})

filtered_data = filter_by_location(data, 'New Britain')
for x in filtered_data:
    print('Location: {}'.format(x['Location']))
    print('Status: {}'.format(x['Status']))
    print()

Running it, I get the following result:

Location: New Britain, Jefferson Branch - Children's Department
Status: Check Shelf

Location: New Britain, Main Library - Children's Department
Status: Check Shelf

Location: New Britain, Main Library - Children's Department
Status: Check Shelf

Answer 1 (score: 0)

Filtering out the rows unrelated to New Britain only requires checking whether the first element of cols (i.e. cols[0]) contains the library's name.

Getting just the library name and checkout status is simple: you only need the first and third elements of cols (i.e. [cols[0], cols[2]]), which hold the library name and the checkout status, respectively.

You can try replacing data.append([ele for ele in cols if ele]) with the following:

# Skip rows that have no <td> cells (e.g. the header row).
if len(cols) == 0:
    continue

if 'New Britain' in cols[0]:
    data.append([cols[0], cols[2]])

Your code would then look like this:

import requests
from bs4 import BeautifulSoup

mypage = 'http://lci-mt.iii.com/iii/encore/record/C__Rb1872125__S%28*%29%20f%3Aa%20c%3A47__P0%2C3__Orightresult__U__X6?lang=eng&suite=cobalt'
response = requests.get(mypage)

soup = BeautifulSoup(response.text, 'html.parser')

data = []
table = soup.find('table', attrs={'class':'itemTable'})

rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]

    if len(cols) == 0:
        continue

    if 'New Britain' in cols[0]:
        data.append([cols[0], cols[2]])

for index, libraryinfo in enumerate(data):
    print(index, libraryinfo)

Output:

0 ["New Britain, Jefferson Branch - Children's Department", 'Check Shelf']
1 ["New Britain, Main Library - Children's Department", 'Check Shelf']
2 ["New Britain, Main Library - Children's Department", 'Check Shelf']

Answer 2 (score: 0)

Try this to get the output you want:

import requests
from bs4 import BeautifulSoup

URL = "http://lci-mt.iii.com/iii/encore/record/C__Rb1872125__S%28*%29%20f%3Aa%20c%3A47__P0%2C3__Orightresult__U__X6?lang=eng&suite=cobalt"

res = requests.get(URL)
soup = BeautifulSoup(res.text,"lxml")
for items in soup.find("table",class_="itemTable").find_all("tr"):
    if "New Britain" in items.text:
        data = items.find_all("td")
        name = data[0].a.get_text(strip=True)    # library name sits inside a link in the first cell
        status = data[2].get_text(strip=True)    # availability status is in the third cell
        print(name,status)
        print(name,status)

Output:

New Britain, Jefferson Branch - Children's Department Check Shelf
New Britain, Main Library - Children's Department Check Shelf
New Britain, Main Library - Children's Department Check Shelf