Parsing multiple tables on a Wiki page

Date: 2015-12-09 15:00:54

Tags: python python-2.7 csv beautifulsoup

Currently I am trying to parse all the tables on this Wiki page. However, as you can tell from my code, I am only retrieving one table. I would like to grab all the tables and place them in the appropriate columns/rows.

Below is my code; I am a bit lost on what I need to do next.

import csv
import urllib 
import requests
import codecs
import re
from bs4 import BeautifulSoup

url = \
    'https://en.wikipedia.org/wiki/List_of_school_shootings_in_the_United_States'

response = requests.get(url)
html = response.content

#remove references Brackets
removeBrackets = re.sub(r'\[.*\]', '', html)
#remove Trailing 0's in numbers
removeTrails = removeBrackets.replace('0,000,001','')

soup = BeautifulSoup(removeTrails)

table = soup.find('table', {'class': 'sortable wikitable'})

# remove all extra tags in the HTML Tables
for div in soup.findAll('span', 'sortkey'):
    div.extract();
for div in soup.findAll('span', 'sorttext'):
    div.extract();

#scan through table
list_of_rows = []
for row in table.findAll('tr')[1:]:
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)
#write 
outfile = open("schoolshootings.csv", "wb")
writer = csv.writer(outfile)
writer.writerow([s.encode('utf8') if type(s) is unicode else s for s in row]) 
writer.writerow(["Date", "Location", "Deaths", "Injuries", "Description"])
writer.writerows(list_of_rows)

1 Answer:

Answer 0 (score: 1)

You need to use findAll instead of find for the table. If you change this line:

table = soup.find('table', {'class': 'sortable wikitable'})

to:

for table in soup.findAll('table', {'class': 'sortable wikitable'}):

and indent everything down through list_of_rows.append(list_of_cells) an extra 4 spaces, it will pick up all the other tables. You will also need to move list_of_rows = [] above the loop.
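To see the difference concretely, here is a minimal, self-contained sketch (using a made-up two-table snippet, not the Wikipedia page) showing that find returns only the first matching table while findAll returns all of them:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with two tables of the same class, standing in
# for the multiple "sortable wikitable" tables on the Wiki page.
html = """
<table class="sortable wikitable"><tr><td>a</td></tr></table>
<table class="sortable wikitable"><tr><td>b</td></tr></table>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() stops at the first match; findAll() returns a list of every match.
first = soup.find('table', {'class': 'sortable wikitable'})
every = soup.findAll('table', {'class': 'sortable wikitable'})

print(len(every))     # number of tables matched by findAll
print(first.td.text)  # cell text from the single table matched by find
```

Note that when you pass a class string containing a space, BeautifulSoup matches it against the exact class attribute value, so 'sortable wikitable' only matches tables whose class is written in that order.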

EDITED TO ADD

You have a bunch of regular expressions that you don't really need, because it is easier to just extract the spans. Also, extracting the spans with style display:none removes the hidden date strings you don't want. Since I removed the regular expressions, the display:none spans have to be extracted instead.

url = 'https://en.wikipedia.org/wiki/List_of_school_shootings_in_the_United_States'
html = urllib.request.urlopen(url).read()

soup = BeautifulSoup(html)

list_of_rows = []
for table in soup.findAll('table', {'class': 'sortable wikitable'}):
    # remove all extra tags in the HTML Tables
    for div in soup.findAll('span', 'sortkey'):
        div.extract();
    for div in soup.findAll('span', {'style':'display:none'}):
        div.extract();

    #scan through table
    for row in table.findAll('tr')[1:]:
        list_of_cells = []
        for cell in row.findAll('td'):
            list_of_cells.append(cell.text)
        list_of_rows.append(list_of_cells)
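The edited snippet above collects the rows but no longer writes the CSV. A minimal Python 3 sketch of that last step follows; the rows here are placeholder data standing in for list_of_rows, and the header comes from the question's code:

```python
import csv

# Placeholder rows standing in for the list_of_rows built by the scraper.
list_of_rows = [
    ['January 1, 2015', 'Somewhere, USA', '0', '1', 'Example row'],
]

# In Python 3, open the file in text mode with newline='' (not 'wb'),
# and pass an encoding instead of encoding each cell by hand.
with open('schoolshootings.csv', 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['Date', 'Location', 'Deaths', 'Injuries', 'Description'])
    writer.writerows(list_of_rows)
```

This sidesteps the question's unicode/encode juggling entirely, since Python 3's csv module handles text natively.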
