Formatting data into a CSV file

Date: 2018-03-25 04:41:40

Tags: python csv web-scraping

I wrote this page scraper in Python with Beautiful Soup to pull data from a table, and now I want to save the results. The area I'm scraping is the infobox table on the right-hand side of the page. I need the bold labels on the left to correspond to the values on the right, so that "Key people" pairs with the CEO. I'm stuck on this and need some advice on the best way to format the output. Thanks.

import requests
import csv
from datetime import datetime
from bs4 import BeautifulSoup

# download the page
myurl = requests.get("https://en.wikipedia.org/wiki/Goodyear_Tire_and_Rubber_Company")
# create BeautifulSoup object
soup = BeautifulSoup(myurl.text, 'html.parser')

# pull the element with the 'logo' class, which contains the tire company's name
name = soup.find(class_='logo')
# pull the div inside that element
nameinfo = name.find('div')

# just grab the text inside the div
nametext = nameinfo.text

# print information about goodyear logo on wiki page
#print(nameinfo)

# now, print type of company, private or public
#status = soup.find(class_='category')
#for link in soup.select('td.category a'):
    #print(link.text)

# now get the ceo information
#for employee in soup.select('td.agent a'):
    #print(employee.text)

# print area served
#area = soup.find(class_ = 'infobox vcard')
#print(area)


# grab information in bold on the left hand side
vcard = soup.find(class_='infobox vcard')
rows = vcard.find_all('tr')
for row in rows:
    cols = row.find_all('th')
    cols = [x.text.strip() for x in cols]
    print(cols)

# grab information in bold on the right hand side
vcard = soup.find(class_='infobox vcard')
rows = vcard.find_all('tr')
for row in rows:
    cols2 = row.find_all('td')
    cols2 = [x.text.strip() for x in cols2]
    print(cols2)

# save to csv file named index
with open('index.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)  # actually write to the file
    writer.writerow([cols, cols2, datetime.now()])  # append the time
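For context on what goes wrong above: cols and cols2 are reassigned on every pass through their loops, so by the time the file is written they only hold the last table row, and the final writerow dumps both lists into a single CSV line. A minimal sketch of the pairing idea, assuming the th and td lists happen to line up one-to-one (which is not guaranteed in an infobox):

# hypothetical reworking: collect all labels (th) and values (td), then zip into pairs
labels = [th.text.strip() for th in vcard.find_all('th')]
values = [td.text.strip() for td in vcard.find_all('td')]

with open('index.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for label, value in zip(labels, values):
        writer.writerow([label, value, datetime.now()])

The answer below avoids that alignment assumption entirely by searching each table row for both cell types at once.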

1 Answer:

Answer 0 (score: 0)

You need to reorder your code a bit. You can also find the th and td cells at the same time, which solves the problem of keeping the two columns in sync:

import requests
import csv
from datetime import datetime
from bs4 import BeautifulSoup

myurl = requests.get("https://en.wikipedia.org/wiki/Goodyear_Tire_and_Rubber_Company")
soup = BeautifulSoup(myurl.text, 'html.parser')
vcard = soup.find(class_='infobox vcard')

# Python 3: open in text mode with newline='' (the original 'wb' mode is Python 2 only)
with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)

    # skip the first row (the logo), then take each row's th (label) and
    # td (value) cells together so they stay paired
    for row in vcard.find_all('tr')[1:]:
        cols = row.find_all(['th', 'td'])
        # flatten the text, drop any non-ASCII characters, and append a timestamp
        cleaned = [x.text.strip().replace('\n', ' ').encode('ascii', 'ignore').decode() for x in cols]
        csv_output.writerow(cleaned + [datetime.now()])

This creates an output.csv file like:

Type,Public,2018-03-27 17:12:45.146000
Tradedas,NASDAQ:GT S&P 500 Component,2018-03-27 17:12:45.147000
Industry,Manufacturing,2018-03-27 17:12:45.147000
Founded,"August29, 1898; 119 years ago(1898-08-29) Akron, Ohio, U.S.",2018-03-27 17:12:45.147000
Founder,Frank Seiberling,2018-03-27 17:12:45.147000
Headquarters,"Akron, Ohio, U.S.",2018-03-27 17:12:45.148000
Area served,Worldwide,2018-03-27 17:12:45.148000
Key people,"Richard J. Kramer (Chairman, President and CEO)",2018-03-27 17:12:45.148000
Products,Tires,2018-03-27 17:12:45.148000
Revenue,US$ 15.158 billion[1](2016),2018-03-27 17:12:45.149000
Operating income,US$ 1.52 billion[1](2016),2018-03-27 17:12:45.149000
Net income,US$ 1.264 billion[1](2016),2018-03-27 17:12:45.149000
Total assets,US$ 16.511 billion[1](2016),2018-03-27 17:12:45.150000
Total equity,US$ 4.507 billion[1](2016),2018-03-27 17:12:45.150000
Number of employees,"66,000[1](2017)",2018-03-27 17:12:45.150000
Subsidiaries,List of subsidiaries,2018-03-27 17:12:45.151000
Website,goodyear.com,2018-03-27 17:12:45.151000
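
Since the label always lands in the first column, the file can be read back into a dict keyed on it. A minimal sketch, assuming the three-column layout shown above:

import csv

with open('output.csv', newline='') as f:
    info = {row[0]: row[1] for row in csv.reader(f)}

print(info['Key people'])  # Richard J. Kramer (Chairman, President and CEO)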