使用beautifulsoup python在h3和p标签中刮取文本

时间:2019-06-06 13:28:18

标签: python beautifulsoup web-crawler

我有使用Python,BeautifulSoup的经验,但是我很想从网站上抓取数据并将其存储为csv文件。我需要的单个数据样本编码如下(一行数据)。

       ...body and not nested divs...       
            <h3 class="college">
                <span class="num">1.</span> <a href="https://www.stanford.edu/">Stanford University</a>
                </h3>
                <div class="he-mod" data-block="paragraph-9"></div>
                <p class="school-location">Stanford, CA</p>
             ...body and not nested divs...
        <h3 id="MIT" class="college">
        <span class="num">2.</span> <a href="https://web.mit.edu/">Massachusetts Institute of Technology (MIT)</a>
        </h3>
        <div class="he-mod" data-block="paragraph-14"></div>
        <p class="school-location">Cambridge, MA</p>
...body and not nested divs... 
    <h3 id="Berkeley" class="college">
    <span class="num">3.</span> <a href="https://www.berkeley.edu/">University of California Berkeley</a>
    </h3>
    <div class="he-mod" data-block="paragraph-19"></div>
    <p class="school-location">Berkeley, CA</p>
...body and not nested divs... 

我希望使用h3获得链接和名称,并在

中获得文本(我可以这样做,但第一部分不能这样做) 但是用我的代码,我只能找到斯坦福,尽管我是find_all(class _ ='colleges')

我的代码

    import requests
from bs4 import BeautifulSoup



page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
soup = BeautifulSoup(page.text, 'html.parser')


college_name_list = soup.find(class_='college')
college_name_list_items = college_name_list.find_all('a')
for college_name in college_name_list_items:
    print(college_name.prettify())

输出

<a href="https://www.stanford.edu/">
    Stanford University
   </a>

我也希望获得其他具有相同class = college但ID不同的大学

请帮助我获取它们;我可以自己安排.csv。

Source Website to be Scraped,如果可以的话,请告诉我我应该看什么div / class或其他内容!

3 个答案:

答案 0 :(得分:1)

尝试使用带有<h3>标签的find_all,然后找到<a>,然后提取texthref值。

import requests
from bs4 import BeautifulSoup

page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
soup = BeautifulSoup(page.text, 'html.parser')

college_name=[]

college_name_list = soup.find_all('h3',class_='college')
for college in college_name_list:
  if college.find('a'):
    college_name.append(college.find('a')['href'])
    college_name.append(college.find('a').text)

print(college_name)

输出:以列表格式。

['https://www.stanford.edu/', 'Stanford University', 'https://web.mit.edu/', 'Massachusetts Institute of Technology (MIT)', 'https://www.berkeley.edu/', 'University of California Berkeley', 'https://www.harvard.edu/', 'Harvard University', 'https://www.princeton.edu/', 'Princeton University', '/pennsylvania-education/carnegie-mellon-university-online/', 'Carnegie Mellon University', 'https://www.utexas.edu/', 'The University of Texas at Austin', 'https://www.cornell.edu/', 'Cornell University', 'https://www.ucla.edu/', 'University of California, Los Angeles (UCLA)', '/california-education/university-southern-california-online/', 'University of Southern California', 'https://www.caltech.edu/', 'California Institute of Technology (Caltech)', 'https://www.utoronto.ca/', 'University of Toronto', 'https://illinois.edu/', 'University of Illinois at Urbana-Champaign', 'https://ucsd.edu/', 'University of California in San Diego', 'https://www.umich.edu/', 'University of Michigan', 'https://www.umd.edu/', 'University of Maryland, College Park', 'https://www.ethz.ch/en.html', 'Swiss Federal Institute of Technology', 'https://www.technion.ac.il/en/home-2/', 'Technion-Israel Institute of Technology', 'https://www.osu.edu/', 'Ohio State University', 'https://english.tau.ac.il/', 'Tel Aviv University', '/indiana-education/purdue-university-online/', 'Purdue University', 'https://www.gatech.edu/', 'Georgia Institute of Technology', 'https://www.cam.ac.uk/', 'University of Cambridge', 'https://www.ntu.edu.tw/english/', 'National Taiwan University', 'http://ac.cs.tsinghua.edu.cn', 'Tsinghua University', 'https://www.imperial.ac.uk/', 'The Imperial College of Science, Technology, and Medicine', 'https://www.kau.edu.sa/home_english.aspx', 'King Abdulaziz University', 'https://www.tum.de/en/homepage/', 'Technical University Munich', 'https://uci.edu/', 'University of California, Irvine', 'https://www.ucdavis.edu/', 'University of California, Davis', 'https://www.columbia.edu/', 'Columbia University', '/online-colleges/arizona-state-university-online/', 'Arizona State University', 'https://www.ntu.edu.sg/Pages/home.aspx', 'Nanyang Technological University', 'https://www.ox.ac.uk/', 'University of Oxford', '/online-colleges/northwestern-university-online/', 'Northwestern University', 'https://www.epfl.ch/en/home/', 'Swiss Federal Institute of Technology Lausanne', 'https://www.nyu.edu/', 'New York University', 'https://www.kau.edu.sa/home_english.aspx', 'The Chinese University of Hong Kong', '/north-carolina-education/university-north-carolina-online/', 'University of North Carolina at Chapel Hill', 'https://www.ust.hk/', 'The Hong Kong University of Science and Technology', 'https://twin-cities.umn.edu/', 'University of Minnesota, Twin Cities', 'https://www.zju.edu.cn/english/', 'Zhejiang University', 'https://www.ugr.es/en/', 'University of Granada', 'https://www.ucl.ac.uk/', 'University College London', 'https://www.cityu.edu.hk/', 'City University of Hong Kong', 'https://www.ubc.ca/', 'University of British Columbia', 'https://www.nd.edu/', 'University of Notre Dame', 'http://www.nus.edu.sg/', 'The National University of Singapore', 'http://en.sjtu.edu.cn/', 'Shanghai Jiao Tong University', 'https://www.yale.edu/', 'Yale University', 'https://www.washington.edu/', 'University of Washington', '/north-carolina-education/duke-university-online/', 'Duke University', 'https://www.colorado.edu/', 'University of Colorado at Boulder', 'https://www.ku.dk/english/', 'University of Copenhagen', 'https://www.ucsb.edu/', 'University of California, Santa Barbara', 'https://www.manchester.ac.uk/', 'University of Manchester', 'https://newbrunswick.rutgers.edu/', 'Rutgers University', 'https://www.rice.edu/', 'Rice University', 'https://www.kuleuven.be/english/', 'KU Leuven', 'https://www.utah.edu/', 'University of Utah', 'https://msu.edu/', 'Michigan State University', 'https://www.tamu.edu/', 'Texas A&M University', 'http://english.pku.edu.cn/', 'Peking University', 'https://www.psu.edu/', 'Pennsylvania State University - University Park', 'https://www.udel.edu/', 'University of Delaware', 'http://en.xjtu.edu.cn/', 'Xian Jiao Tong University', 'http://english.hust.edu.cn/', 'Huazhong University of Science and Technology', 'http://en.hit.edu.cn/', 'Harbin Institute of Technology', 'https://www.sfu.ca/', 'Simon Fraser University', 'https://www.polyu.edu.hk/web/en/home/', 'The Hong Kong Polytechnic University', 'https://www.tue.nl/en/', 'Eindhoven University of Technology', 'https://www.nctu.edu.tw/index.php/en', 'National Chiao Tung University', 'https://en.xidian.edu.cn/', 'Xidian University', 'https://www.ujaen.es/serv/vicint/home/index', 'University of Jaen', 'https://www.kaust.edu.sa/en', 'King Abdullah University of Science and Technology', 'https://www.jhu.edu/', 'Johns Hopkins University', 'https://www.upenn.edu/', 'University of Pennsylvania', 'https://www.wisc.edu/', 'University of Wisconsin', 'https://www.ed.ac.uk/home', 'The University of Edinburgh', 'https://www.mcgill.ca/', 'McGill University', 'https://www.bristol.ac.uk/', 'University of Bristol', 'https://new.huji.ac.il/en', 'The Hebrew University of Jerusalem', 'https://www.ugent.be/en', 'Ghent University', 'https://www.brown.edu/', 'Brown University', 'https://www.weizmann.ac.il/pages/', 'Weizmann Institute of Science', 'https://www.unsw.edu.au/', 'University of New South Wales', 'https://www.ualberta.ca/', 'University of Alberta', 'https://www.southampton.ac.uk/', 'University of Southampton', 'https://www.dtu.dk/english', 'Technical University of Denmark', 'https://en.uniroma1.it/', 'Sapienza University of Rome', 'https://en.ustc.edu.cn/', 'The University of Science and Technology of China', 'https://www.uic.edu/', 'University of Illinois at Chicago', 'https://www.hku.hk/', 'University of Hong Kong', 'https://uwaterloo.ca/', 'University of Waterloo', 'https://www.kaist.edu/html/en/', 'Korea Advanced Institute of Science and Technology', 'https://www.uh.edu/', 'University of Houston', 'http://en.dlut.edu.cn/', 'Dalian University of Technology', 'https://en.whu.edu.cn/', 'Wuhan University', '/online-colleges/new-jersey-institute-technology-online/', 'New Jersey Institute of Technology']

不过,您可以使用pandas dataframe并将所有数据导入为csv格式。

要安装熊猫,您只需通过命令行即可运行。

  

pip安装熊猫

并使用以下代码。

import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
soup = BeautifulSoup(page.text, 'html.parser')
college_name=[]
college_name_url=[]
college_name_list = soup.find_all('h3',class_='college')
for college in college_name_list:
  if college.find('a'):
    college_name_url.append(college.find('a')['href'])
    college_name.append(college.find('a').text)
df = pd.DataFrame({"college_name":college_name,"college_name_url":college_name_url})
df.to_csv('college_name.csv')

您的csv文件将是这样。

Csv output

答案 1 :(得分:0)

请尝试以下代码:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
soup = BeautifulSoup(page.text, 'html.parser')college_name_list = soup.find_all(class_='college')
college_name_list_items =[]
for i in college_name_list:
    college_name_list_items.append(i.find_all('a'))


for college_name in college_name_list_items:
    print(college_name)

答案 2 :(得分:0)

您需要使用class =“ college”获得h3:

import requests

list_colleges = {}

result = requests.get('https://www.stanford.edu/')
if (result.status_code == 200):
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(result.content)
    colleges = soup.findAll('h3', {'class': 'colleges'})
    for college in colleges:
        id_college = college.get('id')
        if not (id_college is None):
            list_colleges[id] = college # Store the inner html