Scraping the text inside a div

Time: 2019-05-28 11:23:19

Tags: python-3.x

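I'm new to web scraping with Python. As a trial, I searched Google for "cities in Ohio". I want to scrape the results, that is, the city names shown in the image boxes (I only want the text). There are many divs on the page, and I can't work out how to target the div whose text is the city name. For example, I only want the text written under the Columbus image: the name "Columbus" itself, and likewise the names of the other cities in that area.

Could you please help me with this?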

import bs4
import requests

# The 'lxml' parser name below requires the lxml package to be installed
res = requests.get('https://www.google.com/search?rlz=1C1CHBF_enIN818IN818&ei=KejsXJTSLdu0rQGk3aeQDw&q=cities+in+Ohio&oq=cities+in+Ohio&gs_l=psy-ab.3..0i71l8.826656.826656..826671...0.0..0.0.0.......0....2j1..gws-wiz.N2bmaS9Bitw')
soup = bs4.BeautifulSoup(res.text, 'lxml')
print(type(soup))              # <class 'bs4.BeautifulSoup'>
print(soup.select('.wfg6Pb'))  # []

The output is always []. Please use the link in the code to get the results.

1 answer:

Answer 0: (score: 0)

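To get this working for me, I had to do two things:

  1. Add localization to the URL query parameters (hl=en&gl=en); otherwise I get results in Hebrew (I'm browsing from Israel...)
  2. Use a slightly more specific selector to identify the names themselves (otherwise I also get some unrelated information blocks)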

All in all, my code is as follows:

import bs4
import requests

# hl=en and gl=en in the query string request English-localized results
res = requests.get('https://www.google.com/search?hl=en&gl=en&rlz=1C1CHBF_enIN818IN818&ei=KejsXJTSLdu0rQGk3aeQDw&q=cities+in+Ohio&oq=cities+in+Ohio&gs_l=psy-ab.3..0i71l8.826656.826656..826671...0.0..0.0.0.......0....2j1..gws-wiz.N2bmaS9Bitw')
# The 'lxml' parser name requires the lxml package to be installed
soup = bs4.BeautifulSoup(res.content, 'lxml')
# Select only the name divs nested inside the city links
city_divs = soup.select('a.Mlb36b div.s3v9rd')
city_names = [c.text for c in city_divs]
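To see why the narrower selector matters, here is a small offline sketch. The markup is a hypothetical, simplified stand-in modeled on the class names used in this answer, not the real page, and it uses the built-in html.parser so that lxml is not required.

```python
import bs4

# Hypothetical sample markup: two city links plus one unrelated block
# that happens to share the same div classes.
SAMPLE_HTML = """
<a class="Mlb36b" href="#"><div class="BNeawe s3v9rd AP7Wnd">Columbus</div></a>
<a class="Mlb36b" href="#"><div class="BNeawe s3v9rd AP7Wnd">Cleveland</div></a>
<div class="BNeawe s3v9rd AP7Wnd">Unrelated info block</div>
"""

sample_soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# Broad selector: every div carrying the s3v9rd class
broad = [d.get_text(strip=True) for d in sample_soup.select("div.s3v9rd")]
# Narrow selector: only those divs that sit inside an <a class="Mlb36b">
specific = [d.get_text(strip=True)
            for d in sample_soup.select("a.Mlb36b div.s3v9rd")]

print(broad)     # ['Columbus', 'Cleveland', 'Unrelated info block']
print(specific)  # ['Columbus', 'Cleveland']
```

Class names like these are generated by Google and are likely to change over time, so the selector may need to be re-derived from the HTML that requests actually receives.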

For this search, the output of city_divs is:

[<div class="BNeawe s3v9rd AP7Wnd">Columbus</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Cleveland</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Cincinnati</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Dayton</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Toledo</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Akron</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Youngstown</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Findlay</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Kent</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Chillicothe</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Westerville</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Zanesville</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Wooster</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Elyria</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Hilliard</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Grove City</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Lorain</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Massillon</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Perrysburg</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Strongsville</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Cuyahoga Falls</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Maumee</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Reynoldsburg</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Stow</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Port Clinton</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Pickerington</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Steubenville</div>,
 <div class="BNeawe s3v9rd AP7Wnd">North Canton</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Gahanna</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Ashtabula</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Beachwood</div>,
 <div class="BNeawe s3v9rd AP7Wnd">New Philadelphia</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Miamisburg</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Wadsworth</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Bellefontaine</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Painesville</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Put‑in‑Bay</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Worthington</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Twinsburg</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Chagrin Falls</div>,
 <div class="BNeawe s3v9rd AP7Wnd">North Olmsted</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Barberton</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Canal Winchester</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Wright‑Patterson Air Force...</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Yellow Springs</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Shaker Heights</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Oberlin</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Upper Arlington</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Blue Ash</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Rocky River</div>,
 <div class="BNeawe s3v9rd AP7Wnd">Pataskala</div>]

and for city_names it is:

['Columbus',
 'Cleveland',
 'Cincinnati',
 'Dayton',
 'Toledo',
 'Akron',
 'Youngstown',
 'Findlay',
 'Kent',
 'Chillicothe',
 'Westerville',
 'Zanesville',
 'Wooster',
 'Elyria',
 'Hilliard',
 'Grove City',
 'Lorain',
 'Massillon',
 'Perrysburg',
 'Strongsville',
 'Cuyahoga Falls',
 'Maumee',
 'Reynoldsburg',
 'Stow',
 'Port Clinton',
 'Pickerington',
 'Steubenville',
 'North Canton',
 'Gahanna',
 'Ashtabula',
 'Beachwood',
 'New Philadelphia',
 'Miamisburg',
 'Wadsworth',
 'Bellefontaine',
 'Painesville',
 'Put‑in‑Bay',
 'Worthington',
 'Twinsburg',
 'Chagrin Falls',
 'North Olmsted',
 'Barberton',
 'Canal Winchester',
 'Wright‑Patterson Air Force...',
 'Yellow Springs',
 'Shaker Heights',
 'Oberlin',
 'Upper Arlington',
 'Blue Ash',
 'Rocky River',
 'Pataskala']
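Two entries in this list may need attention before further use: the hyphens in names such as Put‑in‑Bay appear to be non-breaking hyphens (U+2011) rather than ASCII '-', and at least one name is cut off with '...'. A minimal cleanup sketch follows, assuming that is indeed the character involved; clean_city_name is a hypothetical helper, not part of the answer's code.

```python
NON_BREAKING_HYPHEN = "\u2011"  # assumption: the hyphen-like character in the output

def clean_city_name(name):
    """Return (normalized_name, was_truncated) for one scraped name."""
    # Replace the non-breaking hyphen so comparisons like == 'Put-in-Bay' work
    normalized = name.replace(NON_BREAKING_HYPHEN, "-").strip()
    # Names shortened by the page end in an ellipsis; flag them, don't guess the rest
    truncated = normalized.endswith(("...", "\u2026"))
    return normalized, truncated

sample = ["Columbus", "Put\u2011in\u2011Bay", "Wright\u2011Patterson Air Force..."]
cleaned = [clean_city_name(n) for n in sample]
print(cleaned)
```

Truncated names are only flagged here; recovering the full name would need another source, such as the link target of the corresponding result.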