Unable to scrape a website using Beautiful Soup

Time: 2018-05-07 07:49:18

Tags: python web-scraping beautifulsoup

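I am following the tutorial here on scraping websites with Python and BeautifulSoup. I am trying to scrape a website run by my government (for research purposes), but it gives me this error:

Traceback (most recent call last):
  File "C:/Python27/scrap web.py", line 8, in <module>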
    name = name_box.text.strip()
AttributeError: 'NoneType' object has no attribute 'text'

I tried another website like this one, and it worked. When I open my government's site and use "View page source", I cannot see code like <table id="tableLeftBottom">. So how can I scrape data from this site?

import urllib2
from bs4 import BeautifulSoup
quote_page = "https://bps.go.id/linkTableDinamis/view/id/1116"
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, "html.parser")

name_box = soup.find("table", attrs={"id": "tableRightBottom"})
name = name_box.text.strip()
print name

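3 Answers:

Answer 0 (score: 1)

To get the data from that page, you need to send a POST request to this URL along with the necessary parameters, or you can use a browser simulator. The first option is the easier one, though. Here is how to fetch the data with a POST request: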

import requests
from bs4 import BeautifulSoup

URL = "https://bps.go.id/mod/Layout/variabelView.php"
payload = "valueDataSelect=**98+--+189+--+102+--+63+--+9818910263+--+1**98+--+190+--+102+--+63+--+9819010263+--+1**98+--+191+--+102+--+63+--+9819110263+--+1**98+--+189+--+105+--+63+--+9818910563+--+1**98+--+190+--+105+--+63+--+9819010563+--+1**98+--+191+--+105+--+63+--+9819110563+--+1**98+--+189+--+107+--+63+--+9818910763+--+1**98+--+190+--+107+--+63+--+9819010763+--+1**98+--+191+--+107+--+63+--+9819110763+--+1**98+--+189+--+108+--+63+--+9818910863+--+1**98+--+190+--+108+--+63+--+9819010863+--+1**98+--+191+--+108+--+63+--+9819110863+--+1**98+--+189+--+109+--+61+--+9818910961+--+1**98+--+190+--+109+--+61+--+9819010961+--+1**98+--+191+--+109+--+61+--+9819110961+--+1**98+--+189+--+110+--+61+--+9818911061+--+1**98+--+190+--+110+--+61+--+9819011061+--+1**98+--+191+--+110+--+61+--+9819111061+--+1**98+--+189+--+111+--+61+--+9818911161+--+1**98+--+189+--+111+--+62+--+9818911162+--+1**98+--+190+--+111+--+61+--+9819011161+--+1**98+--+190+--+111+--+62+--+9819011162+--+1**98+--+191+--+111+--+61+--+9819111161+--+1**98+--+191+--+111+--+62+--+9819111162+--+1**98+--+189+--+112+--+61+--+9818911261+--+1**98+--+189+--+112+--+62+--+9818911262+--+1**98+--+190+--+112+--+61+--+9819011261+--+1**98+--+190+--+112+--+62+--+9819011262+--+1**98+--+191+--+112+--+61+--+9819111261+--+1**98+--+191+--+112+--+62+--+9819111262+--+1**98+--+189+--+113+--+61+--+9818911361+--+1**98+--+189+--+113+--+62+--+9818911362+--+1**98+--+190+--+113+--+61+--+9819011361+--+1**98+--+190+--+113+--+62+--+9819011362+--+1**98+--+191+--+113+--+61+--+9819111361+--+1**98+--+191+--+113+--+62+--+9819111362+--+1**98+--+189+--+114+--+61+--+9818911461+--+1**98+--+189+--+114+--+62+--+9818911462+--+1**98+--+190+--+114+--+61+--+9819011461+--+1**98+--+190+--+114+--+62+--+9819011462+--+1**98+--+191+--+114+--+61+--+9819111461+--+1**98+--+191+--+114+--+62+--+9819111462+--+1**98+--+189+--+115+--+61+--+9818911561+--+1**98+--+189+--+115+--+62+--+9818911562+--+1**98+--+190+--+115+--+61+--+9819011561+--+1**98+--+190+--+115+--+62+--+9819011562
+--+1**98+--+191+--+115+--+61+--+9819111561+--+1**98+--+191+--+115+--+62+--+9819111562+--+1**98+--+189+--+116+--+61+--+9818911661+--+1**98+--+189+--+116+--+62+--+9818911662+--+1**98+--+190+--+116+--+61+--+9819011661+--+1**98+--+190+--+116+--+62+--+9819011662+--+1**98+--+191+--+116+--+61+--+9819111661+--+1**98+--+191+--+116+--+62+--+9819111662+--+1**98+--+189+--+117+--+61+--+9818911761+--+1**98+--+189+--+117+--+62+--+9818911762+--+1**98+--+190+--+117+--+61+--+9819011761+--+1**98+--+190+--+117+--+62+--+9819011762+--+1**98+--+191+--+117+--+61+--+9819111761+--+1**98+--+191+--+117+--+62+--+9819111762+--+1&wilayahDataSelect=1%23%23~2~3~4~5~6~7~8~9~10~11~12~13~14~15~16~17~18~19~20~21~22~23~24~25~26~27~28~29~30~31~32~33~34~35~1%40%40%24%24%24%40%40&keteranganDataSelect=**Gini+Rasio++--+Perkotaan+--+2002+--+Tahunan**Gini+Rasio++--+Perdesaan+--+2002+--+Tahunan**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2002+--+Tahunan**Gini+Rasio++--+Perkotaan+--+2005+--+Tahunan**Gini+Rasio++--+Perdesaan+--+2005+--+Tahunan**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2005+--+Tahunan**Gini+Rasio++--+Perkotaan+--+2007+--+Tahunan**Gini+Rasio++--+Perdesaan+--+2007+--+Tahunan**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2007+--+Tahunan**Gini+Rasio++--+Perkotaan+--+2008+--+Tahunan**Gini+Rasio++--+Perdesaan+--+2008+--+Tahunan**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2008+--+Tahunan**Gini+Rasio++--+Perkotaan+--+2009+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2009+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2009+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2010+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2010+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2010+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2011+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2011+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2011+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2011+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2011+-
-+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2011+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2012+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2012+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2012+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2012+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2012+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2012+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2013+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2013+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2013+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2013+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2013+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2013+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2014+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2014+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2014+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2014+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2014+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2014+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2015+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2015+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2015+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2015+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2015+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2015+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2016+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2016+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2016+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2016+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2016+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2016+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2017+--+Semester+1
+(Maret)**Gini+Rasio++--+Perkotaan+--+2017+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2017+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2017+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2017+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2017+--+Semester+2+(September)&kirim=3&layout=Var"

with requests.Session() as s:
    # Add to the session's default headers instead of replacing them.
    s.headers.update({
        "User-Agent": "Mozilla/5.0",
        "Content-Type": "application/x-www-form-urlencoded",
    })
    html = s.post(URL, data=payload).text
    soup = BeautifulSoup(html, "lxml")
    # Each <tr> of the table becomes one list of cell texts.
    for items in soup.find(id="tableRightBottom").find_all("tr"):
        data = [item.text for item in items.find_all("td")]
        print(data)

Output:

[' - ', '0.332', '0.289', '0.301', '0.291', '0.312', '0.353', '0.370', '0.337', '0.407', '0.404', '0.382', '0.358', '0.380', '0.367', '0.368', '0.343', '0.362', '0.347', '0.334', ' - ', '0.239', '0.257', '0.253', '0.250', '0.261', '0.280', '0.269', '0.271', '0.260', '0.256', '0.254', '0.259', '0.277', '0.292', '0.293', '0.288', '0.296', '0.293', '0.299', ' - ', '0.288', '0.285', '0.290', '0.288', '0.301', '0.326', '0.326', '0.320', '0.341', '0.341', '0.331', '0.325', '0.337', '0.334', '0.339', '0.333', '0.341', '0.329', '0.329']

And so on.
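
The payload above is a hand-written, already URL-encoded string. As a minimal sketch, a body of the same shape can be built from a dict with the standard library's urllib.parse.urlencode. The field values below are shortened, illustrative stand-ins (an assumption), not the full strings the site expects; the field names are taken from the payload above.

```python
from urllib.parse import urlencode

# Illustrative, shortened values (assumption): a real request needs the full
# strings from the payload above.
fields = {
    "valueDataSelect": "**98 -- 189 -- 102 -- 63 -- 9818910263 -- 1",
    "keteranganDataSelect": "**Gini Rasio  -- Perkotaan+Perdesaan -- 2002 -- Tahunan",
    "kirim": "3",
    "layout": "Var",
}

# Spaces become "+", a literal "+" becomes "%2B"; safe="*" leaves "*" unescaped
# so the result matches the style of the hand-written payload.
body = urlencode(fields, safe="*")
print(body)
```

Alternatively, passing a dict as data= to requests.post lets requests form-encode it and set the Content-Type header itself.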

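Answer 1 (score: 0)

This happens because the site's HTML does not contain the data. The data is rendered by JavaScript inside the dataDynamic div, and it comes from the endpoint https://bps.go.id/mod/Layout/variabelView.php

If you want to get the data, you can use selenium or requests_html.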
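
This also explains the AttributeError in the question: soup.find returns None when the parsed HTML has no matching element, and calling .text on None fails. A minimal sketch of checking for that case follows; the HTML string is a simplified stand-in for what the server sends (an assumption), and table_text is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the static HTML (assumption): the container exists,
# but the table is only added later by JavaScript.
static_html = '<html><body><div id="dataDynamic"></div></body></html>'


def table_text(html, table_id="tableRightBottom"):
    """Return the table's text, or None if the table is not in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"id": table_id})
    return table.get_text(strip=True) if table is not None else None


print(table_text(static_html))
```

If the function returns None for the downloaded page, the table is not in the static HTML and has to be fetched from the endpoint or rendered in a browser.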

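Answer 2 (score: 0)

First, the node you are looking for has the tag name "th" and the ID "th2b". But that content is created by JavaScript; when you open the site, you can see it loading. So you should use a "headless browser".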

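import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

quote_page = "https://bps.go.id/linkTableDinamis/view/id/1116"

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--disable-infobars")
chrome_options.add_argument("--hide-scrollbars")
chrome_options.add_argument("--dns-prefetch-disable")
chrome_options.add_argument("--disable-extensions")
chrome_options.binary_location = "your chrome path"  # placeholder: path to the Chrome binary
driver = webdriver.Chrome(options=chrome_options)
driver.maximize_window()
driver.get(quote_page)  # get() returns None; the rendered HTML is in page_source
time.sleep(10)  # crude wait for the JavaScript to finish rendering
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, "html.parser")

name_box = soup.find("th", attrs={"id": "th2b"})
name = name_box.text.strip()
print(name)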

You will get the text.