Question about web scraping with Python bs4

Time: 2019-06-06 11:23:15

Tags: python web-scraping beautifulsoup

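I am trying to scrape weather data from the web to learn the basics of scraping, and I have run into a problem with the HTML structure of the site.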

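I have worked through the nested structure of the HTML page and can display the first value by printing d["precip"], but I don't understand why the next iteration of the loop fails to read it. print(i) still shows the loop iterating, so the loop itself is working.

Result of the first iteration: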

{'date': '19:30', 'hourly-date': 'Thu', 
'hidden-cell-sm description': 'Mostly Cloudy', 
'temp': '26°', 'feels': '30°', 'precip': '15%', 
'humidity': '84%', 'wind': 'SSE 12 km/h '}

After the first iteration:

{'date': 'None', 'hourly-date': 'None', 
'hidden-cell-sm description': 'None', 
'temp': 'None', 'feels': 'None', 'precip': 'None', 
'humidity': 'None', 'wind': 'None'}

HTML side: The values I want to scrape are "10" and "%". I managed this in the first iteration, but I don't know why it becomes "None" in the second.

<td class="precip" headers="precip" data-track-string="ls_hourly_ls_hourly_toggle" classname="precip">
   <div><span class="icon icon-font iconset-weather-data icon-drop-1" classname="icon icon-font iconset-weather-data icon-drop-1"></span>
      <span class="">
        <span>
          10
          <span class="Percentage__percentSymbol__2Q_AR">
            %
          </span>
        </span> 
      </span>
   </div>
</td>

Python code:

import requests
import pandas
from bs4 import BeautifulSoup

page = requests.get("https://weather.com/en-IN/weather/hourbyhour/l/0fcc6b573ec19677819071ea104e175b6dfc8f942f59554bc99d10c5cd0dbfe8")
content = page.content
soup = BeautifulSoup(content, "html.parser")
total = []
container = []
#all = soup.find("div", {"class": "locations-title hourly-page-title"}).find("h1").text
table = soup.find_all("table", {"class": "twc-table"})
for items in table:
    for i in range(len(items.find_all("tr")) - 1):
        d = {}
        try:
            d["date"] = items.find_all("span", {"class": "dsx-date"})[i].text
            d["hourly-date"] = items.find_all("div", {"class": "hourly-date"})[i].text
            d["hidden-cell-sm description"] = items.find_all("td", {"class": "hidden-cell-sm description"})[i].text
            d["temp"] = items.find_all("td", {"class": "temp"})[i].text
            d["feels"] = items.find_all("td", {"class": "feels"})[i].text

            #issue starts from here
            inclass = items.find_all("td", {"class": "precip"})[i]
            realtext = inclass.find_all("div", "")[i]
            d["precip"] = realtext.find_all("span", {"class": ""})[i].text
            #issue end

            d["humidity"] = items.find_all("td", {"class": "humidity"})[i].text
            d["wind"] = items.find_all("td", {"class": "wind"})[i].text

        except:
            d["date"] = "None"
            d["hourly-date"] = "None"
            d["hidden-cell-sm description"] = "None"
            d["temp"] = "None"
            d["precip"] = "None"
            d["feels"] = "None"
            d["precip"] = "None"
            d["humidity"] = "None"
            d["wind"] = "None"

        total.append(d)

df = pandas.DataFrame(total)
df = df.rename(index=str, columns={"date": "Date", "hourly-date": "weekdays", "hidden-cell-sm description": "Description"})
df = df.reindex(columns=['Date', 'weekdays', 'Description', 'temp', 'feels', 'percip', 'humidity', 'wind'])

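I expected to scrape all of the data, but as described above the precip values are missing while the other information is present. For reference, here is the result: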

     Date weekdays    Description temp feels  percip humidity          wind
0   19:30      Thu  Mostly Cloudy  26°   30°     NaN      84%  SSE 12 km/h 
1   20:00      Thu  Mostly Cloudy  26°   30°     NaN      86%  SSE 11 km/h 
2   21:00      Thu  Mostly Cloudy  26°   30°     NaN      86%  SSE 12 km/h 
3   22:00      Thu  Mostly Cloudy  26°   29°     NaN      86%  SSE 12 km/h 
4   23:00      Thu         Cloudy  26°   29°     NaN      87%  SSE 12 km/h 
5   00:00      Fri         Cloudy  26°   29°     NaN      87%    S 12 km/h 
6   01:00      Fri         Cloudy  26°   26°     NaN      88%    S 12 km/h 
7   02:00      Fri         Cloudy  26°   26°     NaN      87%    S 12 km/h 
8   03:00      Fri         Cloudy  29°   35°     NaN      87%    S 12 km/h 
9   04:00      Fri  Mostly Cloudy  29°   35°     NaN      87%    S 12 km/h 
10  05:00      Fri  Mostly Cloudy  28°   35°     NaN      87%  SSW 11 km/h 
11  06:00      Fri  Mostly Cloudy  28°   34°     NaN      88%  SSW 11 km/h 
12  07:00      Fri  Mostly Cloudy  29°   35°     NaN      87%  SSW 10 km/h 
13  08:00      Fri  Mostly Cloudy  29°   36°     NaN      84%  SSW 12 km/h 
14  09:00      Fri  Mostly Cloudy  29°   37°     NaN      82%  SSW 13 km/h 
15  10:00      Fri  Partly Cloudy  30°   37°     NaN      81%  SSW 14 km/h 

I'm new here and want to learn, so please tell me how I can improve the structure of my code. Many thanks.

2 Answers:

Answer 0: (score: 1)

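Your precip variable ends up with nothing, which is why you get that result. To fix it, you can look up the class Percentage__percentSymbol__2Q_AR and then use its previous_sibling to extract the value you want. The code below covers only the part you were having trouble with.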

import requests
import pandas
from bs4 import BeautifulSoup

page = requests.get("https://weather.com/en-IN/weather/hourbyhour/l/0fcc6b573ec19677819071ea104e175b6dfc8f942f59554bc99d10c5cd0dbfe8")
soup = BeautifulSoup(page.text, "html.parser")
total = []
for tr in soup.find("table",class_="twc-table").tbody.find_all("tr"):
    d = {}
    d["date"] = tr.find("span", class_="dsx-date").text
    d["precip"] = tr.find("span", class_="Percentage__percentSymbol__2Q_AR").previous_sibling
    total.append(d)

df = pandas.DataFrame(total,columns=['date','precip'])
print(df)
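
The previous_sibling idea can be tried offline against the cell markup posted in the question. The following is a minimal sketch, assuming the cell looks like that fragment (trimmed here to the relevant tags); the names CELL_HTML, symbol, number and combined are illustrative only. The class name Percentage__percentSymbol__2Q_AR looks build-generated and may change on the live site, so get_text(strip=True) on the whole cell is shown as an alternative that does not depend on it.

```python
from bs4 import BeautifulSoup

# The precipitation cell from the question, trimmed to the relevant tags.
CELL_HTML = """
<td class="precip" headers="precip">
  <div>
    <span class="">
      <span>
        10
        <span class="Percentage__percentSymbol__2Q_AR">
          %
        </span>
      </span>
    </span>
  </div>
</td>
"""

soup = BeautifulSoup(CELL_HTML, "html.parser")
cell = soup.find("td", class_="precip")

# The number is the text node immediately before the percent-symbol span.
symbol = cell.find("span", class_="Percentage__percentSymbol__2Q_AR")
number = symbol.previous_sibling.strip()  # NavigableString is a str subclass

# Alternative: gather every text piece in the cell, stripping each one.
combined = cell.get_text(strip=True)

print(number)
print(combined)
```

Calling strip() matters when the markup is indented as above, because the text node then carries the surrounding newlines and spaces.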

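Answer 1: (score: 0)

The find_all function always returns a list, and strip() removes whitespace from the beginning and end of a string. Also, the label percip in df = df.reindex(columns=['Date', 'weekdays', 'Description', 'temp', 'feels', 'percip', 'humidity', 'wind']) is wrong, because the key you defined in the dictionary is precip, as in d["precip"] = "None".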
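
Because find_all returns a list, indexing it with the row counter only works when the list is long enough. The sketch below uses a hypothetical two-row table (TABLE_HTML and both helper functions are made up for illustration) to show what happens when the row index is applied a second time inside a single cell, as the question's code does. That this is what sent the original loop into its except branch is an inference from the posted code, not something confirmed against the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table; the tag layout mirrors the question's precip cell.
TABLE_HTML = """
<table><tbody>
  <tr><td class="precip"><div><span> 10% </span></div></td></tr>
  <tr><td class="precip"><div><span> 5% </span></div></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(TABLE_HTML, "html.parser")
cells = soup.find_all("td", class_="precip")  # list-like result, one entry per row


def precip_reusing_index(i):
    # Mirrors the question: the row index i is applied again inside the cell.
    return cells[i].find_all("div")[i].text.strip()


def precip_per_cell(i):
    # Each cell holds a single div, so take the first match instead.
    return cells[i].find("div").text.strip()


first = precip_reusing_index(0)  # index 0 exists inside the cell
try:
    precip_reusing_index(1)      # only one div per cell, so [1] is out of range
    second_failed = False
except IndexError:
    second_failed = True

values = [precip_per_cell(i) for i in range(len(cells))]
print(first, second_failed, values)
```

A bare except around such a lookup hides the IndexError, which makes the failure look like missing data rather than an indexing mistake.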

import requests
import pandas
from bs4 import BeautifulSoup

page = requests.get("https://weather.com/en-IN/weather/hourbyhour/l/0fcc6b573ec19677819071ea104e175b6dfc8f942f59554bc99d10c5cd0dbfe8")
content = page.content
soup = BeautifulSoup(content, "html.parser")
total = []
container = []
tables = soup.find_all("table", {"class": "twc-table"})
for table in tables:
    for tr in table.find("tbody").find_all("tr"):
        # Default every field to "None"; each key is listed once.
        d = {"date":"None","hourly-date":"None","hidden-cell-sm description":"None","temp":"None",
             "feels":"None","precip":"None","humidity":"None","wind":"None"}

        for td in tr.find_all("td"):
            try:
                _class = td.get("class")
                if len(_class) > 1:
                    temp = 0
                    for cc in _class:
                        if "cell-hide" in cc:
                            temp+=1
                            break
                    if temp > 0:
                        continue

                if len(_class)>1 and  "description" in _class[1]:
                    d["hidden-cell-sm description"] = td.text.strip()

                elif "temp" in _class[0]:
                    d["temp"] = td.text.strip()

                elif "feels" in _class[0]:
                    d["feels"] = td.text.strip()

                elif "precip" in _class[0]:
                    d["precip"] = td.text.strip()

                elif "humidity" in _class[0]:
                    d["humidity"] = td.text.strip()

                elif "wind" in _class[0]:
                    d["wind"] = td.text.strip()

                else:
                    d["date"] = td.find("span", {"class": "dsx-date"}).text.strip()
                    d["hourly-date"] = td.find("div", {"class": "hourly-date"}).text.strip()
            except:
                pass

        total.append(d)

df = pandas.DataFrame(total)
df = df.rename(index=str, columns={"date": "Date", "hourly-date": "weekdays", "hidden-cell-sm description": "Description"})
df = df.reindex(columns=['Date', 'weekdays', 'Description', 'temp', 'feels', 'precip', 'humidity', 'wind'])
print(df)

Output:

     Date weekdays    Description temp feels precip humidity         wind
0   20:30      Thu  Mostly Cloudy  26°   30°    10%      85%  SSE 12 km/h
1   21:00      Thu  Mostly Cloudy  26°   30°     5%      85%  SSE 12 km/h
2   22:00      Thu  Mostly Cloudy  26°   30°     0%      85%  SSE 12 km/h
3   23:00      Thu  Mostly Cloudy  26°   29°     0%      87%  SSE 12 km/h
4   00:00      Fri         Cloudy  26°   29°     0%      87%    S 12 km/h
5   01:00      Fri         Cloudy  26°   26°     5%      88%    S 12 km/h
6   02:00      Fri         Cloudy  26°   26°    15%      88%    S 12 km/h
7   03:00      Fri  Mostly Cloudy  25°   25°    20%      88%    S 10 km/h
8   04:00      Fri  Mostly Cloudy  25°   29°    25%      88%    S 10 km/h
9   05:00      Fri  Mostly Cloudy  25°   28°    25%      88%  SSW 10 km/h
10  06:00      Fri  Mostly Cloudy  25°   28°    25%      89%  SSW 10 km/h
11  07:00      Fri  Mostly Cloudy  26°   29°    25%      88%  SSW 10 km/h
12  08:00      Fri  Mostly Cloudy  26°   29°    25%      84%  SSW 11 km/h
13  09:00      Fri  Partly Cloudy  27°   30°    25%      82%  SSW 12 km/h
14  10:00      Fri  Partly Cloudy  27°   30°    25%      81%  SSW 14 km/h
15  11:00      Fri  Partly Cloudy  27°   31°    15%      78%  SSW 15 km/h
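
The label mismatch this answer points out can be reproduced in isolation. This is a small sketch with made-up rows (the rows list is illustrative, not scraped data): reindex keeps only the labels it is given, and a label that does not exist in the frame produces a column filled with NaN rather than an error.

```python
import pandas as pd

# Made-up rows standing in for the scraped dictionaries.
rows = [{"temp": "26°", "precip": "10%"}, {"temp": "26°", "precip": "5%"}]
df = pd.DataFrame(rows)

# "percip" is not a column of df, so reindex creates it and fills it with NaN.
misspelled = df.reindex(columns=["temp", "percip"])

# With the label spelled as in the dictionary, the scraped values are kept.
correct = df.reindex(columns=["temp", "precip"])

print(misspelled)
print(correct)
```

This matches the all-NaN percip column in the question's result, where the other columns were filled in.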