How to get rid of the blank lines above the text when using bs4

Time: 2016-05-14 12:05:36

Tags: python python-3.x parsing python-requests bs4

I am using bs4 (BeautifulSoup) to parse a website and pull out the specific headlines I want. My code looks like this:

import requests
from bs4 import BeautifulSoup
url = 'http://www.ewn.co.za/Categories/Local'
r = requests.get(url).text
soup = BeautifulSoup(r)
for i in soup.find_all(class_='article-short'):
    if i.a:
        print(i.a.text.replace('\n', '').strip())
    else:
        print(i.contents[0].strip())

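This code works, but the output starts with about 20 blank lines before the requested headlines from the site are printed. What is wrong with my code, or what can I do to get rid of the blank lines?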

1 answer:

Answer 0: (score: 0)

It happens because the page contains elements like this:

<article class="article-short">
<div class="thumb"><a href="http://ewn.co.za/2016/05/14/Contralesa-against-scrapping-initiation-due-to-cold-weather"><img alt="FILE: Boys who have undergone a circumcision ceremony walk near Qunu in the Eastern Cape in 2013. Picture: AFP." height="147" src="http://ewn.co.za/cdn/-%2fmedia%2f3C37CB28056746CD95FC913757AAD41C.ashx%3fas%3d1%26h%3d147%26w%3d234%26crop%3d1;waeb9b8157b3e310df" width="234"/></a></div>
<h6 class="h6-mega"><a href="http://ewn.co.za/2016/05/14/Contralesa-against-scrapping-initiation-due-to-cold-weather">Contralesa against scrapping initiation due to cold weather</a></h6>
</article>

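Here the first link wraps an image and has no text. `i.a` returns that first link, so `i.a.text` is an empty string and `print` outputs a blank line for it.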
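To see the effect in isolation, here is a minimal sketch that parses a shortened copy of that markup offline. The `example.com` URLs and the headline text are placeholders I made up, and `html.parser` is just one possible parser choice.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one article from the page (placeholder URLs and text).
html = """
<article class="article-short">
<div class="thumb"><a href="http://example.com/story"><img alt="photo" src="http://example.com/img.jpg"/></a></div>
<h6 class="h6-mega"><a href="http://example.com/story">Example headline</a></h6>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find(class_="article-short")

# article.a is the FIRST <a> in the article: the one that wraps the image.
first_link_text = article.a.text.strip()

# The headline lives in the <h6>, so read the text from there instead.
headline = article.h6.get_text(strip=True)

print(repr(first_link_text))
print(repr(headline))
```

The first `repr` shows an empty string, which is what ends up printed as a blank line in the question's loop.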

You should look for the h6 tag instead. Something like this should work:

import requests
from bs4 import BeautifulSoup
url = 'http://www.ewn.co.za/Categories/Local'
r = requests.get(url).text
soup = BeautifulSoup(r, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
for i in soup.find_all(class_='article-short'):
    title = (i.h6.text.replace('\n', '') if i.h6 else i.contents[0]).strip()
    if title:
        print(title)
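If you want something you can test without hitting the site, the same idea can be wrapped in a small function. This is only a sketch: `extract_titles` is a name I picked, the sample HTML is invented, and it simply skips articles with no h6 rather than falling back to `contents[0]`.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the non-empty h6 headline of every 'article-short' element."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for article in soup.find_all(class_="article-short"):
        heading = article.find("h6")
        if heading is None:
            continue  # no headline in this article (e.g. image only)
        # Collapse newlines and repeated spaces into single spaces.
        title = " ".join(heading.get_text().split())
        if title:
            titles.append(title)
    return titles


# Invented sample: one normal article, one image-only, one without a thumbnail.
sample = """
<article class="article-short">
<div class="thumb"><a href="#"><img alt="a" src="a.jpg"/></a></div>
<h6 class="h6-mega"><a href="#">
  First headline
</a></h6>
</article>
<article class="article-short">
<div class="thumb"><a href="#"><img alt="b" src="b.jpg"/></a></div>
</article>
<article class="article-short">
<h6 class="h6-mega"><a href="#">Second   headline</a></h6>
</article>
"""

titles = extract_titles(sample)
for title in titles:
    print(title)
```

With the live page you would pass `requests.get(url).text` in place of `sample`.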