Trying to add items to a list in Python

Time: 2018-04-30 12:40:17

Tags: python

I am trying to collect links from a website using BeautifulSoup.

from bs4 import BeautifulSoup
import requests

address="http://transcripts.cnn.com/TRANSCRIPTS/2018.04.29.html"
page = requests.get(address)
soup = BeautifulSoup(page.content, 'html.parser')

articles =[]
for links in soup.find_all('div', {'class':'cnnSectBulletItems'}):
    for link in soup.find_all('a'):
        article = link.get('href')
        articles.append(article)
        print(article)


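There are two problems:

  1. There are duplicate links.
  2. The print call shows that the code finds the links, but the list articles does not contain any elements.

Does anyone know what is going on?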

2 answers:

Answer 0: (score: 1)

You can use a set (an unordered collection with no duplicate elements) to remove duplicate links.

for links in soup.find_all('div', {'class':'cnnSectBulletItems'}):
    links = set(links.find_all('a'))
    for link in links:
        print(link.get('href')) 
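
Note that the set above is built once per div and holds tag objects, so the same href can still show up more than once across divs, and the iteration order is not guaranteed. A minimal sketch of an alternative, assuming the href strings have already been collected into a list (the hrefs list below is made-up sample data, not real page output), uses dict.fromkeys to drop repeats while keeping first-seen order:

```python
# Made-up sample data standing in for the collected href values.
hrefs = [
    "/TRANSCRIPTS/1804/29/cnr.21.html",
    "/TRANSCRIPTS/1804/29/cnr.22.html",
    "/TRANSCRIPTS/1804/29/cnr.21.html",  # repeated link
    None,  # link.get('href') returns None for an <a> tag without an href
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes repeats without reordering the links.
unique_hrefs = list(dict.fromkeys(h for h in hrefs if h is not None))
print(unique_hrefs)
```

This only compares the href strings, so two links that differ in text but point to the same address count as one.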

Answer 1: (score: -1)

Try:

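from bs4 import BeautifulSoup
import requests

address = "http://transcripts.cnn.com/TRANSCRIPTS/2018.04.29.html"
s = requests.get(address).content    # page HTML, fetched as in the question
soup = BeautifulSoup(s, 'html.parser')
articles = []
for links in soup.find_all('div', {'class': 'cnnSectBulletItems'}):
    for link in links.find_all('a'):    # search inside each div (links), not the whole soup
        print(link.get('href'))
        articles.append(link.get('href'))
print(articles)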

Output:

/TRANSCRIPTS/1804/29/cnr.21.html
/TRANSCRIPTS/1804/29/cnr.22.html
/TRANSCRIPTS/1804/29/cnr.03.html
/TRANSCRIPTS/1804/29/rs.01.html
/TRANSCRIPTS/1804/29/ndaysun.02.html
/TRANSCRIPTS/1804/29/sotu.01.html
[u'/TRANSCRIPTS/1804/29/cnr.21.html', u'/TRANSCRIPTS/1804/29/cnr.22.html', u'/TRANSCRIPTS/1804/29/cnr.03.html', u'/TRANSCRIPTS/1804/29/rs.01.html', u'/TRANSCRIPTS/1804/29/ndaysun.02.html', u'/TRANSCRIPTS/1804/29/sotu.01.html']
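
To see why searching links instead of soup matters, here is a small sketch that runs both versions of the inner loop on the same input. The HTML string is a hypothetical stand-in for the real page (two target divs plus one unrelated link), so the hrefs are invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: two target divs plus one unrelated link.
html = """
<a href="/home">Home</a>
<div class="cnnSectBulletItems"><a href="/a.html">A</a></div>
<div class="cnnSectBulletItems"><a href="/b.html">B</a></div>
"""
soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", {"class": "cnnSectBulletItems"})

# Searching the whole soup inside the loop returns every anchor on the page,
# once per div: 3 anchors x 2 divs = 6 entries, including repeats.
from_soup = [a.get("href") for div in divs for a in soup.find_all("a")]

# Searching each div returns only the anchors that div contains.
from_div = [a.get("href") for div in divs for a in div.find_all("a")]

print(from_soup)
print(from_div)
```

The first list repeats every link on the page once per matching div, which would explain the duplicates in the question; the second contains each div's own links only.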