Collecting link text and link href from a Google search

Time: 2019-03-28 07:59:08

Tags: python web-scraping

I am trying to collect the links and link text from a Google search (only the first 10 results). This is my code:

import requests
from lxml import html
import time
import re
headers={'User-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763'}
sentence = "hello world"
url = 'https://google.com/search?q={}'.format(sentence)
res= requests.get(url, headers=headers)
tree= html.fromstring(res.text)
li = tree.xpath("//a[@href]")
y = [link for link in li if link.get('href').startswith(("https://", "http://")) if "google" not in link.get('href')][:10]
for i in y:
    print("{}:\t{}".format(i.text_content(), i.get('href')))

This is the output:

10
1:56hello world:    https://www.youtube.com/watch?v=Yw6u6YkTgQ4
4:23BUMP OF CHICKEN「Hello,world!」:  https://www.youtube.com/watch?v=rOU4YiuaxAM
5:24Lady Antebellum - Hello World:  https://www.youtube.com/watch?v=al2DFQEZl4M
"Hello, World!" program - Wikipediahttps://en.wikipedia.org/wiki/%22Hello,_World!%22_program:   https://en.wikipedia.org/wiki/%22Hello,_World!%22_program
Hello World (disambiguation):   https://en.wikipedia.org/wiki/Hello_World_(disambiguation)
Sanity check:   https://en.wikipedia.org/wiki/Sanity_check
Just another Perl hacker:   https://en.wikipedia.org/wiki/Just_another_Perl_hacker
Hello, World! - Learn Python - Free Interactive Python Tutorialhttps://www.learnpython.org/en/Hello,_World!:    https://www.learnpython.org/en/Hello,_World!
Hello World Kids: HWKhelloworldkids.org/:   http://helloworldkids.org/
About Us:   http://helloworldkids.org/about-us/

The list is correct, but sometimes I get duplicate links when printing. How can I remove the duplicate links from the output?

1 answer:

Answer 0 (score: 0)

You can use the code below. I made a few changes to your code so that each href is only stored once:

import requests
from lxml import html

headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763'}
sentence = "hello world"
url = 'https://google.com/search?q={}'.format(sentence)
res = requests.get(url, headers=headers)
tree = html.fromstring(res.text)
li = tree.xpath("//a[@href]")
# Keep the first 10 absolute http(s) links that do not point back to Google
y = [link for link in li
     if link.get('href').startswith(("https://", "http://"))
     if "google" not in link.get('href')][:10]

# Store each href only once, preserving the original order
links = []
for i in y:
    # print("{}:\t{}".format(i.text_content(), i.get('href')))
    if i.get('href') not in links:
        links.append(i.get('href'))

for l in links:
    print(l)

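The list "links" will contain only distinct links. Note that the duplicates are removed after the [:10] slice, so fewer than 10 links may be printed, and only the href is printed, not the link text.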
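If you want to keep the link text as well and still end up with up to 10 distinct results, one option is to deduplicate before applying the limit. The following is only a minimal sketch under some assumptions: `unique_links` is a hypothetical helper name (not part of requests or lxml), it takes plain `(text, href)` pairs, and the sample data is made up so the snippet runs without a network request. It uses a set for the membership check instead of scanning a list.

```python
def unique_links(pairs, limit=10):
    """Return up to `limit` (text, href) pairs, keeping the first occurrence of each href."""
    seen = set()    # hrefs already emitted
    result = []     # output in original order
    for text, href in pairs:
        if len(result) >= limit:
            break
        if href in seen:
            continue  # duplicate href: skip it
        seen.add(href)
        result.append((text, href))
    return result


# Made-up sample data standing in for the (text, href) pairs of the scraped links
sample = [
    ("Hello World Kids", "http://helloworldkids.org/"),
    ("About Us", "http://helloworldkids.org/about-us/"),
    ("Hello World Kids (again)", "http://helloworldkids.org/"),
]

for text, href in unique_links(sample):
    print("{}:\t{}".format(text, href))
```

With the code from the question, the pairs could be built from the filtered elements, for example `((a.text_content(), a.get('href')) for a in li if ...)` using the same filter conditions but without the `[:10]` slice, so that the limit is applied after duplicates are dropped.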
