Scraping links from Google

Date: 2016-02-02 18:03:59

Tags: python google-crawlers

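I am trying to scrape links related to a specific field, namely computer science, but the links I get in the output look very strange. When I try to open them in a web browser, I get a page-not-found error.

Here is the code: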

from bs4 import BeautifulSoup
import requests

a = input("search:")
page = requests.get("https://www.google.dz/search?q="+a)
soup = BeautifulSoup(page.content)
links = soup.findAll("a")
for link in links:
    if link['href'].startswith('/url?q='):
        print (link['href'].replace('/url?q=',''),'\n')
        # f = open('links.txt','a+')
        # f.write(link['href'].replace('/url?q=',''))
        # f.close()

Output:

search:"data"
('http://www.zdnet.fr/actualites/data-lakes-ne-les-confondez-pas-avec-un-data-warehouse-39832052.htm&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQqQIIFTAA&usg=AFQjCNFZzS0E1EDF51VtLq-KWuxvg2HPeg', '\n')
('http://www.journaldugeek.com/2016/02/01/microsoft-planche-sur-des-data-centers-sous-marins/&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQqQIIFzAB&usg=AFQjCNGjc0-ev9X5MigD0-mzSx0zr5-6Qw', '\n')
('http://www.01net.com/actualites/microsoft-veut-noyer-vos-donnees-et-ses-data-centers-en-pleine-mer-947974.html&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQqQIIGTAC&usg=AFQjCNEB9fsmDeARKnjwjyfe90bpJwJWcA', '\n')
('http://rmsnews.com/big-data-recrutement-par-jean-christophe-anna/&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQwW4IHTAD&usg=AFQjCNEc125DUcwyX9QTCNus0hBRsFS6DA', '\n')
('http://bolin.su.se/data/&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQwW4IHzAE&usg=AFQjCNEwuKR9IlFHwCgNQagBZt8NN8M9Iw', '\n')
('http://birt.actuate.com/products/ihub/data-access&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQwW4IITAF&usg=AFQjCNFAGC79QVuHPrw7M9pzzC7Jh_EYSw', '\n')
('http://www.lepoint.fr/technologie/video-le-big-data-jusqu-ou-18-03-2015-1913631_58.php&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQwW4IIzAG&usg=AFQjCNF_j4WlW_axSMjtpiONdh6OjlEaMQ', '\n')
('https://fr.wikipedia.org/wiki/Donn%25C3%25A9e&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQFgglMAc&usg=AFQjCNELfR-1pSA9e4KyzDCBx8SVtkMvyg', '\n')
('http://webcache.googleusercontent.com/search%3Fq%3Dcache:zXVlfFTefbsJ:https://fr.wikipedia.org/wiki/Donn%2525C3%2525A9e%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAgoMAc&usg=AFQjCNGuPAXHAqtRMSB8l7D9DoOFn3Ta4g', '\n')
('https://en.wikipedia.org/wiki/Data&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQFggqMAg&usg=AFQjCNHIINpuNGYzYlOWVUb628dcSnownw', '\n')
('http://webcache.googleusercontent.com/search%3Fq%3Dcache:n6Ofwm3_TzIJ:https://en.wikipedia.org/wiki/Data%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAgtMAg&usg=AFQjCNF8fbDR6kGbFRPBzkz20ZpjXE23JA', '\n')
('https://en.wikipedia.org/wiki/Data_(disambiguation)&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQ0gIILygAMAg&usg=AFQjCNGK7coMxJqmsREt19hEmLWR6QW4Ow', '\n')
('https://en.wikipedia.org/wiki/Data_(computing)&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQ0gIIMCgBMAg&usg=AFQjCNEudeiCi_0HFgdzj0KnJRhxIRPRPA', '\n')
('https://en.wikipedia.org/wiki/Metadata&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQ0gIIMSgCMAg&usg=AFQjCNFRY05jK0c4QakO-YFoTvPfn013IQ', '\n')
('https://en.wikipedia.org/wiki/Data_analysis&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQ0gIIMigDMAg&usg=AFQjCNEwtBoC4KyGymoijiJUcYfkgr1p6w', '\n')
('https://fr.wikipedia.org/wiki/Data_(homonymie)&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQFgg0MAk&usg=AFQjCNHxzrXByg4-rj2zllD2MCnkTDWe0g', '\n')
('http://webcache.googleusercontent.com/search%3Fq%3Dcache:chZvlvbLIsIJ:https://fr.wikipedia.org/wiki/Data_(homonymie)%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAg3MAk&usg=AFQjCNEI0IGMlEht_Lc1l6aftJ2ZThbgEg', '\n')
('https://www.youtube.com/user/datagueule&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQFgg5MAo&usg=AFQjCNHgpxg20cdG4wnoULcRirJJtNurJA', '\n')
('http://webcache.googleusercontent.com/search%3Fq%3Dcache:OMKbWLSVB4QJ:https://www.youtube.com/user/datagueule%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAg8MAo&usg=AFQjCNEybm3Zwr346unQx-7oTk92Vq_V9g', '\n')
('http://www.data.com/&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQFgg_MAs&usg=AFQjCNE_K3RocyeXQFhYWa4tlNL19sKAXQ', '\n')
('http://webcache.googleusercontent.com/search%3Fq%3Dcache:HikntWD5aqMJ:http://www.data.com/%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAhCMAs&usg=AFQjCNGcB8SlqjU0tsxSEmJ9Bcgp70hAcw', '\n')
('http://data.worldbank.org/&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQFghFMAw&usg=AFQjCNH2NwwJkUkGvN6oCOGVSJ4OIolarw', '\n')
('http://webcache.googleusercontent.com/search%3Fq%3Dcache:BuQHDbbGLT0J:http://data.worldbank.org/%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAhIMAw&usg=AFQjCNFHNBkzsNR71hTX9t3rNwbGbrMxdw', '\n')
('http://data.bnf.fr/&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQFghLMA0&usg=AFQjCNEvZ5gWO0hOX_PQFj3eYUv3OdMXMA', '\n')
('http://webcache.googleusercontent.com/search%3Fq%3Dcache:z8MGwIoF1bkJ:http://data.bnf.fr/%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAhOMA0&usg=AFQjCNGTJvbzKA1PEa3jH9fa-bizChljhA', '\n')
('https://www.facebook.com/0data0/&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQFghRMA4&usg=AFQjCNEwUYPG6WJvzbaU2lwk8-2z9398_Q', '\n')
('http://webcache.googleusercontent.com/search%3Fq%3Dcache:rT9_WJoHdrYJ:https://www.facebook.com/0data0/%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAhUMA4&usg=AFQjCNETnGcEqlHE7wmGT7AEgTweVMHBqw', '\n')

For example, when I paste this link into the browser:

http://webcache.googleusercontent.com/search%3Fq%3Dcache:rT9_WJoHdrYJ:https://www.facebook.com/0data0/%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAhUMA4&usg=AFQjCNETnGcEqlHE7wmGT7AEgTweVMHBqw

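the browser tells me that the page was not found.

I am asking because, as a regular user, when I type something into Google it gives me links that take me to the page I need, whereas with these links I cannot reach the pages. (I also intended to save the links to a file, but the result is just as messy and unreadable.) How do I parse the links correctly?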

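1 answer:

Answer 0 (score: 2)

Use the following condition, which skips the Google cache links and keeps only the part of the href between `/url?q=` and the first `&`: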

# your code
if link['href'].startswith('/url?q=') \
    and 'webcache.googleusercontent.com' not in link['href']:
    # drop the '/url?q=' prefix and the trailing '&sa=...&ved=...&usg=...' tracking parameters
    print(link['href'].split('/url?q=')[1].split('&')[0])
    # your code
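
Splitting on `&` removes the tracking parameters, but the target is still percent-encoded as it was inside the redirect (for example `Donn%25C3%25A9e` in the output above). A possible alternative, shown here only as a sketch for Python 3, is to let `urllib.parse` read the `q` parameter, which also decodes it once. The helper name `extract_target` is made up for this example, and the sample href is one of the links from the question with the `/url?q=` prefix put back.

```python
from urllib.parse import urlparse, parse_qs


def extract_target(href):
    """Return the destination URL from a Google '/url?q=...' href, or None.

    Hypothetical helper for illustration; not part of any library.
    """
    parsed = urlparse(href)
    if parsed.path != "/url":
        return None  # not a result redirect (e.g. navigation links)
    # parse_qs splits the query on '&' and percent-decodes each value once
    targets = parse_qs(parsed.query).get("q")
    if not targets:
        return None
    target = targets[0]
    if "webcache.googleusercontent.com" in target:
        return None  # skip Google's cached copies, as in the condition above
    return target


# One of the hrefs from the question, with the '/url?q=' prefix restored
sample = (
    "/url?q=https://fr.wikipedia.org/wiki/Donn%25C3%25A9e"
    "&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQFgglMAc"
    "&usg=AFQjCNELfR-1pSA9e4KyzDCBx8SVtkMvyg"
)
print(extract_target(sample))
```

In the loop from the question this would replace the `startswith` check: call `extract_target(link.get('href', ''))` and print or save the result only when it is not `None`.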