How do I get all the HTML from a Google search page using requests?

Asked: 2015-01-30 00:17:59

Tags: python html parsing url python-requests

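I want to fetch the HTML from this URL ("https://www.google.com/search?q=urban+outfitters+facebook") so that I can parse it for all the links shown on the page and, ultimately, get the username from the first Facebook link that appears (https://www.facebook.com/urbanoutfitters).

I can get all the HTML I need from other pages using requests, but I can't seem to get all of the text from Google.

For example, see the code below: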

import requests

url = "https://www.google.com/search?q=urban+outfitters+facebook"
print(requests.get(url).text)

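Also, I looked at the API, but I think using requests is simpler. I was able to do this with Selenium, so I don't understand why I can't do it with requests.

Here is the response I get using requests: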

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description"><meta content="noodp" name="robots"><meta content="/images/google_favicon_128.png" itemprop="image"><title>Google</title><script>(function(){window.google={kEI:'DM7KVKwpivaCBJWRg6gL',kEXPI:'4010073,4011559,4020346,4020562,4020873,4021587,4021598,4024625,4025891,4027899,4028063,4028126,4028129,4028468,4028508,4028519,4028585,4028940,8300111,8500393,8500852,8501130,10200083,10200855,10200905',authuser:0,kSID:'DM7KVKwpivaCBJWRg6gL'};google.kHL='en';})();(function(){google.lc=[];google.li=0;google.getEI=function(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b||google.kEI};google.https=function(){return"https:"==window.location.protocol};google.ml=function(){};google.time=function(){return(new Date).getTime()};google.log=function(a,b,d,e,k){var c=new Image,h=google.lc,f=google.li,g="",l=google.ls||"";c.onerror=c.onload=c.onabort=function(){delete h[f]};h[f]=c;d||-1!=b.search("&ei=")||(e=google.getEI(e),g="&ei="+e,e!=google.kEI&&(g+="&lei="+google.kEI));a=d||"/"+(k||"gen_204")+"?atyp=i&ct="+a+"&cad="+b+g+l+"&zx="+google.time();/^http:/i.test(a)&&google.https()?(google.ml(Error("a"),!1,{src:a,glmm:1}),delete h[f]):(c.src=a,google.li=f+1)};google.y={};google.x=function(a,b){google.y[a.id]=[a,b];return!1};google.load=function(a,b,d){google.x({id:a+m++},function(){google.load(a,b,d)})};var m=0;})();google.kCSI={};var _gjwl=location;function _gjuc(){var a=_gjwl.href.indexOf("#");if(0<=a&&(a=_gjwl.href.substring(a),0<a.indexOf("&q=")||0<=a.indexOf("#q="))&&(a=a.substring(1),-1==a.indexOf("#"))){for(var d=0;d<a.length;){var b=d;"&"==a.charAt(b)&&++b;var 
c=a.indexOf("&",b);-1==c&&(c=a.length);b=a.substring(b,c);if(0==b.indexOf("fp="))a=a.substring(0,d)+a.substring(c,a.length),c=d;else if("cad=h"==b)return 0;d=c}_gjwl.href="/search?"+a+"&cad=h";return 1}return 0}
function _gjh(){!_gjuc()&&window.google&&google.x&&google.x({id:"GJH"},function(){google.nav&&google.nav.gjh&&google.nav.gjh()})};window._gjh&&_gjh();</script><style>#gbar,#guser{font-size:13px;padding-top:1px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:right}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0;position:absolute;top:24px;width:100%}@media all{.gb1{height:22px;margin-right:.5em;vertical-align:top}#gbar{float:left}}a.gb1,a.gb4{text-decoration:underline !important}a.gb1,a.gb4{color:#00c !important}.gbi .gb4{color:#dd8e27 !important}.gbf .gb4{color:#900 !important}</style><style>body,td,a,p,.h{font-family:arial,sans-serif}body{margin:0;overflow-y:scroll}#gog{padding:3px 8px 0}td{line-height:.8em}.gac_m td{line-height:17px}form{margin-bottom:20px}.h{color:#36c}.q{color:#00c}.ts td{padding:0}.ts{border-collapse:collapse}em{font-weight:bold;font-style:normal}.lst{height:25px;width:496px}.gsfi,.lst{font:18px arial,sans-serif}.gsfs{font:17px arial,sans-serif}.ds{display:inline-box;display:inline-block;margin:3px 0 4px;margin-left:4px}input{font-family:inherit}a.gb1,a.gb2,a.gb3,a.gb4{color:#11c !important}body{background:#fff;color:black}a{color:#11c;text-decoration:none}a:hover,a:active{text-decoration:underline}.fl a{color:#36c}a:visited{color:#551a8b}a.gb1,a.gb4{text-decoration:underline}a.gb3:hover{text-decoration:none}#ghead a.gb2:hover{color:#fff !important}.sblc{padding-top:5px}.sblc a{display:block;margin:2px 0;margin-left:13px;font-size:11px}.lsbb{background:#eee;border:solid 1px;border-color:#ccc #999 #999 #ccc;height:30px}.lsbb{display:block}.ftl,#fll a{display:inline-block;margin:0 12px}.lsb{background:url(/images/srpr/nav_logo80.png) 0 -258px repeat-x;border:none;color:#000;cursor:pointer;height:30px;margin:0;outline:0;font:15px arial,sans-serif;vertical-align:top}.lsb:active{background:#ccc}.lst:focus{outline:none}</style><script></script></head><body bgcolor="#fff"><script>(function(){var 
src='/images/nav_logo176.png';var iesg=false;document.body.onload = function(){window.n && window.n();if (document.images){new Image().src=src;}
if (!iesg){document.f&&document.f.q.focus();document.gbqf&&document.gbqf.q.focus();}
}
})();</script><div id="mngb">   <div id=gbar><nobr><b class=gb1>Search</b> <a class=gb1 href="https://www.google.com/imghp?hl=en&tab=wi">Images</a> <a class=gb1 href="https://maps.google.com/maps?hl=en&tab=wl">Maps</a> <a class=gb1 href="https://play.google.com/?hl=en&tab=w8">Play</a> <a class=gb1 href="https://www.youtube.com/?tab=w1">YouTube</a> <a class=gb1 href="https://news.google.com/nwshp?hl=en&tab=wn">News</a> <a class=gb1 href="https://mail.google.com/mail/?tab=wm">Gmail</a> <a class=gb1 href="https://drive.google.com/?tab=wo">Drive</a> <a class=gb1 style="text-decoration:none" href="http://www.google.com/intl/en/options/"><u>More</u> &raquo;</a></nobr></div><div id=guser width=100%><nobr><span id=gbn class=gbi></span><span id=gbf class=gbf></span><span id=gbe></span><a href="http://www.google.com/history/optout?hl=en" class=gb4>Web History</a> | <a  href="/preferences?hl=en" class=gb4>Settings</a> | <a target=_top id=gb_70 href="https://accounts.google.com/ServiceLogin?hl=en&continue=https://www.google.com/" class=gb4>Sign in</a></nobr></div><div class=gbh style=left:0></div><div class=gbh style=right:0></div>  </div><center><span id="prt" style="display:block"> <div><style>.pmoabs{background-color:#fff;border:1px solid #E5E5E5;color:#666;font-size:13px;padding-bottom:20px;position:absolute;right:2px;top:3px;z-index:986}#pmolnk{border-radius:2px;-moz-border-radius:2px;-webkit-border-radius:2px}.kd-button-submit{border:1px solid #3079ed;background-color:#4d90fe;background-image:-webkit-gradient(linear,left top,left 
bottom,from(#4d90fe),to(#4787ed));background-image:-webkit-linear-gradient(top,#4d90fe,#4787ed);background-image:-moz-linear-gradient(top,#4d90fe,#4787ed);background-image:-ms-linear-gradient(top,#4d90fe,#4787ed);background-image:-o-linear-gradient(top,#4d90fe,#4787ed);background-image:linear-gradient(top,#4d90fe,#4787ed);filter:progid:DXImageTransform.Microsoft.gradient(startColorStr='#4d90fe',EndColorStr='#4787ed')}.kd-button-submit:hover{border:1px solid #2f5bb7;background-color:#357ae8;background-image:-webkit-gradient(linear,left top,left bottom,from(#4d90fe),to(#357ae8));background-image:-webkit-linear-gradient(top,#4d90fe,#357ae8);background-image:-moz-linear-gradient(top,#4d90fe,#357ae8);background-image:-ms-linear-gradient(top,#4d90fe,#357ae8);background-image:-o-linear-gradient(top,#4d90fe,#357ae8);background-image:linear-gradient(top,#4d90fe,#357ae8);filter:progid:DXImageTransform.Microsoft.gradient(startColorStr='#4d90fe',EndColorStr='#357ae8')}.kd-button-submit:active{-webkit-box-shadow:inset 0 1px 2px rgba(0,0,0,0.3);-moz-box-shadow:inset 0 1px 2px rgba(0,0,0,0.3);box-shadow:inset 0 1px 2px rgba(0,0,0,0.3)}#pmolnk a{color:#fff;display:inline-block;font-weight:bold;padding:5px 20px;text-decoration:none;white-space:nowrap}.xbtn{color:#999;cursor:pointer;font-size:23px;line-height:5px;padding-top:5px}.padi{padding:0 8px 0 10px}.padt{padding:5px 20px 0 0;color:#444}.pads{text-align:left;max-width:200px}</style> <div class="pmoabs" id="pmocntr2" style="behavior:url(#default#userdata);display:none"> <table border="0"> <tr> <td colspan="2"> <div class="xbtn" onclick="google.promos&&google.promos.toast&& google.promos.toast.cpc()" style="float:right">&times;</div> </td> </tr> <tr> <td class="padi" rowspan="2"> <img src="/images/icons/product/chrome-48.png"> </td> <td class="pads">A faster way to browse the web</td> </tr> <tr> <td class="padt"> <div class="kd-button-submit" id="pmolnk"> <a 
href="/chrome/index.html?hl=en&amp;brand=CHNG&amp;utm_source=en-hpp&amp;utm_medium=hpp&amp;utm_campaign=en" onclick="google.promos&&google.promos.toast&& google.promos.toast.cl()">Install Google Chrome</a> </div> </td> </tr> </table> </div> <script type="text/javascript">(function(){var a={o:{}};a.o.Pa=50;a.o.Oa=10;a.o.ca="body";a.o.La=!0;a.o.Ea=function(b,c){var d=a.o.Ja();a.o.Ka(d,b,c);a.o.Na(d);a.o.La&&a.o.Ma(d)};a.o.Na=function(b){(b=a.o.ba(b))&&0<b.forms.length&&b.forms[0].submit()};a.o.Ja=function(){var b=document.createElement("iframe");b.height=0;b.width=0;b.style.overflow="hidden";b.style.top=b.style.left="-100px";b.style.position="absolute";document.body.appendChild(b);return b};a.o.ba=function(b){return b.contentDocument||b.contentWindow.document};a.o.Ka=function(b,c,d){b=a.o.ba(b);b.open();d=["<",a.o.ca,'><form method=POST action="',d,'">'];for(var e in c)c.hasOwnProperty(e)&&d.push('<textarea name="',e,'">',c[e],"</textarea>");d.push("</form></",a.o.ca,">");b.write(d.join(""));b.close()};a.o.aa=function(b,c){c>a.o.Oa?google&&google.ml&&google.ml(Error("ogcdr"),!1,{cause:"timeout"}):b.contentWindow?a.o.Qa(b):window.setTimeout(function(){a.o.aa(b,c+1)},a.o.Pa)};a.o.Qa=function(b){document.body.removeChild(b)};a.o.Ma=function(b){a.o.Ra(b,"load",function(){a.o.aa(b,0)})};a.o.Ra=function(b,c,d){b.addEventListener?b.addEventListener(c,d,!1):b.attachEvent&&b.attachEvent("on"+c,d)};var m={Va:0,D:1,F:2,S:5};a.k={};a.k.T={Ha:"i",X:"d",Ia:"l"};a.k.A={R:"0",H:"1"};a.k.U={O:1,X:2,P:3};a.k.w={ta:"a",wa:"g",C:"c",za:"u",ya:"t",R:"p",xa:"pid",va:"eid",Aa:"at"};a.k.Ca=window.location.protocol+"//www.google.com/_/og/promos/";a.k.Ba="g";a.k.Da="z";a.k.Q=function(b,c,d,e){var f=null;switch(c){case m.D:f=window.gbar.up.gpd(b,d,!0);break;case m.S:f=window.gbar.up.gcc(e)}return null==f?0:parseInt(f,10)};a.k.ia=function(b,c,d){return c==m.D?null!=window.gbar.up.gpd(b,d,!0):!1};a.k.V=function(b,c,d,e,f,h,k,l){var 
g={};g[a.k.w.R]=b;g[a.k.w.wa]=c;g[a.k.w.ta]=d;g[a.k.w.Aa]=e;g[a.k.w.va]=f;g[a.k.w.xa]=1;k&&(g[a.k.w.C]=k);l&&(g[a.k.w.za]=l);if(h)g[a.k.w.ya]=h;else return google.ml(Error("knu"),!1,{cause:"Token is not found"}),null;return g};a.k.W=function(b,c,d){if(b){var e=c?a.k.Ba:a.k.Da;c&&d&&(e+="?authuser="+d);a.o.Ea(b,a.k.Ca+e)}};a.k.Ga=function(b,c,d,e,f,h,k){b=a.k.V(c,b,a.k.T.X,a.k.U.X,d,f,null,e);a.k.W(b,h,k)};a.k.Fa=function(b,c,d,e,f,h,k){b=a.k.V(c,b,a.k.T.Ha,a.k.U.O,d,f,e,null);a.k.W(b,h,k)};a.k.la=function(b,c,d,e,f,h,k,l,g,n){switch(c){case m.S:window.gbar.up.dpc(e,f);break;case m.D:window.gbar.up.spd(b,d,1,!0);break;case m.F:g=g||!1,l=l||"",h=h||0,k=k||a.k.A.H,n=n||0,a.k.Ga(e,h,k,f,l,g,n)}};a.k.ja=function(b,c,d,e,f){return c==m.D?0<d&&a.k.Q(b,c,e,f)>=d:!1};a.k.ga=function(b,c,d,e,f,h,k,l,g,n){switch(c){case m.S:window.gbar.up.iic(e,f);break;case m.D:c=a.k.Q(b,c,d,e)+1;window.gbar.up.spd(b,d,c.toString(),!0);break;case m.F:g=g||!1,l=l||"",h=h||0,k=k||a.k.A.R,n=n||0,a.k.Fa(e,h,k,1,l,g,n)}};a.k.na=function(b,c,d,e,f,h){b=a.k.V(c,b,a.k.T.Ia,a.k.U.P,d,e,null,null);a.k.W(b,f,h)};var p={Ta:"a",Wa:"l",Ua:"c",ka:"d",P:"h",O:"i",gb:"n",H:"x",cb:"ma",eb:"mc",fb:"mi",Xa:"pa",Ya:"pc",$a:"pi",bb:"pn",ab:"px",Za:"pd",hb:"gpa",jb:"gpi",kb:"gpn",lb:"gpx",ib:"gpd"};a.i={};a.i.v={oa:"hplogo",Sa:"pmocntr2"};a.i.A={ea:"0",H:"1",ma:"2"};a.i.p=document.getElementById(a.i.v.Sa);a.i.pa=16;a.i.qa=2;a.i.ra=20;google.promos=google.promos||{};google.promos.toast=google.promos.toast||{};a.i.G=function(b){a.i.p&&(a.i.p.style.display=b?"":"none",a.i.p.parentNode&&(a.i.p.parentNode.style.position=b?"relative":""))};a.i.$=function(b){try{if(a.i.p&&b&&b.es&&b.es.m){var 
c=window.gbar.rtl(document.body)?"left":"right";a.i.p.style[c]=b.es.m-a.i.pa+a.i.qa+"px";a.i.p.style.top=a.i.ra+"px"}}catch(d){google.ml(d,!1,{cause:a.i.s+"_PT"})}};google.promos.toast.cl=function(){try{a.i.I==m.F&&a.k.na(a.i.J,a.i.B,a.i.A.ma,a.i.N,a.i.L,a.i.M),window.gbar.up.sl(a.i.B,a.i.s,p.P,a.i.K(),1)}catch(b){google.ml(b,!1,{cause:a.i.s+"_CL"})}};google.promos.toast.cpc=function(){try{a.i.p&&(a.i.G(!1),a.k.la(a.i.p,a.i.I,a.i.v.Y,a.i.J,a.i.da,a.i.B,a.i.A.H,a.i.N,a.i.L,a.i.M),window.gbar.up.sl(a.i.B,a.i.s,p.ka,a.i.K(),1))}catch(b){google.ml(b,!1,{cause:a.i.s+"_CPC"})}};a.i.Z=function(){try{if(a.i.p){var b=276,c=document.getElementById(a.i.v.oa);c&&(b=Math.max(b,c.offsetWidth));var d=parseInt(a.i.p.style.right,10)||0;a.i.p.style.visibility=2*(a.i.p.offsetWidth+d)+b>document.body.clientWidth?"hidden":""}}catch(e){google.ml(e,!1,{cause:a.i.s+"_HOSW"})}};a.i.fa=function(){var b=["gpd","spd","aeh","sl"];if(!window.gbar||!window.gbar.up)return!1;for(var c=0,d;d=b[c];c++)if(!(d in window.gbar.up))return!1;return!0};a.i.ha=function(){return a.i.p.currentStyle&&"absolute"!=a.i.p.currentStyle.position};google.promos.toast.init=function(b,c,d,e,f,h,k,l,g,n,q,r){try{a.i.fa()?a.i.p&&(e==m.F&&!l==!g?(google.ml(Error("tku"),!1,{cause:"zwieback: "+g+", gaia: "+l}),a.i.G(!1)):(a.i.v.C="toast_count_"+c+(q?"_"+q:""),a.i.v.Y="toast_dp_"+c+(r?"_"+r:""),a.i.s=d,a.i.B=b,a.i.I=e,a.i.J=c,a.i.da=f,a.i.N=l?l:g,a.i.L=!!l,a.i.M=k,a.k.ia(a.i.p,e,a.i.v.Y,c)||a.k.ja(a.i.p,e,h,a.i.v.C,c)||a.i.ha()?a.i.G(!1):(a.k.ga(a.i.p,e,a.i.v.C,c,f,a.i.B,a.i.A.ea,a.i.N,a.i.L,a.i.M),n||(window.gbar.up.aeh(window,"resize",a.i.Z),window.lol=
a.i.Z,window.gbar.elr&&a.i.$(window.gbar.elr()),window.gbar.elc&&window.gbar.elc(a.i.$),a.i.G(!0)),window.gbar.up.sl(a.i.B,a.i.s,p.O,a.i.K())))):google.ml(Error("apa"),!1,{cause:a.i.s+"_INIT"})}catch(t){google.ml(t,!1,{cause:a.i.s+"_INIT"})}};a.i.K=function(){var b=a.k.Q(a.i.p,a.i.I,a.i.v.C,a.i.J);return"ic="+b};})();</script> <script type="text/javascript">(function(){var sourceWebappPromoID=144002;var sourceWebappGroupID=5;var payloadType=5;var cookieMaxAgeSec=2592000;var dismissalType=5;var impressionCap=25;var gaiaXsrfToken='';var zwbkXsrfToken='';var kansasDismissalEnabled=false;var sessionIndex=0;var invisible=false;window.gbar&&gbar.up&&gbar.up.r&&gbar.up.r(payloadType,function(show){if (show){google.promos.toast.init(sourceWebappPromoID,sourceWebappGroupID,payloadType,dismissalType,cookieMaxAgeSec,impressionCap,sessionIndex,gaiaXsrfToken,zwbkXsrfToken,invisible,'0612');}
});})();</script> </div> </span><br clear="all" id="lgpd"><div id="lga"><img alt="Google" height="95" src="/images/srpr/logo9w.png" style="padding:28px 0 14px" width="269" id="hplogo" onload="window.lol&&lol()"><br><br></div><form action="/search" name="f"><table cellpadding="0" cellspacing="0"><tr valign="top"><td width="25%">&nbsp;</td><td align="center" nowrap=""><input name="ie" value="ISO-8859-1" type="hidden"><input value="en" name="hl" type="hidden"><input name="source" type="hidden" value="hp"><div class="ds" style="height:32px;margin:4px 0"><input style="color:#000;margin:0;padding:5px 8px 0 6px;vertical-align:top" autocomplete="off" class="lst" value="" title="Google Search" maxlength="2048" name="q" size="57"></div><br style="line-height:0"><span class="ds"><span class="lsbb"><input class="lsb" value="Google Search" name="btnG" type="submit"></span></span><span class="ds"><span class="lsbb"><input class="lsb" value="I'm Feeling Lucky" name="btnI" onclick="if(this.form.q.value)this.checked=1; else top.location='/doodles/'" type="submit"></span></span></td><td class="fl sblc" align="left" nowrap="" width="25%"><a href="/advanced_search?hl=en&amp;authuser=0">Advanced search</a><a href="/language_tools?hl=en&amp;authuser=0">Language tools</a></td></tr></table><input id="gbv" name="gbv" type="hidden" value="1"></form><div id="gac_scont"></div><div style="font-size:83%;min-height:3.5em"><br></div><span id="footer"><div style="font-size:10pt"><div style="margin:19px auto;text-align:center" id="fll"><a href="/intl/en/ads/">Advertising&nbsp;Programs</a><a href="/services/">Business Solutions</a><a href="https://plus.google.com/116899029375914044550" rel="publisher">+Google</a><a href="/intl/en/about.html">About Google</a></div></div><p style="color:#767676;font-size:8pt">&copy; 2015 - <a href="/intl/en/policies/privacy/">Privacy</a> - <a href="/intl/en/policies/terms/">Terms</a></p></span></center><div id="xjsd"></div><div id="xjsi" 
data-jiis="bp"><script>(function(){function c(b){window.setTimeout(function(){var a=document.createElement("script");a.src=b;document.getElementById("xjsd").appendChild(a)},0)}google.dljp=function(b,a){google.xjsu=b;c(a)};google.dlj=c;})();(function(){window.google.xjsrm=[];})();if(google.y)google.y.first=[];if(!google.xjs){window._=window._||{};window._._DumpException=function(e){throw e};if(google.timers&&google.timers.load.t){google.timers.load.t.xjsls=new Date().getTime();}google.dljp('/xjs/_/js/k\x3dxjs.hp.en_US.4dB-kXZgo4g.O/m\x3dsb_he,d/rt\x3dj/d\x3d1/t\x3dzcms/rs\x3dACT90oFyTgnV60GhNLdstOIcFET3IVANCA','/xjs/_/js/k\x3dxjs.hp.en_US.4dB-kXZgo4g.O/m\x3dsb_he,d/rt\x3dj/d\x3d1/t\x3dzcms/rs\x3dACT90oFyTgnV60GhNLdstOIcFET3IVANCA');google.xjs=1;}google.pmc={"sb_he":{"agen":true,"cgen":true,"client":"heirloom-hp","dh":true,"ds":"","exp":"msedr","fl":true,"host":"google.com","jam":0,"jsonp":true,"msgs":{"cibl":"Clear Search","dym":"Did you mean:","lcky":"I\u0026#39;m Feeling Lucky","lml":"Learn more","oskt":"Input tools","psrc":"This search was removed from your \u003Ca href=\"/history\"\u003EWeb History\u003C/a\u003E","psrl":"Remove","sbit":"Search by image","srch":"Google Search"},"ovr":{},"pq":"","refoq":true,"scd":10,"sce":5,"stok":"iXw-xWnUXlH7Fp6SrUErmgr3X8g"},"d":{}};google.y.first.push(function(){if(google.med){google.med('init');google.initHistory();google.med('history');}});if(google.j&&google.j.en&&google.j.xi){window.setTimeout(google.j.xi,0);}
</script></div></body></html>

3 Answers:

Answer 0 (score: 3)

Replace the beginning of your URL with:

http://ajax.googleapis.com/ajax/services/search/web?v=1.0

It should now look like this: http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=urban+outfitters+facebook

With Python's json parser, you can then retrieve the first URL.

import requests
import json

# Query the JSON endpoint instead of the HTML results page.
url = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=urban+outfitters+facebook"
google_result = json.loads(requests.get(url).text)
print(google_result["responseData"]["results"][0]["url"])
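The direct lookup above fails with a TypeError or IndexError when the service returns no results. Here is a minimal defensive sketch, assuming the response has the shape indexed above (responseData, then results, then url) and that a failed request carries a null responseData. The helper name first_result_url and both sample payloads are hypothetical and only illustrate the lookup.

```python
import json


def first_result_url(payload):
    """Return the URL of the first result in a decoded response, or None."""
    data = payload.get("responseData") or {}  # a null responseData becomes {}
    results = data.get("results") or []
    return results[0].get("url") if results else None


# Hypothetical payloads shaped like the response indexed in the answer.
ok = json.loads(
    '{"responseData": {"results": [{"url": "https://www.facebook.com/urbanoutfitters"}]}}'
)
failed = json.loads('{"responseData": null, "responseStatus": 403}')

print(first_result_url(ok))
print(first_result_url(failed))
```

With a live response, the same helper would be called as first_result_url(requests.get(url).json()).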

Answer 1 (score: 0)

Since Laurent's old answer no longer works (and I didn't want to use an API), I looked for another approach and found this one here: https://hackernoon.com/how-to-scrape-google-with-python-bo7d2tal

import requests
from bs4 import BeautifulSoup

query = "hackernoon How To Scrape Google With Python"
query = query.replace(' ', '+')
URL = f"https://google.com/search?q={query}"
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"

headers = {"user-agent" : USER_AGENT}
resp = requests.get(URL, headers=headers)

if resp.status_code == 200:
    soup = BeautifulSoup(resp.content, "html.parser")

From there, you can use soup to extract everything else from the HTML.
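As one illustration of that extraction step, the sketch below collects every link on a page. It deliberately uses only the standard library's html.parser so that it runs without bs4 installed; with a BeautifulSoup object the rough equivalent would be [a["href"] for a in soup.find_all("a", href=True)]. The names LinkCollector and extract_links and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors that have no href attribute
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Hypothetical markup standing in for a fetched page.
sample = (
    '<div><a href="https://www.facebook.com/urbanoutfitters/">UO</a>'
    '<a name="x">no href</a>'
    '<a href="/preferences?hl=en">Settings</a></div>'
)
print(extract_links(sample))
```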

Answer 2 (score: 0)

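Make sure you send a user-agent header. Without one, Google sees your script as python-requests, and that is the problem you are running into. You need a user-agent that mimics a real browser visit. List of user agents.

Code and an example in the online IDE: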
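A small sketch of preparing such a request, assuming the query should be URL-encoded rather than pasted into the URL by hand. It only builds the URL and the headers and sends nothing; the helper name build_search_request and the shortened user-agent string are hypothetical placeholders.

```python
from urllib.parse import urlencode


def build_search_request(query, user_agent):
    """Return (url, headers) for a search request with a browser-like user-agent."""
    # urlencode escapes the query and turns spaces into '+'.
    url = "https://www.google.com/search?" + urlencode({"q": query})
    headers = {"User-Agent": user_agent}
    return url, headers


# Placeholder user-agent; substitute a full, current browser string.
url, headers = build_search_request(
    "urban outfitters facebook",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
)
print(url)
print(headers)
```

The resulting pair can be passed straight to requests.get(url, headers=headers).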

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=urban outfitters facebook', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

for result in soup.select('.yuRUbf'):
  title = result.select_one('.DKV0Md').text
  link = result.select_one('a')['href']
  print(f'{title}\n{link}\n')

Output:

Urban Outfitters - Verified Page | Facebook
https://www.facebook.com/urbanoutfitters/

Urban Outfitters - Home | Facebook
https://www.facebook.com/urbanoutfitterseurope/

Urban Outfitters - Facebook - Urban Outfitters - Blog
https://blog.urbanoutfitters.com/facebook

Contact Us - Urban Outfitters
https://www.urbanoutfitters.com/en-gb/help/contact-us

Frank & Funny Facebook Card | Urban Outfitters
https://www.urbanoutfitters.com/shop/frank--funny-facebook-card?color=&parentid=SALE_APT&quantity=1&type=REGULAR

Urban Outfitters
https://www.urbanoutfitters.com/

Urban Outfitters (@urbanoutfitters) • Instagram photos and ...
https://www.instagram.com/urbanoutfitters/
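To get from a list of links like this to the Facebook username the question asks for, one option is to take the first path segment of the first facebook.com link. This is only a sketch under that assumption (it does not handle URLs such as profile.php?id=...), and the helper name facebook_username is hypothetical.

```python
from urllib.parse import urlparse


def facebook_username(links):
    """Return the first path segment of the first facebook.com link, or None."""
    for link in links:
        parsed = urlparse(link)
        host = parsed.netloc.lower()
        if host == "facebook.com" or host.endswith(".facebook.com"):
            segments = [s for s in parsed.path.split("/") if s]
            if segments:
                return segments[0]
    return None


# A few of the links from the output above.
links = [
    "https://www.urbanoutfitters.com/",
    "https://www.facebook.com/urbanoutfitters/",
    "https://www.facebook.com/urbanoutfitterseurope/",
]
print(facebook_username(links))
```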

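Alternatively, you can get these results with the Google Search Engine Results API from SerpApi. It is a paid API with a free trial of 5,000 searches.

If you only want the link to the raw HTML, you can read ['search_metadata']['raw_html_file'], or simply print(results) to get all of the data.

Code to integrate: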

import json
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "urban outfitters facebook",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

html = results['search_metadata']['raw_html_file']
print(f'Raw HTML: {html}')

print(json.dumps(results, indent = 2, ensure_ascii = False))

Output:

https://serpapi.com/searches/f4b6c93572fd22f8/609cb4f5c640d28919b34dad.html

...
"organic_results": [
  {
    "position": 1,
    "title": "Urban Outfitters - Verified Page | Facebook",
    "link": "https://www.facebook.com/urbanoutfitters/",
    "displayed_link": "https://www.facebook.com › ... › Clothing (Brand)",
    "snippet": "Urban Outfitters. 2187043 likes · 4157 talking about this · 169423 were here. Visit us at www.urbanoutfitters.com. Always open, always awesome.",
    "sitelinks": {
      "expanded": [
        {
          "title": "Urban Outfitters",
          "link": "https://en-gb.facebook.com/urbanoutfitters",
          "snippet": "Urban Outfitters. 2187043 likes · 4259 talking about this ..."
        },
        {
          "title": "Instagram",
          "link": "https://www.facebook.com/urbanoutfitters/app/168188869963563/",
          "snippet": "Block Page. More. Send Message. See more of Urban Outfitters on ..."
        },
        {
          "title": "About",
          "link": "https://www.facebook.com/urbanoutfitters/about/",
          "snippet": "The official Facebook page of Urban Outfitters. Questions ..."
        },
        {
          "title": "Events",
          "link": "https://www.facebook.com/urbanoutfitters/events/",
          "snippet": "Urban Outfitters does not have any upcoming events. Past Events ..."
        }
      ]
    }
  }
]...
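Given a results dictionary shaped like the output above, the links can be pulled out of organic_results directly. A minimal sketch, assuming only the keys visible in that output; the helper name organic_links is hypothetical and the sample dictionary is a trimmed copy of the first result.

```python
def organic_links(results):
    """Return the "link" of each organic result, in the order given."""
    return [r["link"] for r in results.get("organic_results", []) if "link" in r]


# Trimmed copy of the first organic result shown in the output above.
sample_results = {
    "organic_results": [
        {
            "position": 1,
            "title": "Urban Outfitters - Verified Page | Facebook",
            "link": "https://www.facebook.com/urbanoutfitters/",
        }
    ]
}
print(organic_links(sample_results))
```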

Disclaimer: I work for SerpApi.
