What is the right way to ban crawlers via .htaccess?

Time: 2019-08-10 10:44:56

Tags: .htaccess web-crawler

My company's (internal-use) website ended up indexed by Google when I added rules like this to .htaccess:

    RewriteBase /
    RewriteCond %{HTTPS} off
    RewriteCond %{HTTP:X-Forwarded-Proto} !https
    RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

With the following single rule,

robots.txt seems to be correct:

    User-agent: *
    Disallow: /

Am I doing something wrong?

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} Googlebot [OR]
    RewriteCond %{HTTP_USER_AGENT} AdsBot-Google [OR]
    RewriteCond %{HTTP_USER_AGENT} msnbot [OR]
    RewriteCond %{HTTP_USER_AGENT} AltaVista [OR]
    RewriteCond %{HTTP_USER_AGENT} BingPreview [OR]
    RewriteCond %{HTTP_USER_AGENT} spider [OR]
    RewriteCond %{HTTP_USER_AGENT} bingbot [OR]
    RewriteCond %{HTTP_USER_AGENT} DomainSONOCrawler [OR]
    RewriteCond %{HTTP_USER_AGENT} TelegramBot [OR]
    RewriteCond %{HTTP_USER_AGENT} Curl [OR]
    RewriteCond %{HTTP_USER_AGENT} WBSearchBot [OR]
    RewriteCond %{HTTP_USER_AGENT} Slurp [OR]

    RewriteCond %{HTTP_USER_AGENT} Mediapartners-Google [OR]
    RewriteCond %{HTTP_USER_AGENT} Googlebot-Video [OR]
    RewriteCond %{HTTP_USER_AGENT} Googlebot-News [OR]
    RewriteCond %{HTTP_USER_AGENT} Googlebot-Image [OR]
    RewriteCond %{HTTP_USER_AGENT} AdsBot-Google-Mobile [OR]
    RewriteCond %{HTTP_USER_AGENT} APIs-Google [OR]
    RewriteCond %{HTTP_USER_AGENT} AdsBot-Google-Mobile-Apps [OR]
    RewriteCond %{HTTP_USER_AGENT} FeedFetcher-Google [OR]
    RewriteCond %{HTTP_USER_AGENT} Google-Read-Aloud [OR]

    RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
    #... more entries, not showing you the whole list
    # as it may contain false positives, find them yourself.
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Zeus
    RewriteRule . - [F,L]

    RewriteBase /
    RewriteCond %{HTTPS} off
    RewriteCond %{HTTP:X-Forwarded-Proto} !https
    RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

0 answers:

No answers