Question

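Google Webmaster Tools is showing some very strange results for my client's Magento website. On the Index Status page we have 1,911 pages indexed, which seems about right. But when we click the Advanced tab, it shows that 6,947 URLs are blocked. My question is: how can 6,947 URLs be blocked when the site only has about 1,911 pages indexed?

I read somewhere that the blocked URLs could be duplicate images within Magento. That makes sense to me, since we have a lot of duplicate images in the system, but I'm not sure whether that is the reason for the blocked URLs.

Another problem could be the robots.txt file itself. I had a look at the file and everything seems fine, but each line such as 'Disallow: /404/' might be pointing in the wrong direction.

The client's website is located in a folder off the server root, inside 'public_html', so I think the '/404/' part might be resolved against the root. Would I therefore have to add the website folder name to the beginning of each line in the robots.txt file, for example: /[Folder_Name]/404/?

Any help with this would be much appreciated; I feel like I've hit a wall. I think the Magento version is 1.5, if that helps.

Thanks again for any help.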

robots.txt file:

User-agent: *

Allow: /
Sitemap: http://www.websitename/sitemap.xml

# Directories
Disallow: /404/
Disallow: /app/
Disallow: /cgi-bin/
Disallow: /downloader/
Disallow: /includes/
Disallow: /js/
Disallow: /lib/
Disallow: /magento/
Disallow: /pkginfo/
Disallow: /report/
Disallow: /skin/
Disallow: /stats/
Disallow: /var/
# Paths (clean URLs)
Disallow: /index.php/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
Disallow: /catalogsearch/
Disallow: /checkout/
Disallow: /control/
Disallow: /contacts/
Disallow: /customer/
Disallow: /customize/
Disallow: /newsletter/
Disallow: /poll/
Disallow: /review/
Disallow: /sendfriend/
Disallow: /tag/
Disallow: /wishlist/
# Files
Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /STATUS.txt
# Paths (no clean URLs)
Disallow: /*?p=*&
Disallow: /*?SID=
Disallow: /*?invis=
Disallow: /*?tag=
Disallow: /*?osCsid=
Disallow: /*?manufacturers_id=
Disallow: /*?currency=

Answer 1

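How your server handles folders doesn't matter. Rules in robots.txt are matched against the URL path as seen from the host, not against the directory layout on disk.

If your robots.txt is accessible at http://example.com/robots.txt, a rule like Disallow: /404/ blocks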

http://example.com/404/
http://example.com/404/foo
http://example.com/404/foo/bar
etc.
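
As a rough illustration of this prefix matching, here is a minimal sketch using Python's standard-library urllib.robotparser. The helper name is_allowed and the example.com URLs are made up for the example, and this parser is not Google's, so treat it only as an approximation of how a crawler reads the rule.

```python
import urllib.robotparser

# A trimmed-down record containing only the rule discussed above.
ROBOTS_TXT = """User-agent: *
Disallow: /404/
"""

def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if user_agent may fetch url under the given robots.txt text.

    Hypothetical helper for this example, built on the stdlib parser.
    """
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# The rule is compared with the start of the URL path, so everything
# below /404/ is blocked, whatever folder the site lives in on disk.
for url in [
    "http://example.com/404/",
    "http://example.com/404/foo",
    "http://example.com/404/foo/bar",
    "http://example.com/some-category/some-product.html",
]:
    status = "allowed" if is_allowed(ROBOTS_TXT, url) else "blocked"
    print(url, "->", status)
```

The first three URLs should come out as blocked and the last one as allowed, because only the path after the host name is compared with the rule.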

Also note that you must not have blank lines inside a record, so

User-agent: *

Allow: /
Sitemap: http://www.websitename/sitemap.xml

# Directories
Disallow: /404/

should be:

User-agent: *
Allow: /
Sitemap: http://www.websitename/sitemap.xml
# Directories
Disallow: /404/
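
To see why a blank line inside a record can matter, here is a small sketch with Python's standard-library urllib.robotparser, which treats an empty line as the end of a record. The record is trimmed down to a single Disallow rule, and the helper name is_allowed is made up for the example. Other parsers, Google's included, may be more tolerant, so this only shows that the blank line is risky, not how every crawler behaves.

```python
import urllib.robotparser

def is_allowed(robots_txt, url, user_agent="*"):
    """Hypothetical helper: True if user_agent may fetch url under robots_txt."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Same rule twice; the first version has a blank line after User-agent.
WITH_BLANK_LINE = "User-agent: *\n\nDisallow: /404/\n"
WITHOUT_BLANK_LINE = "User-agent: *\nDisallow: /404/\n"

url = "http://example.com/404/foo"

# With the blank line, this parser closes the record before it reaches
# the Disallow line, so the rule is dropped and the URL stays allowed.
print("with blank line:   ", is_allowed(WITH_BLANK_LINE, url))

# Without the blank line, the rule belongs to the record and applies.
print("without blank line:", is_allowed(WITHOUT_BLANK_LINE, url))
```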

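It also seems you don't need Allow: / (it is not part of the original robots.txt specification, and even for parsers that understand Allow, anything that is not disallowed is allowed by default).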

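"Now the question that I need to ask is how can 6,947 URLs be blocked when the website only has about 1,911 pages indexed?"

I can't follow you here. Google might still index blocked URLs without crawling them, but that is not the case for all blocked URLs, so the number of indexed URLs typically doesn't include all blocked URLs. Since Google is not allowed to visit/crawl blocked URLs, it cannot find out by crawling whether these URLs exist or how many there are. Google learns about these URLs when it finds links to them (from within your own site as well as from external sites).

So if there are 100 links to different URLs whose paths start with /poll/, Google might list these 100 URLs as blocked.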

Answer 2

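Your sitemap may conflict with your robots.txt file.

Google tries to index everything in the sitemap.xml file, but finds that it cannot index the pages that the robots.txt file blocks.

In my case (www.workwearwebshop.nl), the sitemap contained pages starting with /catalog/product/view, which were blocked by robots.txt. If you comment out that line in robots.txt, Google can crawl those products. The remaining problem is that Magento should produce nicer URLs than these (starting with the category name instead of /catalog/product/view).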
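
One way to look for this kind of conflict is to check every sitemap URL against the robots.txt rules. The sketch below does that with Python's standard library. The function name blocked_sitemap_urls, the sample sitemap, and the example.com URLs are all made up for illustration. Note also that urllib.robotparser does plain prefix matching and does not interpret wildcard rules such as /*?SID= the way Google does, so it will miss conflicts caused by those lines.

```python
import urllib.robotparser
import xml.etree.ElementTree as ET

# XML namespace used by sitemap files that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def blocked_sitemap_urls(robots_txt, sitemap_xml, user_agent="*"):
    """Return the sitemap URLs that robots_txt disallows for user_agent."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]
    return [u for u in urls if not parser.can_fetch(user_agent, u)]

# Hypothetical sample data: one rule from the question, two sitemap entries.
robots_txt = """User-agent: *
Disallow: /catalog/product/view/
"""

sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/catalog/product/view/id/1</loc></url>
  <url><loc>http://example.com/shirts/blue-shirt.html</loc></url>
</urlset>
"""

for url in blocked_sitemap_urls(robots_txt, sitemap_xml):
    print("in sitemap but blocked by robots.txt:", url)
```

Any URL this reports is a candidate for either removing from the sitemap or unblocking in robots.txt.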

Magento - robots.txt may be blocking URLs in Google Webmaster Tools
