How do I crawl a specific website with Apache Nutch?

Time: 2016-01-12 12:44:43

Tags: apache nutch

I successfully followed the tutorial at the URL below, up to the "Step-by-Step: Invertlinks" section:

https://wiki.apache.org/nutch/NutchTutorial#Crawl_your_first_website

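But I did not get any data out of it.

I am new to this technology.

If anyone has done this successfully before, please provide steps, a demo, a site, or an example. And please do not give only rough steps.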

2 answers:

Answer 0: (score: 0)

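First install Nutch, then paste the following into your Nutch configuration (the Nutch tutorial sets this property in conf/nutch-site.xml):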

<property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
</property>

In your nutch-default.xml, add:

<property>
  <name>http.robot.rules.whitelist</name>
  <value>nihilent.com</value>
  <description>Comma separated list of hostnames or IP addresses to ignore
  robot rules parsing for. Use with care and only if you are explicitly
  allowed by the site owner to ignore the site's robots.txt!
  </description>
</property>

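In regex-urlfilter.txt, replace the default accept-all rule with a rule for the target site:

# accept anything else
#+.
# accept only URLs on the target site
+^http://([a-z0-9]*\.)*nihilent\.com/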

and comment out:

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
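
To sanity-check an accept rule before crawling, the pattern can be tried against a few sample URLs. This is only a rough sketch: it assumes the intended rule is `+^http://([a-z0-9]*\.)*nihilent\.com/`, it uses `grep -E` rather than the Java regex engine Nutch uses (the two agree for a pattern this simple), and the helper name `check_url` is made up for illustration.

```shell
# Pattern taken from the accept rule, without the leading "+".
pattern='^http://([a-z0-9]*\.)*nihilent\.com/'

# Hypothetical helper: prints "accept" if the URL matches the pattern,
# "reject" otherwise, imitating a single "+" rule in regex-urlfilter.txt.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo accept
  else
    echo reject
  fi
}

check_url 'http://nihilent.com/'
check_url 'http://www.nihilent.com/about'
check_url 'http://example.com/'
```

URLs on the site and its subdomains should be accepted, and anything on another host should be rejected.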

Then run the following commands:

bin/nutch inject crawl/crawldb dmoz
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
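
The `inject crawl/crawldb urls` command above reads its seed URLs from plain-text files in the `urls` directory, so that directory has to exist before the first command runs. A minimal sketch of creating it, assuming a single seed file named `seed.txt` (the name the Nutch tutorial uses) and the site from this answer as the only seed:

```shell
# Create the seed directory that "bin/nutch inject crawl/crawldb urls" reads.
mkdir -p urls

# One URL per line; this overwrites any existing seed.txt.
printf '%s\n' 'http://nihilent.com/' > urls/seed.txt

# Show what will be injected.
cat urls/seed.txt
```

Run this from the Nutch installation directory, the same place the `bin/nutch` commands are run from.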

Now check the crawl/crawldb folder and the other folders under crawl for the resulting data.
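
The files under crawl/crawldb are binary, so looking at the folder directly does not say much. One way to check whether anything was fetched is the `readdb -stats` command. The sketch below wraps it in a small function; the function name `crawl_stats` and the `NUTCH` variable are assumptions made for illustration, not part of Nutch.

```shell
# Path to the nutch launcher; override NUTCH if it lives elsewhere.
NUTCH="${NUTCH:-bin/nutch}"

# Hypothetical helper: print summary statistics for a crawldb
# (defaults to crawl/crawldb, the path used in the commands above).
crawl_stats() {
  "$NUTCH" readdb "${1:-crawl/crawldb}" -stats
}
```

Calling `crawl_stats` after `updatedb` prints URL counts by status. If no URLs show up as fetched, the usual suspects are the seed list, the URL filter, or the agent name.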

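Answer 1: (score: 0)

Below are some commands that can help you work with Nutch in various ways.

  • These commands cover crawling directly from the console, reading and dumping the crawl data, and so on.
  • I have listed all the commands I used; please modify them to suit your requirements.

Nutch commands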

bin/nutch inject crawl/crawldb dmoz
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
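
One pass of generate, fetch, parse, and updatedb only fetches the URLs currently due in the crawldb, so reaching pages linked from the seed usually takes several passes. A sketch of wrapping one pass in a function, assuming the same crawl/crawldb and crawl/segments paths as above; the names `latest_segment`, `crawl_round`, and `NUTCH` are made up for illustration.

```shell
# Path to the nutch launcher; override NUTCH if it lives elsewhere.
NUTCH="${NUTCH:-bin/nutch}"

# Segment directories are named by timestamp, so the newest one
# is the last entry in sorted order.
latest_segment() {
  ls -d "$1"/2* | tail -1
}

# One crawl round: generate a fetch list, fetch it, parse it,
# and merge the results back into the crawldb.
crawl_round() {
  "$NUTCH" generate crawl/crawldb crawl/segments || return 1
  seg=$(latest_segment crawl/segments)
  "$NUTCH" fetch "$seg" &&
  "$NUTCH" parse "$seg" &&
  "$NUTCH" updatedb crawl/crawldb "$seg"
}
```

For example, `for i in 1 2 3; do crawl_round || break; done` would run up to three rounds and stop early if a step fails, after which `invertlinks` can be run once over all segments.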

bin/nutch commoncrawldump -outputDir hdfs://localhost:9000/dfs -segment /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments/ -jsonArray -reverseKey -SimpleDateFormat -epochFilename

bin/nutch readseg -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments/ /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/1

bin/nutch readseg -get /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments http://1465212304000.html -nofetch -nogenerate -noparse -noparsedata -noparsetext
  

bin/nutch parsechecker -dumpText http://nihilent.com/

bin/nutch readlinkdb /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/linkdb -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/Data/Team-A/fileLinkedIn/3

bin/nutch readdb crawl/crawldb -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/Data/Team-A/fileLinkedIn

bin/nutch readdb crawl/crawldb -dump hdfs://localhost:9000/dfs

hadoop fs -copyFromLocal 

hadoop fs -copyFromLocal /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/data/commoncrawl/com hdfs://localhost:9000/dfs

I posted this as a new answer to avoid sandwiching it into the existing one.