Solr index is empty after the nutch solrindex command

Date: 2011-08-05 00:29:26

Tags: solr nutch

I am using Nutch and Solr to index a file share.

I first run: bin/nutch crawl urls

This gives me:

solrUrl is not set, indexing will be skipped...
crawl started in: crawl-20110804191414
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2011-08-04 19:14:14
Injector: crawlDb: crawl-20110804191414/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-08-04 19:14:16, elapsed: 00:00:02
Generator: starting at 2011-08-04 19:14:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl-20110804191414/segments/20110804191418
Generator: finished at 2011-08-04 19:14:20, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-08-04 19:14:20
Fetcher: segment: crawl-20110804191414/segments/20110804191418
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
fetching file:///mnt/public/Personal/Reminder Building Security.htm
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-08-04 19:14:22, elapsed: 00:00:02
ParseSegment: starting at 2011-08-04 19:14:22
ParseSegment: segment: crawl-20110804191414/segments/20110804191418
ParseSegment: finished at 2011-08-04 19:14:23, elapsed: 00:00:01
CrawlDb update: starting at 2011-08-04 19:14:23
CrawlDb update: db: crawl-20110804191414/crawldb
CrawlDb update: segments: [crawl-20110804191414/segments/20110804191418]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-08-04 19:14:24, elapsed: 00:00:01
Generator: starting at 2011-08-04 19:14:24
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2011-08-04 19:14:25
LinkDb: linkdb: crawl-20110804191414/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/nutch/nutch-1.3/runtime/local/crawl-20110804191414/segments/20110804191418
LinkDb: finished at 2011-08-04 19:14:26, elapsed: 00:00:01
crawl finished: crawl-20110804191414
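
A side note on the "solrUrl is not set, indexing will be skipped..." line above: in Nutch 1.3 the crawl command can post to Solr in the same pass if it is given the Solr URL. A minimal sketch, using the Solr URL from this question and illustrative depth/topN values:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 5 -topN 50

With -solr set, the crawl should run the Solr indexing step itself at the end, so a separate bin/nutch solrindex invocation is normally not needed.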

Then I run: bin/nutch solrindex http://localhost:8983/solr/ crawl-20110804191414/crawldb crawl-20110804191414/linkdb crawl-20110804191414/segments/*

This gives me:

SolrIndexer: starting at 2011-08-04 19:17:07
SolrIndexer: finished at 2011-08-04 19:17:08, elapsed: 00:00:01
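
A SolrIndexer run that finishes in one second is itself a hint that little or nothing was submitted. One way to see what the crawl actually produced is to print CrawlDb statistics with the standard readdb tool; a sketch, using the crawl directory from above:

bin/nutch readdb crawl-20110804191414/crawldb -stats

The status counts (db_fetched vs. db_unfetched) show whether the fetched file actually made it into the CrawlDb before indexing.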

When I run a query against Solr, I get:

<response>
     <lst name="responseHeader">
          <int name="status">0</int>
          <int name="QTime">2</int>
          <lst name="params">
               <str name="indent">on</str>
               <str name="start">0</str>
               <str name="q">*:*</str>
               <str name="version">2.2</str>
               <str name="rows">10</str>
          </lst>
     </lst>
     <result name="response" numFound="0" start="0"/>
</response>

:(
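
One cheap thing to rule out when numFound is 0 immediately after indexing is a missing commit: documents that were sent to Solr but never committed will not show up in searches. An explicit commit through the standard update handler, using the same Solr URL as above, settles it either way:

curl 'http://localhost:8983/solr/update?commit=true'

If numFound is still 0 after that, no documents were actually submitted by solrindex.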

Note that this works fine when I use protocol-http to crawl a website, but it does not work when I use protocol-file to crawl the file system.
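
For reference, since the fetcher log above shows the file really was fetched, the protocol-file plumbing itself seems fine; the usual baseline configuration for file-system crawls in Nutch 1.x looks like the sketch below (the plugin list shown is only illustrative):

<!-- conf/nutch-site.xml: enable the file protocol -->
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

# conf/regex-urlfilter.txt: drop 'file|' from the default -^(file|ftp|mailto): rule
# so file: URLs are not skipped, and keep the catch-all accept rule
-^(ftp|mailto):
+.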

--- EDIT --- After trying this again today, I noticed that files with spaces in their names are causing 404 errors, and that describes a lot of the files on the share I am indexing. However, the thumbs.db files go through fine, which tells me the problem is not what I thought it was.
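
If the 404s really do come from spaces in file names, the usual workaround is to percent-encode the seed URLs; for example, the file from the fetcher log above would be written as:

file:///mnt/public/Personal/Reminder%20Building%20Security.htm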

2 Answers:

Answer 0 (score: 0):

I spent a good part of today retracing your steps. I ended up resorting to printf-style debugging in /opt/nutch/src/java/org/apache/nutch/indexer/IndexerMapReduce.java, which showed me that every URL I was trying to index appeared twice: once starting with file:///var/www/Engineering/, as I had originally specified it, and once starting with file:/u/u60/Engineering/. On this system, /var/www/Engineering is a symlink to /u/u60/Engineering. Furthermore, the /var/www/Engineering URLs were rejected because no parseText field was supplied, and the /u/u60/Engineering URLs were rejected because no fetchDatum field was supplied. Specifying the original URLs in the /u/u60/Engineering form solved my problem. Hope that helps the next person who runs into this.
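
A lighter-weight way to spot the same kind of duplication, without patching IndexerMapReduce.java, is to dump the CrawlDb and the segment as text and grep for the two path prefixes; a sketch, using this answer's paths and an illustrative crawl directory name:

bin/nutch readdb crawl-20110804191414/crawldb -dump crawldb-dump
bin/nutch readseg -dump crawl-20110804191414/segments/20110804191418 segment-dump
grep -rc '/var/www/Engineering' crawldb-dump segment-dump
grep -rc '/u/u60/Engineering' crawldb-dump segment-dump

If the same document shows up under both the symlinked and the resolved prefix, that is exactly the parseText/fetchDatum mismatch described above.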

Answer 1 (score: 0):

This happens because Solr is not receiving the indexed data, which suggests the earlier commands did not run correctly. Restart the whole process from the beginning and then try the last command again. Copy the commands from here: https://wiki.apache.org/nutch/NutchTutorial or see my video on YouTube - https://www.youtube.com/watch?v=aEap3B3M-PU&t=449s
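
For reference, a sketch of the step-by-step sequence that tutorial walks through, with crawl/ as an illustrative output directory (the one-shot bin/nutch crawl command bundles the same steps):

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

Running the steps one at a time makes it easier to see which stage produces no output before solrindex is reached.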