Apache Nutch multi-table custom MR job not working

Time: 2019-08-29 10:31:49

Tags: java hadoop hbase nutch nutch2

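I have a small setup of Nutch (2.3.1), HBase, and Hadoop. I have already crawled some data into one table. Now I need to parse this table with a MapReduce job and write the output to a new table. Here are my classes.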

Driver and Mapper class:

...

  static {
    FIELDS.add(WebPage.Field.STATUS);
    FIELDS.add(WebPage.Field.MARKERS);
  }
  public static class Mapper extends GoraMapper<String, WebPage, Text, WebPage> {

    @Override
    protected void map(String key, WebPage page, Context context)
        throws IOException, InterruptedException {
      LOG.info("Testing: " + key);

      Mark.FETCH_MARK.putMark(page, new Utf8("all"));
      // some other logic here

      context.write(new Text(key), page);

...

  public void updateDomains(boolean buildLinkDb) throws Exception {

    NutchJob job = NutchJob.getInstance(getConf(), "updateMarker");

    // === Map ===
    DataStore<String, WebPage> pageStore = StorageUtils.createWebStore(
        job.getConfiguration(), String.class, WebPage.class);
    Query<String, WebPage> query = pageStore.newQuery();
    query.setFields(StorageUtils.toStringArray(FIELDS)); // Note: pages without
                                                         // these fields are
                                                         // skipped
    LOG.info("Updating Markers: 2");
    GoraMapper.initMapperJob(job, query, pageStore, Text.class, WebPage.class,
        MarkerUpdateJob.Mapper.class, null, true);

    job.setNumReduceTasks(1);
    LOG.info(job.getConfiguration().get(Nutch.CRAWL_ID_KEY));
    // === Reduce ===
    job.getConfiguration().set(Nutch.CRAWL_ID_KEY, "hms3");

    DataStore<String, WebPage> store2 = StorageUtils.createWebStore(
        job.getConfiguration(), String.class, WebPage.class);

    LOG.info("Reducer before start: " + job.getConfiguration().get(Nutch.CRAWL_ID_KEY));
    GoraReducer.initReducerJob(job, store2, MarkerUpdateReducer.class);

    job.waitForCompletion(true);
  }
}

Reducer class:

public class MarkerUpdateReducer extends
    GoraReducer<Text, WebPage, String, WebPage> {

  public static final Logger LOG = MarkerUpdateJob.LOG;
  private static Configuration conf;
  static {
    conf = NutchConfiguration.create();
  }

  @Override
  protected void reduce(Text key, Iterable<WebPage> values, Context context)
      throws IOException, InterruptedException {

    for (WebPage page : values) {
      context.write(key.toString(), page);
    }
  }
}

I run the following command to execute the job on a pseudo-distributed cluster:

bin/nutch customJob -crawlId b1

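When I scan hms3_webpage from the HBase shell, the new table has been created but it is empty. I also checked the logging in the mapper and found that the map phase finishes without processing any data. Where is the problem? Is this the correct way to copy (and parse) data from one table and save it to a new table? As a guide, I used the hostdbupdate job in Nutch as an example.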
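
One thing that may be worth checking (an assumption; nothing in this post confirms it): updateDomains passes the same job.getConfiguration() to both createWebStore calls and changes Nutch.CRAWL_ID_KEY to "hms3" in between. If the table name is resolved from that shared configuration when the map tasks start, the mapper could end up reading the new, empty hms3_webpage table instead of the crawled one. The sketch below simulates that effect in plain Java. The class SharedConfSketch, the keys "crawl.id" and "schema.name", and the createStore helper are hypothetical stand-ins, not Nutch or Gora API.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Plain-Java simulation (no Nutch or Gora classes) of two stores that are
 * configured through one shared, mutable configuration object.
 */
public class SharedConfSketch {
    // Stand-in keys for illustration only; not taken from the Nutch sources.
    public static final String CRAWL_ID_KEY = "crawl.id";
    public static final String SCHEMA_KEY = "schema.name";

    /** Hypothetical store factory: derives the table name and records it in conf. */
    public static String createStore(Map<String, String> conf) {
        String crawlId = conf.getOrDefault(CRAWL_ID_KEY, "");
        String table = crawlId.isEmpty() ? "webpage" : crawlId + "_webpage";
        conf.put(SCHEMA_KEY, table); // side effect on the (possibly shared) conf
        return table;
    }

    /** Table a task would open if it rebuilt its input store from the shared conf. */
    public static String inputTableSeenByTask(String inputCrawlId, String outputCrawlId) {
        Map<String, String> conf = new HashMap<>();
        conf.put(CRAWL_ID_KEY, inputCrawlId);
        createStore(conf);                     // input store configured first
        conf.put(CRAWL_ID_KEY, outputCrawlId); // same conf reused for the output
        createStore(conf);                     // overwrites the recorded table name
        return conf.get(SCHEMA_KEY);
    }

    /** Same setup, but the output store is configured on its own copy of the conf. */
    public static String inputTableWithSeparateConf(String inputCrawlId, String outputCrawlId) {
        Map<String, String> inputConf = new HashMap<>();
        inputConf.put(CRAWL_ID_KEY, inputCrawlId);
        createStore(inputConf);
        Map<String, String> outputConf = new HashMap<>(inputConf); // independent copy
        outputConf.put(CRAWL_ID_KEY, outputCrawlId);
        createStore(outputConf);
        return inputConf.get(SCHEMA_KEY);      // input side is left untouched
    }

    public static void main(String[] args) {
        System.out.println(inputTableSeenByTask("b1", "hms3"));
        System.out.println(inputTableWithSeparateConf("b1", "hms3"));
    }
}
```

If this is what happens in the real job, logging the table name each store actually resolves to (for both pageStore and store2) before job.waitForCompletion would show whether the input side still points at the crawled table.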

0 answers:

No answers