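I have a small setup of Nutch (2.3.1), HBase, and Hadoop. I have crawled some data into a table. Now I need to parse this table with a MapReduce job and write the output to a new table. My classes are below.
Driver and Mapper class: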
...
static {
FIELDS.add(WebPage.Field.STATUS);
FIELDS.add(WebPage.Field.MARKERS);
}
public static class Mapper extends GoraMapper<String, WebPage, Text, WebPage> {
@Override
protected void map(String key, WebPage page, Context context)
throws IOException, InterruptedException {
LOG.info("Testing: " + key);
Mark.FETCH_MARK.putMark(page, new Utf8("all"));
// some other logic here
context.write( new Text(key), page );
...
public void updateDomains(boolean buildLinkDb) throws Exception {
NutchJob job = NutchJob.getInstance(getConf(), "updateMarker");
// === Map ===
DataStore<String, WebPage> pageStore = StorageUtils.createWebStore(
job.getConfiguration(), String.class, WebPage.class);
Query<String, WebPage> query = pageStore.newQuery();
// Note: pages without these fields are skipped
query.setFields(StorageUtils.toStringArray(FIELDS));
LOG.info("Updating Markers: 2");
GoraMapper.initMapperJob(job, query, pageStore, Text.class, WebPage.class,
MarkerUpdateJob.Mapper.class, null, true);
job.setNumReduceTasks(1);
LOG.info( job.getConfiguration().get(Nutch.CRAWL_ID_KEY ) );
// === Reduce ===
job.getConfiguration().set(Nutch.CRAWL_ID_KEY, "hms3");
DataStore<String, WebPage> store2 = StorageUtils.createWebStore(
job.getConfiguration(), String.class, WebPage.class);
LOG.info( "Reducer before start: " + job.getConfiguration().get(Nutch.CRAWL_ID_KEY ) );
GoraReducer.initReducerJob(job, store2, MarkerUpdateReducer.class);
job.waitForCompletion(true);
}
}
Reducer class:
public class MarkerUpdateReducer extends
GoraReducer<Text, WebPage, String, WebPage> {
public static final Logger LOG = MarkerUpdateJob.LOG;
private static Configuration conf;
static {
conf = NutchConfiguration.create();
}
@Override
protected void reduce(Text key, Iterable<WebPage> values, Context context)
throws IOException, InterruptedException {
for ( WebPage page : values )
{
context.write(key.toString(), page);
}
}
}
I run the following command to execute the job on a pseudo-distributed cluster:
bin/nutch customJob -crawlId b1
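When I scan hms3_webpage from the HBase shell, the new table has been created, but it is empty. I also checked the logging in the mapper and found that the map phase completed without processing any data. Where is the problem? Is this the correct way to copy (and parse) data from one table and save it to a new table? As a guide, I followed the hostdbupdate job in Nutch as an example.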