Accessing Files in a Mapper via the Distributed Cache

Date: 2014-02-19 13:54:31

Tags: hadoop

I want to access the contents of a distributed cache file inside my Mapper. Below is the code I wrote, but it only emits the file names from the distributed cache. Please help me read the contents of the file instead.

    public class DistCacheExampleMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        Text a = new Text();
        Path[] dates = new Path[0];

        public void configure(JobConf conf) {
            try {
                dates = DistributedCache.getLocalCacheFiles(conf);
                String astr = dates.toString();
                a = new Text(astr);
            } catch (IOException ioe) {
                System.err.println("Caught exception while getting cached files: "
                        + StringUtils.stringifyException(ioe));
            }
        }

        @Override
        public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException {

            String line = value.toString();

            for (Path cacheFile : dates) {
                output.collect(new Text(line), new Text(cacheFile.getName()));
            }
        }
    }
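
For context (this part is not shown in the question), the file has to be registered with the distributed cache in the job driver before getLocalCacheFiles() can return it in the tasks. The following is only a minimal driver sketch using the old org.apache.hadoop.mapred API; the class name DistCacheExample, the HDFS path /user/hadoop/dates.txt, and the job wiring are assumptions, not taken from the question:

    // Driver sketch (old org.apache.hadoop.mapred API); names and paths are placeholders.
    import java.net.URI;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class DistCacheExample {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(DistCacheExample.class);
            conf.setJobName("distcache-example");

            conf.setMapperClass(DistCacheExampleMapper.class);
            conf.setNumReduceTasks(0);              // map-only job
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Register the HDFS file so every task gets a local copy in its cache.
            DistributedCache.addCacheFile(new URI("/user/hadoop/dates.txt"), conf);

            JobClient.runJob(conf);
        }
    }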

1 Answer:

Answer 0 (score: 0)

Try this in your configure() method:

    List<String[]> lines;
    Path[] files = new Path[0];

    public void configure(JobConf conf) {
        lines = new ArrayList<>();
        BufferedReader SW;
        try {
            files = DistributedCache.getLocalCacheFiles(conf);
            SW = new BufferedReader(new FileReader(files[0].toString()));
            String line;
            while ((line = SW.readLine()) != null) {
                // Each entry of 'lines' is a String array, one element per column.
                lines.add(line.split(","));
            }
            SW.close();
        } catch (IOException ioe) {
            System.err.println("Caught exception while getting cached files: "
                    + StringUtils.stringifyException(ioe));
        }
    }

This way you get the contents of a file from the distributed cache (here, the first file) in the variable lines. Each entry of lines is a String array, split on ','. So the first column of the first line is lines.get(0)[0], the third column of the second line is lines.get(1)[2], and so on.
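
As a usage sketch (not part of the original answer), the map() method could then combine each input record with the cached rows; emitting the first column of every cached row here is only an illustrative choice:

    // Hypothetical map() that uses the 'lines' field populated in configure().
    @Override
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {
        String record = value.toString();
        for (String[] columns : lines) {
            if (columns.length > 0) {
                // Pair every input record with the first column of each cached row.
                output.collect(new Text(record), new Text(columns[0]));
            }
        }
    }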