Writing a Hadoop sequence file

Posted: 2014-12-22 07:21:16

Tags: hadoop mahout

I have a text file containing data written in the following (key, value) format:

1,34
5,67
8,88

The file resides on the local file system.

I want to convert it into a Hadoop sequence file, again on the local file system, so that I can use it in Mahout. The sequence file should contain all the records: for the first record, 1 is the key and 34 is the value, and likewise for the other records.
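Conceptually, each line just needs to be split on the comma into a numeric key and a text value. A minimal sketch of that parsing step in plain Java (no Hadoop required; the `KeyValueLine` class and `parseLine` helper are illustrative names, not part of any library):

```java
// Illustrative helper: split one "key,value" line on the first comma;
// the key becomes a long, the remainder stays as text.
public class KeyValueLine {
    final long key;
    final String value;

    KeyValueLine(long key, String value) {
        this.key = key;
        this.value = value;
    }

    static KeyValueLine parseLine(String line) {
        // Limit of 2 keeps any further commas inside the value.
        String[] parts = line.split(",", 2);
        return new KeyValueLine(Long.parseLong(parts[0]), parts[1]);
    }
}
```

The same split-and-parse logic is what the sequence-file writer below performs per line, wrapping the results in `LongWritable` and `Text`.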

I am new to Java. Any help would be appreciated.

Thanks.

1 answer:

Answer 0 (score: 0)

I did find a way. Here is the code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CreateSequenceFile {
    public static void main(String[] args) throws IOException {
        String myfile = "/home/ashokharnal/keyvalue.txt";
        String outputseqfile = "/home/ashokharnal/part-0000";
        Path path = new Path(outputseqfile);
        String fieldDelimiter = ",";

        // Open the input text file.
        BufferedReader br = new BufferedReader(new FileReader(myfile));

        // Create the sequence-file writer. With a default Configuration,
        // FileSystem.get returns the local file system.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path,
                LongWritable.class, Text.class);
        try {
            String line;
            while ((line = br.readLine()) != null) {
                // Split "key,value" and wrap the parts in Writable types.
                String[] parts = line.split(fieldDelimiter);
                LongWritable key = new LongWritable(Long.parseLong(parts[0]));
                Text value = new Text(parts[1]);
                writer.append(key, value);
                System.out.println("Appended to sequence file key " + key + " and value " + value);
            }
        } finally {
            writer.close();
            br.close();
        }
    }
}
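To check the result, the file can be read back with `SequenceFile.Reader` from the same Hadoop API. A sketch, assuming the same output path and key/value types as above (the class name `ReadSequenceFile` is my own choice):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ReadSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/home/ashokharnal/part-0000");

        // Open the sequence file and iterate over its (key, value) pairs.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            LongWritable key = new LongWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}
```

For the sample input, this should print each key and value on its own line, which confirms the file is readable by Mahout jobs that expect `LongWritable`/`Text` pairs.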