Convert a JavaPairRDD to a DataFrame in the Java API

Date: 2017-05-24 22:59:59

Tags: java apache-spark spark-dataframe rdd java-pair-rdd

I am using Spark 1.6 with Java 7.

I have a pair RDD:

JavaPairRDD<String, String> filesRDD = sc.wholeTextFiles(args[0]);

I want to convert it into a DataFrame with a schema.

It seems that I first have to convert the pair RDD into a Row RDD.

So how do I create a Row RDD from a pair RDD?

2 answers:

Answer 0: (score: 3)

For Java 7 you need to define a map function (as an anonymous class, since lambdas are not available before Java 8):

public static final Function<Tuple2<String, String>, Row> mappingFunc =
        new Function<Tuple2<String, String>, Row>() {
            public Row call(Tuple2<String, String> tuple) {
                return RowFactory.create(tuple._1(), tuple._2());
            }
        };

Now you can call this function to get a JavaRDD<Row>:

JavaRDD<Row> rowRDD = filesRDD.map(mappingFunc);

With Java 8 it is as simple as:

JavaRDD<Row> rowRDD = filesRDD.map(tuple -> RowFactory.create(tuple._1(),tuple._2()));
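
Since the question asks for a DataFrame with a schema, one more step is needed: pair the rowRDD with a StructType. A minimal sketch of that step, assuming two string columns; the names fileName and content are placeholders, not part of the original answer:

StructType schema = DataTypes.createStructType(new StructField[]{
        //one column per tuple element: the file path and the file content
        DataTypes.createStructField("fileName", DataTypes.StringType, false),
        DataTypes.createStructField("content", DataTypes.StringType, true)});

DataFrame df = sqlContext.createDataFrame(rowRDD, schema);
df.printSchema();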

Another way to get a DataFrame from a JavaPairRDD is:

DataFrame df = sqlContext.createDataset(JavaPairRDD.toRDD(filesRDD), Encoders.tuple(Encoders.STRING(),Encoders.STRING())).toDF();
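
Note that with the tuple encoder the resulting columns get the default names _1 and _2. If descriptive names are wanted, they can be renamed afterwards; a small sketch, where the new names are again placeholders:

//rename the default tuple columns to something more readable
DataFrame renamed = df
        .withColumnRenamed("_1", "fileName")
        .withColumnRenamed("_2", "content");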

Answer 1: (score: 0)

Here is one way this can be done.

    //Read whole files
    JavaPairRDD<String, String> pairRDD = sparkContext.wholeTextFiles(path);

    //create a structType for creating the dataframe later. You might want to
    //do this in a different way if your schema is big/complicated. For the sake of this
    //example I took a simple one.
    StructType structType = DataTypes
            .createStructType(
                    new StructField[]{
                            DataTypes.createStructField("id", DataTypes.StringType, true)
                            , DataTypes.createStructField("name", DataTypes.StringType, true)});


    //create an RDD<Row> from pairRDD
    JavaRDD<Row> rowJavaRDD = pairRDD.values().flatMap(new FlatMapFunction<String, Row>() {
        public Iterable<Row> call(String s) throws Exception {
            List<Row> rows = new ArrayList<Row>();
            for (String line : s.split("\n")) {
                //split on the comma and trim the surrounding whitespace
                //(the sample data below has a space after each comma)
                String[] values = line.split(",");
                Row row = RowFactory.create(values[0].trim(), values[1].trim());
                rows.add(row);
            }
            return rows;
        }
    });


    //Create the DataFrame and show it.
    DataFrame df = sqlContext.createDataFrame(rowJavaRDD, structType);
    df.show();

Sample data I used:

File1:

1, john  
2, steve

File2:

3, Mike  
4, Mary  

Output from df.show():

+---+------+
| id|  name|
+---+------+
|  1|  john|
|  2| steve|
|  3|  Mike|
|  4|  Mary|
+---+------+