Question

In our code, the DataFrame is created as:

DataFrame DF = hiveContext.sql("select * from table_instance");

When I convert my DataFrame to an RDD and check its number of partitions:

RDD<Row> newRDD = DF.rdd();
System.out.println(newRDD.getNumPartitions());

the partition count drops to 1 (the console prints 1). Originally my DataFrame had 102 partitions.

Update

While reading, I repartitioned the DataFrame:

DataFrame DF = hiveContext.sql("select * from table_instance").repartition(200);

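and then converted it to an RDD, which gave me exactly 200 partitions.

Does JavaSparkContext play any role in this? When we convert a DataFrame to an RDD, is the default minimum partitions flag at the Spark context level also taken into account?

Update:

I wrote a separate sample program in which I read exactly the same table into a DataFrame and converted it to an RDD. No extra stage was created for the RDD conversion, and the partition count was correct. Now I am wondering what I am doing differently in my main program.

Please let me know if my understanding is wrong.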

Answer 1

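It mainly depends on the implementation of hiveContext.sql(). Since I am new to Hive, my guess is that hiveContext.sql either does not know how to split the data in the table or is unable to.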

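For example, when you read a text file from HDFS, the Spark context uses the number of blocks occupied by that file to determine the number of partitions.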
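As a rough illustration of that block-based rule, the sketch below estimates how many partitions a file would get if each HDFS block became one partition. The 128 MB block size and the class and method names are assumptions made for illustration. The real split calculation in Hadoop's input formats also considers the minimum split size and the requested minimum number of partitions, so treat this as an approximation.

```java
// Rough sketch: one partition per HDFS block (an approximation, not Spark's exact logic).
class BlockPartitionEstimate {

    // Assumed block size of 128 MB; the actual value depends on the cluster configuration.
    static final long ASSUMED_BLOCK_SIZE = 128L * 1024 * 1024;

    // Returns ceil(fileSizeBytes / blockSizeBytes).
    // An empty file is counted as one partition in this sketch.
    static long estimatePartitions(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes <= 0) {
            return 1;
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println(estimatePartitions(oneGb, ASSUMED_BLOCK_SIZE));     // prints 8
        System.out.println(estimatePartitions(oneGb + 1, ASSUMED_BLOCK_SIZE)); // prints 9
    }
}
```

If a source reports a single partition, comparing it against an estimate like this can hint at whether the data is simply small or whether it is not being split at all.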

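What you did with repartition is the obvious solution to this kind of problem. (Note: repartition may cause a shuffle if a proper partitioner is not used; the hash partitioner is used by default.)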
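To make the note about hash partitioning concrete, here is a minimal plain-Java sketch of the idea: a key's partition is its hash code taken modulo the partition count, kept non-negative. The class and method names are hypothetical, and the sketch only mirrors the general approach; it is not a copy of Spark's HashPartitioner.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of hash partitioning: partition = non-negative (hashCode mod numPartitions).
class HashPartitionSketch {

    // Maps a key to a partition index in the range [0, numPartitions).
    static int partitionFor(Object key, int numPartitions) {
        if (key == null) {
            return 0; // null keys go to partition 0 in this sketch
        }
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("alpha", "beta", "gamma", "alpha");
        for (String key : keys) {
            System.out.println(key + " -> partition " + partitionFor(key, 4));
        }
    }
}
```

Because placement depends only on the key's hash, equal keys always land in the same partition, but moving existing rows to their target partitions generally requires a shuffle.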

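Regarding your question, hiveContext may well take the default minimum partitions property into account. However, relying on a default will not solve all your problems: for instance, if your Hive table grows, your program will still use the default number of partitions.

Update: Avoiding a shuffle during repartition

Define your custom partitioner: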

import org.apache.spark.HashPartitioner;

public class MyPartitioner extends HashPartitioner {
    private final int partitions;

    public MyPartitioner(int partitions) {
        // HashPartitioner has no no-arg constructor; pass the partition count through.
        super(partitions);
        this.partitions = partitions;
    }

    @Override
    public int numPartitions() {
        return this.partitions;
    }

    @Override
    public int getPartition(Object key) {
        if (key instanceof Integer) {
            // floorMod keeps the index non-negative even for negative keys.
            return Math.floorMod((Integer) key, this.partitions);
        } else if (key instanceof Long) {
            return (int) Math.floorMod((Long) key, (long) this.partitions);
        }
        // TODO ... add more types
        // Strings and any other key type fall back to hash partitioning.
        return super.getPartition(key);
    }
}
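One detail worth checking in a partitioner like this is how negative numeric keys are handled: Java's % operator keeps the sign of the dividend, while a partition index is expected to lie between 0 and numPartitions - 1. The following plain-Java sketch (class and method names are hypothetical) contrasts % with Math.floorMod.

```java
// Sketch comparing the remainder operator with floorMod for partition indexes.
class ModuloSignSketch {

    // Uses %, which can return a negative index for negative keys.
    static int withRemainder(long key, int numPartitions) {
        return (int) (key % numPartitions);
    }

    // Uses floorMod, which always returns a value in [0, numPartitions).
    static int withFloorMod(long key, int numPartitions) {
        return (int) Math.floorMod(key, (long) numPartitions);
    }

    public static void main(String[] args) {
        System.out.println(withRemainder(-7L, 4)); // prints -3
        System.out.println(withFloorMod(-7L, 4));  // prints 1
        System.out.println(withRemainder(7L, 4));  // prints 3
        System.out.println(withFloorMod(7L, 4));   // prints 3
    }
}
```

For non-negative keys the two approaches agree, so the difference only matters if the key column can contain negative values.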

Use your custom partitioner:

JavaPairRDD<Long, SparkDatoinDoc> pairRdd = hiveContext.sql("select * from table_instance")
    .javaRDD()
    .mapToPair(/* TODO ... expose the column as key */);

pairRdd = pairRdd.partitionBy(new MyPartitioner(200));
//... rest of processing

Converting a DataFrame to an RDD reduces the number of partitions
