Question

How can I use Java to link an ORC file's ColumnStatistics to the column names defined in its schema (TypeDescription)?

    Reader reader = OrcFile.createReader(ignored);
    TypeDescription schema = reader.getSchema();
    ColumnStatistics[] stats = reader.getStatistics();

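The column statistics hold the stats for every column type in one flat array. The schema, however, is a tree of schemas. Are the column statistics a tree traversal of the schema (depth-first, perhaps)?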

I tried using orc-statistics, but it only outputs column IDs.

Answer 1

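I figured out that the file statistics match a DFS traversal of the schema. The traversal includes intermediate schemas that hold no data of their own, such as Struct and List. It also includes the overall schema as the first node. The ORC Specification v1 documentation explains this: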

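The type tree is flattened in to a list via a pre-order traversal where each type is assigned the next id. Clearly the root of the type tree is always type id 0. Compound types have a field named subtypes that contains the list of their children's type ids.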
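To make the numbering concrete, here is a minimal sketch of a pre-order flattening. It does not use the ORC library: `PreOrderIds` and `TypeNode` are hypothetical stand-ins invented for illustration, and the labels are made up. The only point is that a node is visited before its children, so its position in the flat list equals its type id.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a type tree; NOT the ORC TypeDescription class. */
final class PreOrderIds {
  private PreOrderIds() {}

  /** Hypothetical type node: a label plus zero or more child types. */
  static final class TypeNode {
    final String label;
    final List<TypeNode> children = new ArrayList<>();

    TypeNode(String label, TypeNode... kids) {
      this.label = label;
      for (TypeNode kid : kids) {
        children.add(kid);
      }
    }
  }

  /** Returns the labels in pre-order; the list index plays the role of the type id. */
  static List<String> flatten(TypeNode root) {
    List<String> out = new ArrayList<>();
    visit(root, out);
    return out;
  }

  /** Adds the node first, then recurses into its children from left to right. */
  private static void visit(TypeNode node, List<String> out) {
    out.add(node.label);
    for (TypeNode child : node.children) {
      visit(child, out);
    }
  }

  public static void main(String[] args) {
    // Models struct<id:int, tags:array<string>, address:struct<city:string>>.
    TypeNode root =
        new TypeNode(
            "root struct",
            new TypeNode("id"),
            new TypeNode("tags", new TypeNode("tags element")),
            new TypeNode("address", new TypeNode("city")));
    List<String> flat = flatten(root);
    for (int id = 0; id < flat.size(); id++) {
      System.out.println(id + " -> " + flat.get(id));
    }
  }
}
```

In this toy tree the root gets id 0, the list column `tags` gets id 2 and its element type gets id 3, which is why a flat statistics array has more entries than the schema has top-level fields.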

Complete code to get a flattened list of schema names from an ORC TypeDescription:

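import static com.google.common.collect.ImmutableList.toImmutableList;

import com.google.common.collect.ImmutableList;
import com.google.common.collect.Lists;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import org.apache.orc.TypeDescription;

final class OrcSchemas {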
  private OrcSchemas() {}

  /**
   * Returns all schema names in a depth-first traversal of schema.
   *
   * <p>The given schema is represented as '<ROOT>'. Unnamed child schemas, such as the element
   * type of a list, are represented using their category, like: 'parent::<STRUCT>::field'.
   *
   * <p>This method is useful because some Orc file methods like statistics return all column stats
   * in a single flat array. The single flat array is a depth-first traversal of all columns in a
   * schema, including intermediate columns like structs and lists.
   */
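  static ImmutableList<String> flattenNames(TypeDescription schema) {
    // getChildren() can return null for primitive types, so use the null-safe helper.
    List<TypeDescription> children = getChildren(schema);
    ArrayList<String> names = Lists.newArrayListWithExpectedSize(children.size() + 1);
    // Column id 0 is always the root type, even when the root has no children, so always emit
    // it to keep the names aligned with the statistics array.
    names.add("<ROOT>");
    mutateAddNamesDfs("", schema, names);
    return ImmutableList.copyOf(names);
  }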

  private static void mutateAddNamesDfs(
      String parentName, TypeDescription schema, List<String> dfsNames) {
    String separator = "::";
    ImmutableList<String> schemaNames = getFieldNames(parentName, schema);
    ImmutableList<TypeDescription> children = getChildren(schema);
    for (int i = 0; i < children.size(); i++) {
      String name = schemaNames.get(i);
      dfsNames.add(name);
      // Use the null-safe copy obtained above instead of calling getChildren() again.
      TypeDescription childSchema = children.get(i);
      mutateAddNamesDfs(name + separator, childSchema, dfsNames);
    }
  }

  private static ImmutableList<TypeDescription> getChildren(TypeDescription schema) {
    return Optional.ofNullable(schema.getChildren())
        .map(ImmutableList::copyOf)
        .orElse(ImmutableList.of());
  }

  private static ImmutableList<String> getFieldNames(String parentName, TypeDescription schema) {
    // Only structs carry field names. Calling getFieldNames() on any other category throws a
    // NullPointerException, so branch on the category instead of catching the exception.
    if (schema.getCategory() == TypeDescription.Category.STRUCT) {
      return schema.getFieldNames().stream()
          .map(n -> parentName + n)
          .collect(toImmutableList());
    }
    // Lists, maps and unions have unnamed children, so label each child by its category.
    // Primitive types have no children and therefore yield an empty list.
    return getChildren(schema).stream()
        .map(child -> parentName + "<" + child.getCategory() + ">")
        .collect(toImmutableList());
  }
}
