Question

I have a partitioned table. Its partitions start at 2017-06-20 and go onward.

My query:

import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val test_enc_orc = hiveContext.sql("select * from db.tbl where time_key = '2017-06-21' limit 1")

Every time I run it, Spark looks into the partition 2017-06-20:

INFO OrcFileOperator: ORC file hdfs://nameservice1/apps/hive/warehouse/db.db/tbl/time_key=2017-06-20/000016_0 has empty schema, it probably contains no rows. Trying to read another ORC file to figure out the schema.

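It then goes through all the files for the date 2017-06-20. That partition contains empty ORC files, whereas partition 2017-06-21 contains files with data. Why doesn't Spark look at the requested date, or at anything else, instead?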

Edit

Creating a test table:

drop table arstel.evkuzmin_test_it;

create table arstel.evkuzmin_test_it(name string)
partitioned by(ban int)
stored as orc;

insert into arstel.evkuzmin_test_it partition(ban) values
("bob", 1)
, ("marty", 1)
, ("monty", 2)
, ("naruto", 2)
, ("death", 4);

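It seems the problem really is caused by the empty files. This test table has none, so everything works fine. Is there a way to make Spark ignore them?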
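One thing that may be worth trying (an assumption, not something confirmed in this thread): the log line comes from Spark's native ORC reader, which opens files to infer the schema. Setting `spark.sql.hive.convertMetastoreOrc` to `false` asks Spark to read the table through the Hive SerDe using the metastore schema instead, which might avoid scanning the empty files. A minimal sketch, assuming a Spark version that supports this setting and the `sc` provided by spark-shell:

```scala
// Sketch only: `sc` is the SparkContext provided by spark-shell,
// and db.tbl is the table from the question.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Assumption: disable conversion of metastore ORC tables to Spark's native
// ORC data source, so the schema comes from the Hive metastore rather than
// from inspecting ORC files on HDFS.
hiveContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")

val testEncOrc = hiveContext.sql(
  "select * from db.tbl where time_key = '2017-06-21' limit 1")
testEncOrc.show()
```

Whether this changes which partitions are touched depends on the Spark version, so it is worth checking the log output again after setting it.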

Reading a Hive ORC table using Spark

0 answers: