Question

I have data in one Hive table and want to load it into another Hive table.

The source table is reg_logs, which has two partition columns, date and hour. Data is loaded into this table every hour. The schema is:

CREATE EXTERNAL TABLE IF NOT EXISTS reg_logs (
id int,
region_code int,
count int
)
PARTITIONED BY (utc_date STRING, utc_hour STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/ad_data/raw/reg_logs';

The destination table is reg_logs_org. All I want to do is copy all the data from reg_logs except the utc_hour column.

The schema I created is (please correct me if I am wrong):

CREATE EXTERNAL TABLE IF NOT EXISTS reg_logs_org (
id int,
region_code int,
count int
)
PARTITIONED BY (utc_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/ad_data/reg_logs_org';

Inserting data into reg_logs_org from reg_logs:

insert overwrite table reg_logs_org
select id, region_code, sum(count), utc_date
from 
reg_logs
group by 
utc_date, id, region_code

Error message:

 FAILED: SemanticException 1:23 Need to specify partition columns because the destination table is partitioned. Error encountered near token 'reg_logs_org'

==

Thank you,
Rio

Answer 1

This is because the partition information is missing from your insert query. Add a PARTITION clause:

  insert overwrite table reg_logs_org PARTITION (utc_date)
  select id, region_code, sum(count), utc_date
  from 
  reg_logs
  group by 
  utc_date, id, region_code
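
If only a single day needs to be loaded, a static partition value is an alternative to the dynamic form above. The following is a minimal sketch, assuming the date literal '2014-01-01' (a made-up value for illustration) exists in reg_logs. With a static partition, the partition column is named in the PARTITION clause and left out of the SELECT list.

```sql
-- Static partition insert: the partition value is fixed in the PARTITION clause,
-- so utc_date is NOT selected and is not part of the GROUP BY.
-- '2014-01-01' is a hypothetical date used only for illustration.
INSERT OVERWRITE TABLE reg_logs_org PARTITION (utc_date = '2014-01-01')
SELECT id, region_code, SUM(`count`) AS `count`
FROM reg_logs
WHERE utc_date = '2014-01-01'
GROUP BY id, region_code;
```

Because no partition value is derived from the data, this form should not depend on the dynamic partitioning settings.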

Answer 2

Create a copy of the table

CREATE TABLE dim_data_products_cube_backup LIKE dim_data_products_cube;

Enable dynamic partitioning

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.mapred.mode = nonstrict;

Copy the table

INSERT OVERWRITE TABLE dim_data_products_cube_backup PARTITION (ds)
SELECT * FROM dim_data_products_cube 
WHERE ds = ds;

The WHERE clause is needed if you use strict mode.
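
One way to check the copy afterwards is to list the partitions of the backup table and compare per-partition row counts. This is only a sketch using the table names from the statements above. It assumes hive.mapred.mode is nonstrict, since the unfiltered scans of a partitioned table would be rejected in strict mode.

```sql
-- List the partitions that the dynamic-partition insert created.
SHOW PARTITIONS dim_data_products_cube_backup;

-- Row counts per partition in the source table ...
SELECT ds, COUNT(*) AS row_count
FROM dim_data_products_cube
GROUP BY ds;

-- ... and in the backup; the two result sets should match.
SELECT ds, COUNT(*) AS row_count
FROM dim_data_products_cube_backup
GROUP BY ds;
```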

Answer 3

In some cases you may need to set hive.exec.dynamic.partition.mode=nonstrict to be able to insert data into a partitioned table. For example, given:

CREATE TABLE hivePartitionedTable
(
          c1    int
        , c2    int
        , c3    string
)
PARTITIONED BY (year  int)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE
;

Then this INSERT will work:

set hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO hivePartitionedTable PARTITION (year)
VALUES (1,2,'3', 1999);
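
For comparison, here is a sketch of the same row inserted with a static partition value. The partition value moves into the PARTITION clause and is dropped from the VALUES list, so the nonstrict setting should not be needed for this form.

```sql
-- Static partition: year is fixed in the PARTITION clause,
-- so VALUES supplies only the three regular columns c1, c2, c3.
INSERT INTO TABLE hivePartitionedTable PARTITION (year = 1999)
VALUES (1, 2, '3');
```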

Answer 4

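When inserting data, the partition column must be the last column in the SELECT list. Hive takes the partition value from the last column.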

So based on that, the insert query should be:

  insert overwrite table reg_logs_org PARTITION (utc_date)
  select id, region_code, sum(count), utc_date
  from 
  reg_logs
  group by 
  utc_date, id, region_code

From the documentation:

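The dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause.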
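
To illustrate the ordering rule with more than one partition column, here is a sketch that copies reg_logs into a table that keeps both partitions. The table name reg_logs_copy is hypothetical and not part of the question.

```sql
-- Hypothetical target table that keeps both partition columns of reg_logs.
CREATE TABLE IF NOT EXISTS reg_logs_copy (
  id int,
  region_code int,
  `count` int
)
PARTITIONED BY (utc_date STRING, utc_hour STRING);

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition columns come last in the SELECT list, in the same order
-- as in the PARTITION clause: utc_date first, then utc_hour.
INSERT OVERWRITE TABLE reg_logs_copy PARTITION (utc_date, utc_hour)
SELECT id, region_code, `count`, utc_date, utc_hour
FROM reg_logs;
```

Hive matches these columns by position rather than by name, so swapping utc_date and utc_hour in the SELECT list would write the rows into wrongly named partitions.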

Answer 5

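This will not work if the first partition of the source table is empty, that is, if there are no records in the first partition of the source table. In that case, I would suggest inserting a dummy record for the first partition in a separate insert script and removing that data afterwards.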

Loading data from one Hive table into another Hive table using partitions
