Inserting data into a Hive partition without overwriting existing data

Date: 2015-07-29 14:40:26

Tags: hadoop hive hiveql

Suppose I have two local files, file1.txt and file2.txt.

Contents of file1.txt:

1,a
3,c

Contents of file2.txt:

2,b
4,d

I put these files on Hadoop like this:

hadoop fs -rm -r /user/cloudera/repart2/*
hadoop fs -mkdir -p /user/cloudera/repart2/20150401
hadoop fs -put file1.txt /user/cloudera/repart2/20150401/
hadoop fs -mkdir -p /user/cloudera/repart2/20150402
hadoop fs -put file2.txt /user/cloudera/repart2/20150402/

I then created a Hive table:

-- Select a test database
use training;

-- Create the table
create external table repart (
col1 int, col2 string)
PARTITIONED BY (Test int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
location '/user/cloudera/repart2';

-- Add partitions
ALTER TABLE repart ADD PARTITION (Test='20150401') LOCATION '/user/cloudera/repart2/20150401/';
ALTER TABLE repart ADD PARTITION (Test='20150402') LOCATION '/user/cloudera/repart2/20150402/';

When I run a select statement

select * from repart;

it shows

1   a   20150401
3   c   20150401
2   b   20150402
4   d   20150402

I want my table to end up looking like this:

1   a   20150401
2   b   20150401
3   c   20150401
4   d   20150401
2   b   20150402
4   d   20150402

But when I try this insert query

INSERT INTO TABLE repart PARTITION (Test='20150401') select col1, col2 FROM repart where Test = 20150402;

it leaves the table looking like this. The original data in partition 20150401 has been overwritten.

2   b   20150401
4   d   20150401
2   b   20150402
4   d   20150402

The "hive --version" command returns 0.12.0-cdh5.0.0. I noticed this JIRA, but my table name is already all lowercase, so I'm not sure what the problem is.

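1 answer:

Answer 0 (score: 0)

When I used Hive 1.1.0-cdh5.4.0, the same code ran without problems, so INSERT INTO on a partition must have been broken somewhere around 0.12. I will use the newer version. If anyone knows why 0.12.0 breaks, I would still be interested.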
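
If upgrading is not an option, one possible workaround for an external text table like this one is to skip INSERT INTO and copy the data file into the target partition's directory at the HDFS level. The sketch below is only an illustration, not something from the original post: the function name append_file_to_partition and the HADOOP_CMD override variable are hypothetical, and it assumes the partition is already registered at that location (as done with ALTER TABLE ... ADD PARTITION above), so Hive reads every file found in the directory.

```shell
# Hypothetical helper (not from the original post): copy one data file into
# another partition's directory, leaving the files already there untouched.
# HADOOP_CMD is an assumed override hook; it defaults to the real `hadoop` CLI.
append_file_to_partition() {
    src_file="$1"   # e.g. /user/cloudera/repart2/20150402/file2.txt
    dest_dir="$2"   # e.g. /user/cloudera/repart2/20150401/
    # `hadoop fs -cp` refuses to overwrite an existing file of the same name
    # unless -f is given, so existing partition data is not replaced.
    ${HADOOP_CMD:-hadoop} fs -cp "$src_file" "$dest_dir"
}

# Usage with the paths from the question (uncomment on a cluster):
# append_file_to_partition /user/cloudera/repart2/20150402/file2.txt /user/cloudera/repart2/20150401/
```

This copies whole files only, so it fits the case in the question, where every row of partition 20150402 should also appear in 20150401. It would not help if the SELECT needed to filter or transform rows.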