Hi!
I have a text file with the following content:
$ hdfs dfs -cat result/*
[5,AA,ABE,US,AGU,MX,DNE0M0Z1,99991231,20160421,MX13,706,1,,33,,BOX,,,60,INNJ,31,2419221]
[5,AA,ABE,US,AGU,MX,DNE0M0Z1,99991231,20160421,MX13,706,1,,33,,BOX,,,60,INNJ,31,2419244]
[5,AA,ABE,US,AGU,MX,DNE0M0Z1,99991231,20160421,MX13,706,1,,33,,BOX,,,60,INNJ,31,2419319]
This file is generated in HDFS by Spark. I want to create a Hive table that reads this file so the results can be queried as a table. The problem is that each record starts with [ and ends with ]. Can I do this without modifying the txt file, since it is generated automatically?
Right now my table is:
DROP TABLE IF EXISTS RESULT_LATAM;
CREATE EXTERNAL TABLE IF NOT EXISTS RESULT_LATAM
(
FARDET_NUM_RULE_TARIFF BIGINT,
FARDET_CD_CARRIER VARCHAR(3),
FARDET_CD_ORIGIN_CITY VARCHAR(5),
FARDET_CD_ORIGIN_COUNTRY VARCHAR(2),
FARDET_CD_DEST_CITY VARCHAR(5),
FARDET_CD_DEST_COUNTRY VARCHAR(2),
FARDET_CD_FARE_BASIS VARCHAR(8),
.
.
.
)
STORED AS TEXTFILE
LOCATION '/user/ubuntu/result/';
Answer:
There is no direct way to achieve this, but to demonstrate the solution I will use a smaller number of columns; the same idea applies. You have to build a custom EDW-style solution: load the data into a staging table, then perform the transformation/cleanup while loading it into the main table.
Sample data:
[5,A1]
[6,A2]
[7,A3]
Create the staging table:
create external table table_stg(x string,y string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
Create the main table:
create external table table_main(x int,y VARCHAR(10))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
Load the data into the staging table (note that LOAD DATA INPATH moves the file from its source location):
LOAD DATA INPATH '/user/cloudera/result.txt' INTO TABLE table_stg;
hive> select * from table_stg;
OK
[5 A1]
[6 A2]
[7 A3]
Time taken: 0.086 seconds, Fetched: 3 row(s)
Load the cleaned data into the main table:
insert into table table_main
select cast(regexp_replace(x, '\\[', '') as int), regexp_replace(y, '\\]', '')
from table_stg;
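Applied to the real 22-column file from the question, only the first and last fields carry a bracket, so only those two need cleaning; the middle columns pass through unchanged. A sketch (the middle column list is elided here, as it is in the question, and `result_latam_stg` is an assumed staging table with the same comma-delimited columns, all declared as STRING):

```sql
-- Sketch only: result_latam_stg is a hypothetical staging table with the
-- same 22 comma-delimited columns as RESULT_LATAM, all declared as STRING.
INSERT INTO TABLE RESULT_LATAM
SELECT
  CAST(regexp_replace(FARDET_NUM_RULE_TARIFF, '\\[', '') AS BIGINT), -- strip leading [
  FARDET_CD_CARRIER,
  FARDET_CD_ORIGIN_CITY,
  -- ... remaining columns unchanged ...
  regexp_replace(last_column, '\\]', '')                             -- strip trailing ]
FROM result_latam_stg;
```

The backslashes are doubled (`\\[`, `\\]`) because `[` and `]` are regex metacharacters that must be escaped inside `regexp_replace`.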
Final output:
hive> select * from table_main;
OK
5 A1
6 A2
7 A3
Time taken: 0.155 seconds, Fetched: 3 row(s)
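If you would rather skip the staging step entirely, a possible alternative (a sketch only, assuming Hive's bundled regex SerDe `org.apache.hadoop.hive.serde2.RegexSerDe` is available in your Hive version; older versions ship it as `org.apache.hadoop.hive.contrib.serde2.RegexSerDe`) is to strip the brackets at read time, shown here with the two-column sample above:

```sql
-- Each line like [5,A1] is matched as a whole; one capture group per column,
-- so the surrounding brackets are consumed by the pattern but not captured.
CREATE EXTERNAL TABLE table_regex (x STRING, y STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "\\[(.*),(.*)\\]"
)
STORED AS TEXTFILE
LOCATION '/user/ubuntu/result/';
```

Note that RegexSerDe exposes every column as STRING, so numeric values would still need a CAST in queries, but the Spark-generated files never have to be modified or reloaded.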