Question

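I want to apply an archive-and-purge mechanism to Hive tables, covering both internal and external tables, partitioned and non-partitioned.

I have a site_visitors table partitioned by visit_date. I want to archive the site_visitors data for users who have not visited my site in the past year. I also don't want to keep the archived data in the same table directory; it can go to some other specific location.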

Answer 1

You can handle this at the level of the partition directories in HDFS. Here is one way to do it.

Your internal (main) table sits on top of HDFS, and its partition directories will look something like this:

hdfs://namenode/user/hive/warehouse/schema.db/site_visitors/visit_date=2017-01-01
hdfs://namenode/user/hive/warehouse/schema.db/site_visitors/visit_date=2017-01-02
hdfs://namenode/user/hive/warehouse/schema.db/site_visitors/visit_date=2017-01-03

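You can create an archive table on top of HDFS, or, if you only want to keep the archived data, dump the partitions to another location in HDFS. Either way, your archive location in HDFS will look like this: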

hdfs://namenode/hdfs_location/site_visitors/visit_date=2017-01-01
hdfs://namenode/hdfs_location/site_visitors/visit_date=2017-01-02
hdfs://namenode/hdfs_location/site_visitors/visit_date=2017-01-03
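If you go the archive-table route, one possible setup is an external table whose location is the archive directory. The sketch below only builds the HiveQL; the function name build_archive_ddl and its arguments are hypothetical, and it assumes the archive table should have the same columns and partitioning as the main table (hence CREATE TABLE ... LIKE).

```shell
#!/bin/bash
# Sketch only: build the HiveQL for an external archive table over the archive location.
# build_archive_ddl and its parameters are illustrative names, not part of Hive.
build_archive_ddl() {
  local schema="$1"         # e.g. the database holding site_visitors
  local source_table="$2"   # main table whose layout is copied
  local archive_table="$3"  # name of the archive table to create
  local location="$4"       # HDFS directory that holds the archived partitions
  cat <<EOF
CREATE EXTERNAL TABLE IF NOT EXISTS ${schema}.${archive_table} LIKE ${schema}.${source_table} LOCATION '${location}';
MSCK REPAIR TABLE ${schema}.${archive_table};
EOF
}

# Example call (CONN_URL is assumed to be set in your environment):
# beeline -u "${CONN_URL}" -e "$(build_archive_ddl schema site_visitors archive_table /hdfs_location/site_visitors)"
```

MSCK REPAIR TABLE is what registers any visit_date=... directories already present under the archive location as partitions of the new table. Because the table is external, dropping it later leaves the archived files in place.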

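You can run a UNIX shell script, JavaScript, or whatever other language is used in your environment to move the files from one HDFS location to the archive HDFS location based on the partition date.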
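A minimal sketch of such a move script in bash follows. It assumes partition directories are named visit_date=YYYY-MM-DD, that the hdfs CLI is on the PATH, and that `hdfs dfs -ls -C` (paths-only listing) is available in your Hadoop version. SRC_DIR, ARCHIVE_DIR and the function names are placeholders, not anything from the original answer.

```shell
#!/bin/bash
# Sketch only: move partition directories older than a cutoff date to the archive location.
# SRC_DIR and ARCHIVE_DIR are placeholder paths; override them for your cluster.
SRC_DIR="${SRC_DIR:-/user/hive/warehouse/schema.db/site_visitors}"
ARCHIVE_DIR="${ARCHIVE_DIR:-/hdfs_location/site_visitors}"

# Succeeds when a partition directory name (visit_date=YYYY-MM-DD) is strictly older than the cutoff.
is_older_than() {
  local part_date="${1#visit_date=}"
  local cutoff="$2"
  # ISO dates compare correctly as plain strings
  [[ "$part_date" < "$cutoff" ]]
}

# Moves every partition directory older than the cutoff (YYYY-MM-DD) into ARCHIVE_DIR.
archive_old_partitions() {
  local cutoff="$1"
  hdfs dfs -mkdir -p "$ARCHIVE_DIR"
  hdfs dfs -ls -C "$SRC_DIR" | while read -r path; do
    name="$(basename "$path")"
    # Skip anything that is not a visit_date partition directory
    [[ "$name" == visit_date=* ]] || continue
    if is_older_than "$name" "$cutoff"; then
      hdfs dfs -mv "$path" "$ARCHIVE_DIR/$name"
    fi
  done
}

# Example call (GNU date): archive everything older than one year
# archive_old_partitions "$(date --date='1 year ago' +%Y-%m-%d)"
```

Moving directories this way does not touch the Hive metastore, so the main table will still list the moved partitions until they are dropped with ALTER TABLE ... DROP PARTITION.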

Alternatively, you can use the approach below: load the data into an archive table and then drop it from the original table.

#!/bin/bash
# Usage: pass the retention period in days as $1; CONN_URL, SCHEMA and TABLE_NAME must be set in the environment
set -e  # stop before dropping partitions if the archive insert fails
ARCHIVE=$1
# archive_dt is the cutoff date (today minus ARCHIVE days, GNU date syntax); it is used for loading and for the alterations
archive_dt=$(date --date="${ARCHIVE} days ago" +%Y-%m-%d)
# You can use hive, beeline or impala to insert the data into the archive table; this example uses beeline
# A dynamic-partition insert needs nonstrict mode, and the date literal must be quoted
beeline -u "${CONN_URL}" -e "SET hive.exec.dynamic.partition.mode=nonstrict; INSERT INTO TABLE ${SCHEMA}.archive_table PARTITION (visit_date) SELECT * FROM ${SCHEMA}.${TABLE_NAME} WHERE visit_date < '${archive_dt}'"
# After the data has been loaded into the archive table, drop the partitions from the original table
beeline -u "${CONN_URL}" -e "ALTER TABLE ${SCHEMA}.${TABLE_NAME} DROP IF EXISTS PARTITION (visit_date < '${archive_dt}')"
# Repair the tables to sync the metadata after the alterations
beeline -u "${CONN_URL}" -e "MSCK REPAIR TABLE ${SCHEMA}.${TABLE_NAME}; MSCK REPAIR TABLE ${SCHEMA}.archive_table"

Hive table archiving
