Identify and eliminate duplicate records in Hive

Time: 2015-05-20 07:38:41

Tags: mysql hive hiveql

Here is a sample from the database I am working on:

ID  Issue   Sub Issue   Creation Time            Solved Time
1   A        A1        01-05-2015 00:10:10       01-05-2015 10:20:00
2   B        B1        01-05-2015 00:10:55       01-05-2015 10:30:30
3   A        A2        01-05-2015 00:11:30       02-05-2015 08:10:45
4   A        A1        01-05-2015 00:14:45       01-05-2015 10:25:00
5   D        D4        02-05-2015 13:10:00          NULL
6   B        B1        02-05-2015 00:14:35          NULL

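I want to treat IDs that have the same Issue and Sub Issue, and whose Creation Times are within 5 minutes of each other, as duplicates and eliminate them. When eliminating, if both rows have a Solved Time or neither has one, I can keep either of them. Otherwise, I keep the one that has a Solved Time value.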

Example: 1 & 4 and 2 & 6 are the duplicate IDs in this sample. I delete 1 and 6.
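
The "which row survives" part of this rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated: the `Row` shape and the `pick_survivor` name are hypothetical, and returning the second row when both or neither are solved is an arbitrary tie-break (the question allows either).

```python
from typing import Optional, Tuple

# Hypothetical minimal row shape: (id, solved_time); solved_time is None when unsolved.
Row = Tuple[int, Optional[str]]


def pick_survivor(a: Row, b: Row) -> Row:
    """Return the row to keep from a pair already known to be duplicates."""
    a_solved = a[1] is not None
    b_solved = b[1] is not None
    if a_solved != b_solved:
        # Exactly one row has a Solved Time: keep that one.
        return a if a_solved else b
    # Both solved or both unsolved: either is acceptable; keeping b is an assumption.
    return b


kept_1_4 = pick_survivor((1, "01-05-2015 10:20:00"), (4, "01-05-2015 10:25:00"))
kept_2_6 = pick_survivor((2, "01-05-2015 10:30:30"), (6, None))
print(kept_1_4[0], kept_2_6[0])  # 4 2
```

With this tie-break, the pairs named in the example lose IDs 1 and 6, as described.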

Can someone help me with a Hive / SQL query for this?

1 Answer:

Answer 0: (score: 0)

2 and 6 are not duplicates: they were created on different dates, so the time difference is more than 23 hours.
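
The gaps can be checked directly from the sample timestamps. A minimal Python sketch, assuming the `dd-MM-yyyy` day-first format used in the question (the `minutes_between` helper is a hypothetical name):

```python
from datetime import datetime

FMT = "%d-%m-%Y %H:%M:%S"  # day-month-year with a 24-hour clock, as in the sample data


def minutes_between(earlier: str, later: str) -> float:
    """Minutes elapsed from `earlier` to `later`."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return delta.total_seconds() / 60


gap_1_4 = minutes_between("01-05-2015 00:10:10", "01-05-2015 00:14:45")
gap_2_6 = minutes_between("01-05-2015 00:10:55", "02-05-2015 00:14:35")
print(gap_1_4 <= 5, gap_2_6 <= 5)  # True False
```

IDs 1 and 4 are 275 seconds apart, inside the 5-minute window; IDs 2 and 6 are one day plus 220 seconds apart, far outside it.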

I tested this solution on your sample data and it works correctly:

select id, issue, sub_issue, creation_time, solved_time
from (
    -- calculate is_duplicate_flag for all rows:
    -- a row is flagged when its neighbour in the same issue/sub_issue
    -- partition was created within 5 minutes of it
    -- (HH = 24-hour clock, matching sample times such as 13:10:00)
    select case
               when (unix_timestamp(next_creation_time, 'dd-MM-yyyy HH:mm:ss')
                     - unix_timestamp(creation_time, 'dd-MM-yyyy HH:mm:ss')) / 60 <= 5
                 or (unix_timestamp(creation_time, 'dd-MM-yyyy HH:mm:ss')
                     - unix_timestamp(prev_creation_time, 'dd-MM-yyyy HH:mm:ss')) / 60 <= 5
               then true
               else false
           end as is_duplicate_flag,
           s.*
    from (
        select t.*,
               lead(t.creation_time) over (partition by t.issue, t.sub_issue
                                           order by unix_timestamp(t.creation_time, 'dd-MM-yyyy HH:mm:ss')) as next_creation_time,
               lag(t.creation_time)  over (partition by t.issue, t.sub_issue
                                           order by unix_timestamp(t.creation_time, 'dd-MM-yyyy HH:mm:ss')) as prev_creation_time,
               -- rn = 1 is the preferred row: solved rows first, then the latest creation_time
               row_number() over (partition by t.issue, t.sub_issue
                                  order by case when t.solved_time is not null then 1 else 2 end,
                                           unix_timestamp(t.creation_time, 'dd-MM-yyyy HH:mm:ss') desc) as rn
        from (
            -- inline copy of the sample data from the question
            select 1 as id, 'A' as issue, 'A1' as sub_issue, '01-05-2015 00:10:10' as creation_time, '01-05-2015 10:20:00' as solved_time from default.dual
            union all
            select 2 as id, 'B' as issue, 'B1' as sub_issue, '01-05-2015 00:10:55' as creation_time, '01-05-2015 10:30:30' as solved_time from default.dual
            union all
            select 3 as id, 'A' as issue, 'A2' as sub_issue, '01-05-2015 00:11:30' as creation_time, '02-05-2015 08:10:45' as solved_time from default.dual
            union all
            select 4 as id, 'A' as issue, 'A1' as sub_issue, '01-05-2015 00:14:45' as creation_time, '01-05-2015 10:25:00' as solved_time from default.dual
            union all
            select 5 as id, 'D' as issue, 'D4' as sub_issue, '02-05-2015 13:10:00' as creation_time, NULL as solved_time from default.dual
            union all
            select 6 as id, 'B' as issue, 'B1' as sub_issue, '02-05-2015 00:14:35' as creation_time, NULL as solved_time from default.dual
        ) t
    ) s
) d
-- keep every non-duplicate row, and only the rn = 1 row among flagged duplicates
where case when not is_duplicate_flag then 1 else rn end = 1
order by id

Result:

id  issue  sub_issue  creation_time        solved_time
2   B      B1         01-05-2015 00:10:55  01-05-2015 10:30:30
3   A      A2         01-05-2015 00:11:30  02-05-2015 08:10:45
4   A      A1         01-05-2015 00:14:45  01-05-2015 10:25:00
5   D      D4         02-05-2015 13:10:00  NULL
6   B      B1         02-05-2015 00:14:35  NULL
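
As a rough cross-check of the logic (not of Hive itself), the same steps can be replayed in plain Python: group by issue and sub-issue, flag rows whose previous or next neighbour is within 5 minutes, and among flagged rows keep only the preferred one (solved first, then latest creation time). The names `ROWS`, `created` and `dedupe` are hypothetical, and this is a sketch of the approach rather than a translation of the query.

```python
from datetime import datetime
from itertools import groupby

FMT = "%d-%m-%Y %H:%M:%S"

# (id, issue, sub_issue, creation_time, solved_time), copied from the question
ROWS = [
    (1, "A", "A1", "01-05-2015 00:10:10", "01-05-2015 10:20:00"),
    (2, "B", "B1", "01-05-2015 00:10:55", "01-05-2015 10:30:30"),
    (3, "A", "A2", "01-05-2015 00:11:30", "02-05-2015 08:10:45"),
    (4, "A", "A1", "01-05-2015 00:14:45", "01-05-2015 10:25:00"),
    (5, "D", "D4", "02-05-2015 13:10:00", None),
    (6, "B", "B1", "02-05-2015 00:14:35", None),
]


def created(row):
    """Parse the creation_time column of a row."""
    return datetime.strptime(row[3], FMT)


def dedupe(rows, window_seconds=300):
    kept = []
    by_issue = lambda r: (r[1], r[2])
    for _, group in groupby(sorted(rows, key=by_issue), key=by_issue):
        part = sorted(group, key=created)  # one partition, oldest first
        # gaps[i] is the time between part[i] and part[i + 1] (the lag/lead step)
        gaps = [(created(b) - created(a)).total_seconds()
                for a, b in zip(part, part[1:])]
        flagged = [
            (i > 0 and gaps[i - 1] <= window_seconds)
            or (i < len(gaps) and gaps[i] <= window_seconds)
            for i in range(len(part))
        ]
        # Preferred row (the rn = 1 step): solved before unsolved, then latest created
        best = min(range(len(part)), key=lambda i: (part[i][4] is None, -i))
        kept += [r for i, r in enumerate(part) if not flagged[i] or i == best]
    return sorted(kept, key=lambda r: r[0])


kept_ids = [r[0] for r in dedupe(ROWS)]
print(kept_ids)  # [2, 3, 4, 5, 6]
```

Only ID 1 is dropped, which agrees with the result table above.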