Pentaho: executing an SQL script to insert data

Asked: 2013-08-08 13:42:40

Tags: mysql pentaho missing-data data-integration

I am building a report that will use imported data to produce a list of missing sequences:

 CREATE  TABLE `client_trans` 
 (
   `id` INT NOT NULL AUTO_INCREMENT,
   `client_id` INT NULL,
   `sequence` INT NULL,
   `other_data` INT NULL,
   PRIMARY KEY (`id`),
   INDEX `client_id_seq` (`client_id` ASC, `sequence` ASC) 
 );

Apart from the id field, nothing is truly unique, not even a combination of values.

The data in this table looks like this (ignoring the other_data field):

id  client_id sequence
1   1000      1
2   1000      2
3   1000      2
4   1000      3
5   1001      1
6   1001      5
7   1001      6
8   1002      4
9   1002      6

As the example above shows, the same client_id / sequence combination can appear more than once, and a sequence need not start at 1 (or at 0).

Although it is possible to run a query to find the missing sequences, for example a variation on the answer to this question, this can take a very long time.
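For reference, here is a minimal sketch of what such a gap-finding query could look like. It is not the linked answer, just one common self-join formulation. It runs against an in-memory SQLite database through Python's `sqlite3` module as a stand-in for MySQL; the helper names `load_sample` and `find_gaps` are made up for illustration.

```python
import sqlite3

# For every distinct (client_id, sequence) whose successor is absent but
# which still has a larger sequence for the same client, report the gap.
GAP_QUERY = """
SELECT DISTINCT
       t1.client_id,
       t1.sequence + 1 AS miss_start,
       (SELECT MIN(t3.sequence)
          FROM client_trans t3
         WHERE t3.client_id = t1.client_id
           AND t3.sequence > t1.sequence) - 1 AS miss_end
  FROM client_trans t1
 WHERE NOT EXISTS (SELECT 1 FROM client_trans t2
                    WHERE t2.client_id = t1.client_id
                      AND t2.sequence = t1.sequence + 1)
   AND EXISTS (SELECT 1 FROM client_trans t4
                WHERE t4.client_id = t1.client_id
                  AND t4.sequence > t1.sequence)
 ORDER BY 1, 2
"""

def load_sample():
    """Build an in-memory copy of the sample client_trans data (SQLite, not MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE client_trans ("
        "id INTEGER PRIMARY KEY, client_id INT, sequence INT, other_data INT)"
    )
    rows = [(1000, 1), (1000, 2), (1000, 2), (1000, 3),
            (1001, 1), (1001, 5), (1001, 6),
            (1002, 4), (1002, 6)]
    conn.executemany(
        "INSERT INTO client_trans (client_id, sequence) VALUES (?, ?)", rows
    )
    return conn

def find_gaps(conn):
    """Return (client_id, miss_start, miss_end) tuples for every gap."""
    return conn.execute(GAP_QUERY).fetchall()

conn = load_sample()
gaps = find_gaps(conn)
```

On the sample data, `gaps` is `[(1001, 2, 4), (1002, 5, 5)]`. The correlated subqueries are the reason this kind of query can become slow on a large table.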

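An alternative to this approach is to run some insert/update queries before or while the data is inserted into the table (using the Pentaho Data Integration tool), maintaining an additional table that holds the missing client_id / sequence values. In the example above, this means that when the (client_id, sequence) value (1001, 5) is inserted, something like the query I came up with below would pick up that sequences 2-4 are missing: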

CREATE TABLE `missing_sequences` (
  `client_id` int(11),
  `miss_start` int(11),
  `miss_end` int(11)
);

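(Note: to make the query easier to test in plain SQL rather than in a Pentaho "Execute SQL statement" step, the INSERT is commented out so that it is just a SELECT.)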

SET @temp_id = 1001;
SET @temp_seq = 5;
/* Replace temp_id, temp_seq references with ? in Pentaho */
/* INSERT INTO missing_sequences (client_id, miss_start, miss_end) */
SELECT @temp_id id, max(t1.sequence) + 1 missing_start, @temp_seq - 1 missing_end
FROM client_trans t1
CROSS JOIN client_trans t2
WHERE t1.client_id = @temp_id
  AND t1.sequence < @temp_seq
  AND t2.client_id = @temp_id
  AND t2.sequence >= @temp_seq - 1
HAVING missing_end >= missing_start;

Result:

id       missing_start        missing_end
1001     2                    4
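The arithmetic behind that result can be checked outside the database. The following is a rough Python mirror of the SELECT, assuming the new row is already present in the table; `gap_before` is a hypothetical name used only here.

```python
def gap_before(existing, new_seq):
    """Return the (start, end) gap just below new_seq, or None if there is none.

    `existing` holds the sequences already stored for one client. The gap
    starts one above the largest stored sequence that is smaller than
    new_seq and ends at new_seq - 1, like the SELECT with its HAVING filter.
    """
    lower = [s for s in existing if s < new_seq]
    if not lower:
        # Nothing smaller is stored, so max() would be NULL in SQL
        # and the HAVING clause would reject the row.
        return None
    start, end = max(lower) + 1, new_seq - 1
    return (start, end) if end >= start else None

result_1001 = gap_before([1, 5, 6], 5)     # client 1001, sequence 5
result_1000 = gap_before([1, 2, 2, 3], 3)  # client 1000, sequence 3
```

Here `result_1001` is `(2, 4)`, matching the row above, and `result_1000` is `None` because 3 directly follows 2.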

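This successfully populates the missing_sequences table, but a problem arises when a row is added that contains one of the previously missing sequences.
(Originally I also had a primary index on client_id and miss_start, which would also take care of duplicate values being added, but I am not entirely sure that is correct.)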

Depending on the sequence number being inserted, one of four possibilities applies, for example:

@temp_seq = missing_start : (@temp_seq = 2) 
    update missing_start += 1
missing_start < @temp_seq < missing_end : (@temp_seq = 3)
    split into two records
@temp_seq = missing_end : (@temp_seq = 4)
    update missing_end -= 1
@temp_seq = missing_start = missing_end : (@temp_id = 1002, @temp_seq = 5)
    delete record from missing_sequences table
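The four cases above can be sketched as a single function on one (missing_start, missing_end) range. This is only an illustration of the logic in Python, not a Pentaho or MySQL implementation; `fill_sequence` is a made-up name.

```python
def fill_sequence(gap, seq):
    """Return the ranges that remain missing after `seq` arrives.

    `gap` is an inclusive (missing_start, missing_end) tuple. The result
    is a list with zero, one or two ranges.
    """
    start, end = gap
    if not start <= seq <= end:
        return [gap]                      # seq is outside this range: unchanged
    if start == end:
        return []                         # single-value range: delete the record
    if seq == start:
        return [(start + 1, end)]         # missing_start += 1
    if seq == end:
        return [(start, end - 1)]         # missing_end -= 1
    return [(start, seq - 1), (seq + 1, end)]  # split into two records
```

For example, `fill_sequence((2, 4), 3)` returns `[(2, 2), (4, 4)]`, and `fill_sequence((5, 5), 5)` returns an empty list.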

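Now comes my question (which arises even earlier once you consider that the imported data may not be sorted): how do I cater for each of these possibilities, as well as the initial insert and duplicates, in a Pentaho Data Integration transformation?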
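To make the required behaviour concrete, here is a small in-memory sketch in Python that handles rows in any order, including the first row for a client and duplicates. It assumes "missing" means any value between the lowest and highest sequence seen so far for a client; the `record` function and the shape of `state` are hypothetical and say nothing about how Pentaho would run it.

```python
def record(state, client_id, seq):
    """Update `state` in place for one arriving (client_id, seq) row.

    state maps client_id -> {"lo": int, "hi": int, "gaps": [(start, end), ...]}
    where gaps are inclusive, sorted, and lie strictly between lo and hi.
    """
    if client_id not in state:
        # Initial insert: nothing can be missing yet.
        state[client_id] = {"lo": seq, "hi": seq, "gaps": []}
        return
    s = state[client_id]
    if seq < s["lo"]:
        # New lowest value: everything between it and the old lowest is missing.
        if seq + 1 <= s["lo"] - 1:
            s["gaps"].insert(0, (seq + 1, s["lo"] - 1))
        s["lo"] = seq
    elif seq > s["hi"]:
        # New highest value: everything between the old highest and it is missing.
        if s["hi"] + 1 <= seq - 1:
            s["gaps"].append((s["hi"] + 1, seq - 1))
        s["hi"] = seq
    else:
        # Inside the known range: either it fills part of a gap or it is a duplicate.
        for i, (start, end) in enumerate(s["gaps"]):
            if start <= seq <= end:
                pieces = [(a, b) for a, b in ((start, seq - 1), (seq + 1, end))
                          if a <= b]
                s["gaps"][i:i + 1] = pieces  # shrink, split or delete
                break

# The sample rows from the question, deliberately shuffled.
state = {}
for client_id, seq in [(1001, 6), (1000, 2), (1002, 6), (1001, 1), (1000, 1),
                       (1000, 3), (1002, 4), (1000, 2), (1001, 5)]:
    record(state, client_id, seq)
```

After the loop, client 1000 has no gaps, client 1001 has `[(2, 4)]` and client 1002 has `[(5, 5)]`, regardless of the order in which the rows arrive.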

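Edit: after some brainstorming I came up with the following script, which seems to work when I run it in MySQL but not when it runs as an "Execute SQL statement" step. This is with a primary index on (client_id, miss_start) on the missing_sequences table: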

SET @orig_start = 0;
SET @orig_end = 0;

SET @temp_client_id = ?;
SET @temp_sequence = ?;

/* Find closest matching record and save start/end values*/
SELECT client_id, @orig_start:=miss_start miss_start, @orig_end:=miss_end miss_end
FROM missing_sequences 
WHERE client_id = @temp_client_id
  AND miss_start <= @temp_sequence
  AND miss_end >= @temp_sequence
LIMIT 1; /* Just in case, delete all matches later anyway */

/* Delete the above record if exists */
DELETE FROM missing_sequences
WHERE client_id = @temp_client_id AND miss_start = @orig_start AND miss_end = @orig_end;

/* Insert new value. This will insert the FIRST value in the table
   eg. if 1-10 is missing and 5 inserted, this will insert 1-4 as missing */
INSERT INTO missing_sequences (client_id, miss_start, miss_end)
SELECT @temp_client_id client_id, @curr_start := max(t1.sequence) + 1 miss_start, @curr_end := @temp_sequence - 1 miss_end
FROM client_trans t1
CROSS JOIN client_trans t2
WHERE t1.client_id = @temp_client_id
  AND t1.sequence < @temp_sequence
  AND t2.client_id = @temp_client_id
  AND t2.sequence >= @temp_sequence - 1
HAVING miss_end >= miss_start
ON DUPLICATE KEY UPDATE client_id = @temp_client_id,miss_start = @curr_start;

/* Insert upper missing value if it is different */
INSERT INTO missing_sequences (client_id, miss_start, miss_end)
SELECT @temp_client_id client_id, @curr_end + 2 missing_start, @orig_end missing_end
FROM dual
WHERE @curr_end + 2 <= @orig_end
ON DUPLICATE KEY UPDATE client_id = @temp_client_id,miss_start = @curr_start;

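The "Execute for each row" and "Variable substitution" boxes are checked, but execution seems inconsistent or does not update the missing_sequences table at all.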

0 answers:

No answers.