Question

I have two tables, tabl1:

+-------+--------+--------+----------+
| att1  |  att2  | att3   | att4     |
+-------+--------+--------+----------+
|  abcd | ava012 | df012f | afsdaldf |
.......

and tabl2:

+----+
| val|
+----+
| 012|
...

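tabl2 contains substrings that occur in one or more of the 4 columns of tabl1. Both are large tables with millions of records. I tried concatenating the tabl1 columns and searching within the result, but the query never finishes. Is there an efficient way to do this? Perhaps by converting the whole table to a txt file and searching in that? I have also been following this question.

Here are some examples of my attempts (all in Hive):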

SELECT a.*, b.*
from tabl1 a, tabl2 b
where instr(
        concat(cast(a.att1 as string), cast(a.att2 as string),
               cast(a.att3 as string), cast(a.att4 as string)),
        cast(b.val as string)
      ) > 0

or

SELECT a.*, b.*
from tabl1 a, tabl2 b
where concat(cast(a.att1 as string), cast(a.att2 as string),
             cast(a.att3 as string), cast(a.att4 as string))
      like concat('%', cast(b.val as string), '%')

I also tried a few regex variants, but they ran endlessly as well...

Answer 1

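select  *
from    (
        -- emit one row per distinct digit run found in att1..att4
        select  *
        from    tabl1 t1
                lateral view explode(
                    split(
                        regexp_replace(
                            -- replace every run of non-digits with a single space, then trim
                            trim(regexp_replace(concat_ws(',',att1,att2,att3,att4),'\\D+',' ')),
                            -- drop a token if the same token appears again later in the string
                            '(?<=^| )(?<token>.*?) (?=.*(?<= )\\k<token>(?= |$))',
                            ''
                        ),
                        ' '
                    )
                ) e as val
        ) t1
join    tabl2 t2
on      t2.val = t1.val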
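
For readers who want to see the idea outside Hive, here is a minimal Python sketch of what the query above appears to do: pull the distinct runs of digits out of each tabl1 row, then match them against tabl2 with an equality lookup instead of comparing every row pair. The helper names `digit_tokens` and `substring_join` and the sample rows are made up for illustration. It also assumes, as the query seems to, that every tabl2 value is purely numeric and equals a whole digit run, so a value such as `12` would not match `ava012`.

```python
import re


def digit_tokens(*columns):
    """Return the distinct runs of digits found across the given column values."""
    joined = ",".join(str(c) for c in columns)
    return set(re.findall(r"\d+", joined))


def substring_join(tabl1_rows, tabl2_vals):
    """Yield (row, val) pairs where val equals a whole digit run in the row."""
    lookup = set(tabl2_vals)  # hash lookup instead of a pairwise substring scan
    for row in tabl1_rows:
        # sorted() only makes the output order deterministic
        for token in sorted(digit_tokens(*row) & lookup):
            yield row, token


# Illustrative data only; the first row mirrors the sample in the question.
tabl1 = [
    ("abcd", "ava012", "df012f", "afsdaldf"),
    ("x7y", "zz", "q99", "7"),
]
tabl2 = ["012", "99", "5"]

matches = list(substring_join(tabl1, tabl2))
for row, val in matches:
    print(row, val)
```

The point of the design is that the expensive substring test is replaced by an equality test on extracted tokens, which lets the engine use an ordinary equi-join rather than a cross join filtered by `instr` or `like`.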

Efficient way to join large tables on a substring using Hive / Impala or other means
