Hive regular expression for extracting keywords from a URL

Time: 2018-04-20 22:57:18

Tags: regex hive hiveql regex-negation regex-lookarounds

The file names look like this:

  1. file:///storage/emulated/0/SHAREit/videos/Dangerous_Hero_(2017)____Latest_South_Indian_Full_Hindi_Dubbed_Movie___2017_.mp4

  2. file:///storage/emulated/0/VidMate/download/%E0%A0_-_Promo_Songs_-_Khiladi_-_Khesari_Lal_-_Bho.mp4

  3. file:///storage/emulated/0/WhatsApp/Media/WhatsApp%20Video/VID-20171222-WA0015.mp4

  4. file:///storage/emulated/0/bluetooth/%5DChitaChola%7B%7D%D8%B9%D8%A7%D9%85%D8%B1%24%20.3gp

I want to write a Hive regular expression that extracts the words from each string.

For example, for the first string the output should be: storage,emulated,....

Update

This code gives me the result, but I want a regular expression rather than code.

    package uri_keyword_extractor;
    
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;
    
    import java.util.ArrayList;
    
    public class UDFUrlKeywordExtractor extends UDF {
        private  Text result = new Text();
    
        public  Text evaluate(Text url) {
            if (url == null) {
                return null;
            }
            String keywords = url_keyword_maker(url.toString());
            result.set(keywords);
            return result;
        }
    
        private static String url_keyword_maker(String url) {
            // Collect every run of ASCII letters that is at least two characters long.
            ArrayList<String> keywordAr = new ArrayList<String>();
            char[] charAr = url.toCharArray();
            for (int i = 0; i < charAr.length; i++) {
                int current_index = i;
                StringBuilder sb = new StringBuilder();
                // Scan up to and including the last character so that a trailing
                // letter (the 'p' in ".3gp") is kept.
                while (current_index < charAr.length && isChar(charAr[current_index])) {
                    sb.append(charAr[current_index]);
                    current_index = current_index+1;
                }
                String word = sb.toString();
                if (word.length() >= 2) {
                    keywordAr.add(word);
                }
                // The for loop's i++ then skips the non-letter that ended the run.
                i = current_index;
            }
            //
            StringBuilder sb = new StringBuilder();
            for(int i =0; i < keywordAr.size();i++) {
                String current = keywordAr.get(i);
                sb.append(current);
                if(i < keywordAr.size() -1) {
                    sb.append(",");
                }
            }
            return sb.toString();
        }
    
        private static  boolean isChar(char ch) {
            // True only for ASCII letters: A-Z (65-90) or a-z (97-122).
            return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
        }
    
        public static void main(String[] args) {
            // TODO Auto-generated method stub
            String test1 = "file:///storage/emulated/0/SHAREit/videos/Dangerous_Hero_(2017)____Latest_South_Indian_Full_Hindi_Dubbed_Movie___2017_.mp4";
            String test2 = "file:///storage/emulated/0/VidMate/download/%E0%A0_-_Promo_Songs_-_Khiladi_-_Khesari_Lal_-_Bho.mp4";
            String test3 = "file:///storage/emulated/0/bluetooth/%5DChitaChola%7B%7D%D8%B9%D8%A7%D9%85%D8%B1%24%20.3gp";
            System.out.println(url_keyword_maker(test1).toString());
            System.out.println(url_keyword_maker(test2).toString());
            System.out.println(url_keyword_maker(test3).toString());
        }
    }
    

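1 Answer:

Answer 0 (score: 0)

Use the split(str, regex_pattern) function: it splits str using the regular expression as the delimiter pattern and returns an array. Then use lateral view + explode to explode the array, and filter the keywords by length as in your Java code. Then apply collect_set to reassemble the array of keywords, plus the concat_ws(delimiter, str) function to convert the array to a delimited string if necessary. The regular expression I pass to split is '[^a-zA-Z]'. Note that collect_set also removes duplicate keywords, which your Java code does not do; collect_list keeps them.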

Demo:

select url_nbr, concat_ws(',',collect_set(key_word)) keywords from
(--your URLs example, url_nbr here is just for reference
select 'file:///storage/emulated/0/SHAREit/videos/Dangerous_Hero_(2017)____Latest_South_Indian_Full_Hindi_Dubbed_Movie___2017_.mp4' as url, 1 as url_nbr union all
select 'file:///storage/emulated/0/VidMate/download/%E0%A0_-_Promo_Songs_-_Khiladi_-_Khesari_Lal_-_Bho.mp4' as url, 2 as url_nbr union all
select 'file:///storage/emulated/0/WhatsApp/Media/WhatsApp%20Video/VID-20171222-WA0015.mp4' as url, 3 as url_nbr union all
select 'file:///storage/emulated/0/bluetooth/%5DChitaChola%7B%7D%D8%B9%D8%A7%D9%85%D8%B1%24%20.3gp' as url, 4 as url_nbr)s
lateral view explode(split(url, '[^a-zA-Z]')) v as key_word
where length(key_word)>=2 --filter here
group by url_nbr
;

Output:

OK
1       file,storage,emulated,SHAREit,videos,Dangerous,Hero,Latest,South,Indian,Full,Hindi,Dubbed,Movie,mp
2       file,storage,emulated,VidMate,download,Promo,Songs,Khiladi,Khesari,Lal,Bho,mp
3       file,storage,emulated,WhatsApp,Media,Video,VID,WA,mp
4       file,storage,emulated,bluetooth,DChitaChola,gp
Time taken: 37.767 seconds, Fetched: 4 row(s)
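
To check the pattern without a Hive cluster, the same steps can be mimicked in plain Java. This is only a sketch: the class and method names are made up for illustration, and it assumes Hive's split treats '[^a-zA-Z]' the way Java's regex engine does. A LinkedHashSet stands in for collect_set; unlike collect_set it keeps first-seen order, which Hive does not promise.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper class; the name is illustrative and not part of Hive.
final class SplitKeywordSketch {

    // Mimics split(url, '[^a-zA-Z]') + length filter + collect_set + concat_ws(',').
    static String keywords(String url) {
        // Drops duplicates like collect_set, but also preserves insertion order.
        Set<String> unique = new LinkedHashSet<>();
        // Every non-letter character is a delimiter, so runs of delimiters
        // produce empty tokens; the length filter discards them.
        for (String token : url.split("[^a-zA-Z]")) {
            if (token.length() >= 2) {
                unique.add(token);
            }
        }
        return String.join(",", unique);
    }

    public static void main(String[] args) {
        // prints: file,storage,emulated,WhatsApp,Media,Video,VID,WA,mp
        System.out.println(keywords(
            "file:///storage/emulated/0/WhatsApp/Media/WhatsApp%20Video/VID-20171222-WA0015.mp4"));
    }
}
```

For the third URL this gives the same keyword list as row 3 of the output, with WhatsApp listed once.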

Maybe I missed something in your Java code, but hopefully you have caught the idea, so you can easily modify my code and add extra processing if necessary.
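
If the logic has to stay in a Java UDF, one possible modification is to match the keywords directly instead of splitting on delimiters. The sketch below assumes the pattern '[a-zA-Z]{2,}' (runs of two or more ASCII letters), which is the inverse view of the split pattern; the class and method names are hypothetical. Unlike the collect_set query, it keeps duplicates and their original order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
final class MatchKeywordSketch {

    // A keyword is a run of two or more ASCII letters.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]{2,}");

    static String keywords(String url) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(url);
        // find() walks the input left to right, one non-overlapping match at a time.
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return String.join(",", words);
    }

    public static void main(String[] args) {
        // prints: file,storage,emulated,WhatsApp,Media,WhatsApp,Video,VID,WA,mp
        System.out.println(keywords(
            "file:///storage/emulated/0/WhatsApp/Media/WhatsApp%20Video/VID-20171222-WA0015.mp4"));
    }
}
```

Because the minimum length is part of the pattern, no separate length check is needed.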