Pig REGEX_EXTRACT not working

Time: 2017-07-10 09:14:09

Tags: regex hadoop apache-pig

I want to remove the double quotes ("") from the beginning and end of each field. I am trying to use a regex in Pig, but it does not seem to work.

Input:

(main_170521230001.csv,"9","2017-05-21 23:00:01.472636")
(main_170521230001.csv,"91","2017-05-21 23:00:01.472636")
(main_170521230001.csv,"592","2017-05-21 23:00:01.472636")

Pig script:

raw = LOAD '/data/csv' using PigStorage(',','-tagFile') as (
  fn:chararray,
  gid:chararray,
  createdts:chararray);

res = foreach raw generate
        REGEX_EXTRACT(fn, '([^"](.*)[^"])',1) as (fn:chararray),
        REGEX_EXTRACT(gid, '([^"](.*)[^"])',1) as (gid:chararray),
        REGEX_EXTRACT(createdts, '([^"](.*)[^"])',1) as (createdts:chararray);

dump res;

Output:

(ain_170521230001.cs,,017-05-21 23:00:01.47263)
(ain_170521230001.cs,91,017-05-21 23:00:01.47263)
(ain_170521230001.cs,592,017-05-21 23:00:01.47263)

Expected:

(main_170521230001.csv,9,2017-05-21 23:00:01.472636)
(main_170521230001.csv,91,2017-05-21 23:00:01.472636)
(main_170521230001.csv,592,2017-05-21 23:00:01.472636)

I want to get all the characters between the outer double quotes. Examples:

"abc" -> abc
abc -> abc
""abc""" -> abc
"a"b"c" -> a"b"c

That is why I use this pattern:

'([^"](.*)[^"])'

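It works fine except in one case: if there is only a single character between the double quotes, the pattern returns an empty string. Why does this happen?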

1 Answer:

Answer 0: (score: 0)

Load the data into a single field and use REPLACE. You can then use STRSPLIT to get the individual fields.

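-- Load each line as one chararray field
raw = LOAD '/data/csv' USING TextLoader();
-- Remove every double quote in the line (interior quotes are removed too)
res = foreach raw generate REPLACE($0,'\\"','');
-- Split on commas into at most 3 fields; STRSPLIT returns a tuple
res_new = foreach res generate STRSPLIT($0,',',3);
dump res_new;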
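
As for why "9" comes back empty: Pig's regex functions are built on Java's java.util.regex, so the behavior can be reproduced in plain Java. The sketch below is only an illustration and not part of the original answer; the class name QuoteStrip and the helpers originalMatches and stripOuterQuotes are hypothetical. The pattern [^"](.*)[^"] needs at least two non-quote characters, one for each [^"], so a quoted single character can never match. A pattern that strips only the leading and trailing quote runs, ^"+|"+$, avoids that and keeps interior quotes.

```java
import java.util.regex.Pattern;

public class QuoteStrip {
    // The pattern from the question: one non-quote char, anything, one non-quote char.
    static final Pattern ORIGINAL = Pattern.compile("([^\"](.*)[^\"])");

    // True if the question's pattern can be found anywhere in the input.
    static boolean originalMatches(String s) {
        return ORIGINAL.matcher(s).find();
    }

    // Removes runs of double quotes at the start and at the end only;
    // quotes in the middle of the string are left alone.
    static String stripOuterQuotes(String s) {
        return s.replaceAll("^\"+|\"+$", "");
    }

    public static void main(String[] args) {
        // A single quoted character offers only one non-quote char, so no match.
        System.out.println(originalMatches("\"9\""));   // false
        System.out.println(originalMatches("\"91\""));  // true

        String[] samples = {"\"9\"", "\"abc\"", "abc", "\"\"abc\"\"\"", "\"a\"b\"c\""};
        for (String s : samples) {
            System.out.println(s + " -> " + stripOuterQuotes(s));
        }
    }
}
```

Assuming Pig's REPLACE applies the regex the same way as Java's String.replaceAll, the equivalent per-field expression in the script would be REPLACE(gid, '^"+|"+$', ''), which also keeps the three-column layout produced by -tagFile.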