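I want to remove the double quotes (") from the beginning and end of each field. I am trying to use a regexp in Pig, but it does not seem to work.
Input: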
(main_170521230001.csv,"9","2017-05-21 23:00:01.472636")
(main_170521230001.csv,"91","2017-05-21 23:00:01.472636")
(main_170521230001.csv,"592","2017-05-21 23:00:01.472636")
Pig script:
raw = LOAD '/data/csv' using PigStorage(',','-tagFile') as (
fn:chararray,
gid:chararray,
createdts:chararray);
res = foreach raw generate
REGEX_EXTRACT(fn, '([^"](.*)[^"])',1) as (fn:chararray),
REGEX_EXTRACT(gid, '([^"](.*)[^"])',1) as (gid:chararray),
REGEX_EXTRACT(createdts, '([^"](.*)[^"])',1) as (createdts:chararray);
dump res;
Output:
(ain_170521230001.cs,,017-05-21 23:00:01.47263)
(ain_170521230001.cs,91,017-05-21 23:00:01.47263)
(ain_170521230001.cs,592,017-05-21 23:00:01.47263)
I expected:
(main_170521230001.csv,9,2017-05-21 23:00:01.472636)
(main_170521230001.csv,91,2017-05-21 23:00:01.472636)
(main_170521230001.csv,592,2017-05-21 23:00:01.472636)
I want to get all the characters between the outermost double quotes. Examples:
"abc" -> abc
abc -> abc
""abc""" -> abc
"a"b"c" -> a"b"c
That is why I use this pattern:
'([^"](.*)[^"])'
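It works fine except in one case: if there is only one character between the double quotes, the pattern returns an empty string. Why does this happen?
Answer 0 (score: 0)
Load the data into a single field and use REPLACE to remove the double quotes. You can then use STRSPLIT to get the individual fields.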
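-- Load each line as a single chararray field.
raw = LOAD '/data/csv' USING TextLoader();
-- Pig string literals use single quotes; this removes every double quote in the line.
res = foreach raw generate REPLACE($0, '"', '');
-- STRSPLIT returns a tuple with at most 3 fields, split on commas.
res_new = foreach res generate STRSPLIT($0, ',', 3);
dump res_new;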
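As for why the original pattern fails: `[^"](.*)[^"]` needs at least two non-quote characters, one for each `[^"]`, so a field such as `"9"` produces no match and REGEX_EXTRACT returns null. Also note that removing every double quote with REPLACE strips quotes inside a field too (`"a"b"c"` would become `abc`). If only the outer quotes should go, a possible alternative is to anchor the pattern to the start and end of each field. The following is a minimal sketch, assuming REPLACE applies Java regular-expression semantics to its second argument and reusing the field names from the question:

```pig
raw = LOAD '/data/csv' USING PigStorage(',', '-tagFile') AS (
    fn:chararray,
    gid:chararray,
    createdts:chararray);

-- ^"+ matches a run of quotes at the start of the field,
-- "+$ matches a run of quotes at the end; quotes in the middle are left alone.
res = FOREACH raw GENERATE
    REPLACE(fn, '^"+|"+$', '') AS fn,
    REPLACE(gid, '^"+|"+$', '') AS gid,
    REPLACE(createdts, '^"+|"+$', '') AS createdts;

DUMP res;
```

With this pattern a field without quotes passes through unchanged, and a single quoted character such as `"9"` should come out as `9`, because the pattern matches only the quotes and never requires a minimum number of characters between them.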