Pig - Loading multiple files with different schemas

Date: 2014-03-25 18:35:02

Tags: apache-pig

I am using Pig to load a comma-separated list of files/folders with Hadoop ranges (see this question on how to load multiple files in Pig).

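The problem is that each folder has a different schema file (located alongside the folder). Is it possible to supply multiple schema files at the same time?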

1 answer:

Answer 0 (score: 1)

If your schema files are located outside the folders, you have to declare the schema when you perform the load.

For example:

dataset_A = LOAD '/data/A' using PigStorage('\t') as (id:int, project:chararray, org:chararray); 
dataset_B = LOAD '/data/B' using PigStorage(',') as (id:int, beta:chararray, delta:chararray, echo:int);
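
If repeating the AS clause on every load is inconvenient, one option is to write the data back out once with PigStorage's `-schema` option, which stores a hidden .pig_schema file in the output directory. A minimal sketch, assuming a hypothetical output path `/data/A_with_schema` that is not part of the original question:

```pig
-- Load once with an explicit schema (same declaration as in the example above).
dataset_A = LOAD '/data/A' USING PigStorage('\t') AS (id:int, project:chararray, org:chararray);

-- '-schema' asks PigStorage to write a hidden .pig_schema file next to the output.
-- '/data/A_with_schema' is a hypothetical path chosen for this sketch.
STORE dataset_A INTO '/data/A_with_schema' USING PigStorage('\t', '-schema');
```

A later load of that directory with PigStorage should then be able to omit the AS clause.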



If the schema is declared in a .pig_schema file inside the directory, you can simply perform the load without declaring the schema.

dataset_A = LOAD '/data/A' using PigStorage('\t'); 
dataset_B = LOAD '/data/B' using PigStorage(',');
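
To check that the schema was actually picked up, it may help to DESCRIBE each relation after loading. If the goal is a single combined relation, UNION ONSCHEMA merges inputs by field name rather than by position. The sketch below assumes both directories contain a readable .pig_schema file; the relation name `combined` is chosen here for illustration:

```pig
dataset_A = LOAD '/data/A' USING PigStorage('\t');
dataset_B = LOAD '/data/B' USING PigStorage(',');

-- Print the schema Pig resolved for each relation. If the .pig_schema file
-- was read, the field names and types should show up here.
DESCRIBE dataset_A;
DESCRIBE dataset_B;

-- Merge by field name; a field that is missing from one input is filled with nulls.
-- UNION ONSCHEMA requires every input to have a known schema.
combined = UNION ONSCHEMA dataset_A, dataset_B;
DESCRIBE combined;
```

Whether a union is appropriate depends on the data; if the two folders should stay separate, the two LOAD statements alone are enough.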



/data/A/.pig_schema:

{"fields":
    [{"name":"id","type":10,"description":"autogenerated from Pig Field Schema","schema":null},
    {"name":"project","type":55,"description":"autogenerated from Pig Field Schema","schema":null},
    {"name":"org","type":55,"description":"autogenerated from Pig Field Schema","schema":null}],
    "version":0,"sortKeys":[],"sortKeyOrders":[]}



/data/B/.pig_schema:

{"fields":
    [{"name":"id","type":10,"description":"autogenerated from Pig Field Schema","schema":null},
    {"name":"beta","type":55,"description":"autogenerated from Pig Field Schema","schema":null},
    {"name":"delta","type":55,"description":"autogenerated from Pig Field Schema","schema":null},
    {"name":"echo","type":10,"description":"autogenerated from Pig Field Schema","schema":null}],
    "version":0,"sortKeys":[],"sortKeyOrders":[]}