Python UDF in Pig does not return a record

Date: 2016-05-09 17:11:24

Tags: python apache-pig

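I am passing a bag to Python and want to get a single record back from the Python UDF. I must be doing something wrong in the outputSchema, because every column ends up as its own tuple. Any help is greatly appreciated.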

Pig code:

REGISTER 'priority.py' USING jython AS callme;
A = LOAD 'addr_input/addr.dat' USING PigStorage(',') AS (A:chararray, B:chararray, C:chararray, ID:chararray, ID_TYPE:chararray);

B = DISTINCT A;
Z = GROUP B BY (A, B, C);
O = FOREACH Z GENERATE callme.unique_list($1);
DUMP O;

Python code:

@outputSchema('relationships:{t:(A:chararray, B:chararray, C:chararray, ID:chararray, ID_TYPE:chararray)}')
def unique_list(input):
    my_list = list(input)
    print(my_list)
    last_list = []
    zipcnt = -1
    citicnt = -1
    countycnt = -1
    statecnt = -1
    return_list_zip = []
    return_list_city = []
    return_list_county = []
    return_list_state = []
    return_list_country = []
    for j in range(len(my_list)):
        if my_list[j][4] == 'zip':
            zipcnt = len(my_list)
            return_list_zip = list(my_list[j])
            continue
        elif my_list[j][4] == 'city' and zipcnt == -1:
            citicnt = len(my_list)
            return_list_city = list(my_list[j])
            continue
        elif my_list[j][4] == 'county' and zipcnt == -1 and citicnt == -1:
            countycnt = len(my_list)
            return_list_county = list(my_list[j])
            continue
        elif my_list[j][4] == 'state' and zipcnt == -1 and citicnt == -1 and countycnt == -1:
            statecnt = len(my_list)
            return_list_state = list(my_list[j])
            continue
        elif my_list[j][4] == 'country' and zipcnt == -1 and citicnt == -1 and countycnt == -1 and statecnt == -1:
            return_list_country = list(my_list[j])
            continue
    if zipcnt != -1:
        return_list = return_list_zip
    elif citicnt != -1:
        return_list = return_list_city
    elif countycnt != -1:
        return_list = return_list_county
    elif statecnt != -1:
        return_list = return_list_state
    else:
        return_list = return_list_country
    return return_list

Output I get:

({(aa),(bb),(cc),(1),(zip)})
({(lll),(ccc),(ddd),(6),(city)})
({(lll),(ccc),(xxx),(7),(country)})
({(mmm),(nnn),(cc),(4),(zip)})

--- every column comes out as a tuple!

Output I expect:

  {aa,bb,cc,1,zip}
  {lll,ccc,ddd,6,city}

Thank you very much for your help.

2 answers:

Answer 0 (score: 0)

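A bag is a collection of tuples, so the output you got is valid. Your outputSchema declares a bag (the curly braces), which means each element of the list you return is wrapped in its own tuple.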
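If a single record per group is what you want, one possible fix is to declare a tuple in the schema (parentheses, no curly braces) and return a Python tuple. The following is only a sketch under assumptions: the function name `unique_record`, the `PRIORITY` list, and the fallback decorator are hypothetical names introduced here so the file also runs outside Pig. It also simplifies the selection logic to "keep the row with the highest-priority ID_TYPE, the last one winning on ties" and returns None instead of an empty list when nothing matches.

```python
# Assumption: Pig's Jython engine provides the outputSchema decorator.
# The fallback below exists only so this sketch can be run and tested standalone.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap

# Hypothetical priority order taken from the branches in the question's code.
PRIORITY = ['zip', 'city', 'county', 'state', 'country']


# Tuple schema: parentheses only, no enclosing {t:(...)} bag.
@outputSchema('relationship:(A:chararray, B:chararray, C:chararray, ID:chararray, ID_TYPE:chararray)')
def unique_record(bag):
    best = None
    for row in bag:
        id_type = row[4]
        if id_type not in PRIORITY:
            continue  # e.g. 'street' rows are ignored, as in the question
        # "<=" keeps the last row seen when two rows share the same priority
        if best is None or PRIORITY.index(id_type) <= PRIORITY.index(best[4]):
            best = row
    # Return a tuple (not a list) so that Pig maps it to a tuple, not a bag.
    return tuple(best) if best is not None else None
```

On the Pig side the call would then look something like `O = FOREACH Z GENERATE FLATTEN(callme.unique_record($1));`, where FLATTEN lifts the tuple's fields to the top level of each output row.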

Answer 1 (score: 0)

Try:

A = LOAD 'addr_input/addr.dat' USING PigStorage(',') AS (A:chararray, B:chararray, C:chararray, ID:chararray, ID_TYPE:chararray);

B = DISTINCT A;
Z = GROUP B BY (A, B, C);
DESCRIBE Z;
O = FOREACH Z { f1 = FOREACH B GENERATE $0, $1, $2, $3, $4; GENERATE FLATTEN(f1); }
DUMP O;

For the input you gave, this is the output:

(aa,bb,cc,1,zip)
(aa,bb,cc,2,street)
(lll,ccc,ddd,6,city)
(lll,ccc,xxx,7,country)
(mmm,nnn,cc,3,county)
(mmm,nnn,cc,4,zip)
(mmm,nnn,cc,5,state)

Is this what you are looking for?