How to compute multiple statistics in a single Hadoop Streaming job

Date: 2017-02-02 10:45:08

Tags: python hadoop mapreduce

I have to compute several different statistics from some input data. This is the mapper I have:

import sys

label_1 = "tot_lan"
label_2 = "max_lan"
label_3 = "ur_bin"
label_4 = "ur_record"

total_docs_counter = 0
no_lang_info = 0

for line in sys.stdin:
    line = line.strip()
    line2 = line.split('|')
    lang = line2[-1]
    total_docs_counter += 1

    if lang == '-':
        no_lang_info += 1
        continue

    doc_url = line2[0]
    lang = lang.split('&')
    max_lang = ""
    max_percent = 0

    for num, ln in enumerate(lang):
        if num == 0:
            continue
        tmp = ln.split('-')
        lang_id = tmp[0]
        # update_lang_statistics(total_lang_record, lang_id)
        print "%s\t%s\t1" % (label_1, lang_id)
    print "%s\t%s\t1" % (label_2, max_lang)

I prefix every output record with a label so that in the reducer I can tell which category it belongs to. Then, with a simple conditional in the reducer, I can produce the desired output for each category. Obviously the inner loop emits more records than the outer loop. I test locally like this:

cat input | ./mapper.py | sort | ./reducer.py

It works fine locally, but when I run the job in Hadoop on some sample data, it looks as if no reduce step takes place: the output file contains the mapper output, i.e. the mapper output is not being reduced. My reducer is quite simple; it just aggregates each category separately. Where is the problem? Is there a better way to do this?
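The labeled-record contract itself can be checked end to end in plain Python, independent of Hadoop. Below is a minimal Python 3 sketch that simulates `mapper | sort | reducer` on made-up sample records; the input format (fields separated by `|`, last field like `&en-80&ur-20`) is inferred from the mapper code, the sample URLs are invented, and the reducer is simplified to a dict since everything fits in memory locally:

```python
# Simulate `mapper | sort | reducer` on a few made-up input records.
# Assumed record format (inferred from the post's mapper):
#   doc_url|...|&lang-percent&lang-percent   or   '-' for no language info.

def mapper(lines):
    out = []
    for line in lines:
        fields = line.strip().split('|')
        lang = fields[-1]
        if lang == '-':          # no language info: skip, as in the post
            continue
        for num, ln in enumerate(lang.split('&')):
            if num == 0:         # first element of the split is empty
                continue
            lang_id = ln.split('-')[0]
            out.append("tot_lan\t%s\t1" % lang_id)
    return out

def reducer(lines):
    # Locally a dict is enough; in streaming, equal keys arrive adjacent
    # after sort and are counted group by group instead.
    counts = {}
    for line in lines:
        label, name, freq = line.split('\t')
        key = (label, name)
        counts[key] = counts.get(key, 0) + int(freq)
    return ["%s\t%s\t%d" % (l, n, c) for (l, n), c in sorted(counts.items())]

sample = [
    "http://a.example|&en-80&ur-20",
    "http://b.example|&en-100",
    "http://c.example|-",            # no language info, skipped
]
result = reducer(sorted(mapper(sample)))
for row in result:
    print(row)
```

If this prints the expected per-language totals, the labeling scheme is sound and any discrepancy on the cluster lies in how the streaming job is configured, not in the record format.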

Here is the reducer code:

import sys

def counting_reducer(line, temp, count, label):
    # the label parsed from the line overwrites the label argument;
    # the two are identical because the caller dispatches on the prefix
    label, name, freq = line.split("\t")
    freq = int(freq)
    if name == temp:
        count += freq
    else:
        if temp:
            print '%s\t%s\t%s' % (label, temp, count)
        count = freq
        temp = name

    return [temp, count]

label_1 = "tot_lan"
label_2 = "max_lan"
label_3 = "ur_bin"
label_4 = "ur_record"

#TMP variables
temp_l1 = None
count_l1 = 0
temp_l2 = None
count_l2 = 0
temp_l3 = None
count_l3 = 0
temp_l4 = None
count_l4 = 0

urdu_bin_record = {}

skip = 0
for line in sys.stdin:
    line = line.strip()
    if line.startswith(label_1):
        temp_l1, count_l1 = counting_reducer(line, temp_l1, count_l1, label_1)
    elif line.startswith(label_2):
        temp_l2, count_l2 = counting_reducer(line, temp_l2, count_l2, label_2)
    else:
        print line

# emit the last group of each label, which the loop above never prints
if temp_l1:
    print '%s\t%s\t%s' % (label_1, temp_l1, count_l1)
if temp_l2:
    print '%s\t%s\t%s' % (label_2, temp_l2, count_l2)

0 Answers:

No answers yet.