hive sql

Time: 2016-01-19 09:30:47

Tags: java memory hive

I am running the following HQL:

select new.uid as uid, new.category_id as category_id, new.atag as atag,
new.rank_idx + CASE when old.rank_idx is not NULL then old.rank_idx else 0 END as rank_idx 
from (
        select a1.uid, a1.category_id, a1.atag, row_number() over(distribute by a1.uid, a1.category_id sort by a1.cmt_time) as rank_idx from (
            select app.uid, 
            CONCAT(cast(app.knowledge_point_id_list[0] as string),'#',cast(app.type_id as string))  as category_id, 
            app.atag as atag, app.cmt_time as cmt_time 
            from model.mdl_psr_app_behavior_question_result app 
            where app.subject = 'english' 
            and app.dt = '2016-01-14'
            and app.cmt_timelen > 1000
            and app.cmt_timelen < 120000
        ) a1 
    ) new
left join (
    select uid, category_id, rank_idx from model.mdl_psr_mlc_app_count_last
    where subject = 'english'
    and dt = '2016-01-13'
    ) old
on new.uid = old.uid
and new.category_id = old.category_id 

Originally, mdl_psr_mlc_app_count_last and mdl_psr_mlc_app_count_day were stored with JsonSerde, and the query ran fine.

My colleague thought JsonSerde was inefficient and took up too much space, and Parquet looked like the better choice to me.
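For reference, the storage switch presumably amounted to something like the sketch below (hypothetical, since the actual DDL is not part of the question; the column types and the _parquet table name are assumptions):

-- Hypothetical sketch of the JsonSerde-to-Parquet switch; column types are assumed.
CREATE TABLE model.mdl_psr_mlc_app_count_last_parquet (
    uid         BIGINT,
    category_id STRING,
    rank_idx    INT
)
PARTITIONED BY (subject STRING, dt STRING)
STORED AS PARQUET;

-- Repopulate the new table from the old one; partition values taken from the query above.
INSERT OVERWRITE TABLE model.mdl_psr_mlc_app_count_last_parquet
PARTITION (subject = 'english', dt = '2016-01-13')
SELECT uid, category_id, rank_idx
FROM model.mdl_psr_mlc_app_count_last
WHERE subject = 'english' AND dt = '2016-01-13';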

After I made that switch, the query broke with the following error log:

org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 1 rows: used memory = 1024506232
2016-01-19 16:36:56,119 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 10 rows: used memory = 1024506232
2016-01-19 16:36:56,130 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 100 rows: used memory = 1024506232
2016-01-19 16:36:56,248 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 1000 rows: used memory = 1035075896
2016-01-19 16:36:56,694 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 10000 rows: used memory = 1045645560
2016-01-19 16:36:57,056 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 100000 rows: used memory = 1065353232

It looks like a Java memory problem. Someone suggested I try:

SET mapred.child.java.opts=-Xmx900m;
SET mapreduce.reduce.memory.mb=8048;
SET mapreduce.reduce.java.opts='-Xmx8048M';
SET mapreduce.map.memory.mb=1024; 
set mapreduce.map.java.opts='-Xmx4096M';
set mapred.child.map.java.opts='-Xmx4096M';

It still breaks with the same error message. Then someone else suggested:

SET mapred.child.java.opts=-Xmx900m;
SET mapreduce.reduce.memory.mb=1024;
SET mapreduce.reduce.java.opts='-Xmx1024M';
SET mapreduce.map.memory.mb=1024; 
set mapreduce.map.java.opts='-Xmx1024M';
set mapreduce.child.map.java.opts='-Xmx1024M';
set mapred.reduce.tasks = 40;

Now it runs without failing.

Can someone explain to me why?

================================ BTW: although it now runs, the reduce step is extremely slow. While you are here, could you explain that as well?

1 Answer:

Answer 0 (score: 0)

For some reason, YARN supports Parquet rather poorly.

Quoting MapR:

For example, if a MapReduce job sorts Parquet files, the Mapper needs to cache a whole Parquet row group in memory. I have done tests showing that the larger the row group size of a Parquet file is, the more Mapper memory is needed. In that case, make sure the Mapper memory is large enough not to trigger an OOM.
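If one stays on Parquet, a mitigation consistent with that quote is to write smaller row groups, so a Mapper has less to cache at a time. A sketch, assuming the parquet.block.size writer property (the row-group size in bytes); the 32 MB value is illustrative, not taken from the thread:

-- Illustrative only: shrink Parquet row groups from the common 128 MB default to 32 MB.
SET parquet.block.size=33554432;
-- Existing files must be rewritten (e.g. via INSERT OVERWRITE) before the new size takes effect.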

I am not sure why the different settings in the question matter, but the simple solution is to drop Parquet and use ORC. You trade a small performance loss for being bug-free.
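A minimal sketch of that switch, reusing the names from the question (column types assumed, as they are not shown):

-- Sketch: the same table, with only the storage clause changed to ORC.
CREATE TABLE model.mdl_psr_mlc_app_count_last_orc (
    uid         BIGINT,
    category_id STRING,
    rank_idx    INT
)
PARTITIONED BY (subject STRING, dt STRING)
STORED AS ORC;

The data then has to be re-inserted from the old table, after which the query's left join reads the ORC copy.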