Question

I am running a spark-shell job with the following options:

--num-executors 15 
--driver-memory 15G 
--executor-memory 7G 
--executor-cores 8 
--conf spark.yarn.executor.memoryOverhead=2G 
--conf spark.sql.shuffle.partitions=500 
--conf spark.sql.autoBroadcastJoinThreshold=-1 
--conf spark.executor.memoryOverhead=800
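As a rough sanity check on what these flags ask YARN for, the sketch below adds each executor's heap to its off-heap overhead. The two overhead settings conflict (2G versus 800, which Spark reads as MiB when no unit is given), so the sketch computes both cases rather than assuming which one Spark applies. The object and helper names are hypothetical, and YARN's rounding of containers up to its allocation increment is ignored.

```scala
object ExecutorFootprint {
  // Values copied from the spark-shell flags above; sizes in MiB.
  val numExecutors = 15
  val executorMemoryMiB = 7 * 1024 // --executor-memory 7G
  val executorCores = 8            // --executor-cores 8

  // Hypothetical helper: memory YARN must grant for all executor containers,
  // where each container holds the executor heap plus the given overhead.
  def totalContainerMiB(overheadMiB: Int): Int =
    numExecutors * (executorMemoryMiB + overheadMiB)

  // Crude upper bound on heap per concurrently running task (one task per core);
  // Spark reserves part of the heap, so the usable amount is smaller.
  def heapPerTaskMiB: Int = executorMemoryMiB / executorCores

  def main(args: Array[String]): Unit = {
    println(s"overhead 2G:      ${totalContainerMiB(2 * 1024)} MiB")
    println(s"overhead 800 MiB: ${totalContainerMiB(800)} MiB")
    println(s"concurrent tasks: ${numExecutors * executorCores}")
    println(s"heap per task:    $heapPerTaskMiB MiB")
  }
}
```

With these flags every task gets well under 1 GiB of heap, and that heap also has to hold the broadcast copy of the small table.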

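The job gets stuck and cannot start. The code performs a cross join with a filter condition on a large dataset of 270M rows. I have increased the number of partitions of the large table (270M rows) and the small table (100,000 rows) to 16000, and I have already converted the small table into a broadcast variable.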
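The row counts alone explain much of the load. A minimal sketch (plain arithmetic, no Spark required) of how many candidate pairs a filtered cross join has to evaluate, assuming the large side is spread evenly over the 16000 partitions and the small side is broadcast to every task; skewed partitions would make individual tasks heavier still.

```scala
object CrossJoinSize {
  // Row counts and partition count stated in the question.
  val largeRows: Long = 270000000L // 270M-row table
  val smallRows: Long = 100000L    // 100,000-row table
  val partitions: Long = 16000L    // partitions after repartitioning

  // A cross join must evaluate the filter condition on every
  // (large row, small row) pair before any pair can be discarded.
  def candidatePairs: Long = largeRows * smallRows

  // Pairs per task under the even-distribution assumption.
  def pairsPerTask: Long = candidatePairs / partitions

  def main(args: Array[String]): Unit = {
    println(f"candidate pairs: $candidatePairs%,d")
    println(f"pairs per task:  $pairsPerTask%,d")
  }
}
```

Adding partitions only divides this fixed total; it does not shrink it.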

I have added screenshots of the Spark UI for the job.

So should I reduce the number of partitions, or increase the number of executors? Any ideas?

Thanks for your help.

![spark ui 1][1] ![spark ui 2][2] ![spark ui 3][3]

After 10 hours:

Status: Tasks: 7341/16936 (16624 failed)

Checking the container error log:

Failed while trying to construct the redirect url to the log server. Log Server url may not be configured
java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all.

![spark ui 1, 50 completed][4] ![spark ui 2, 50 completed][5]

[1]: https://i.stack.imgur.com/nqcys.png
[2]: https://i.stack.imgur.com/S2vwL.png
[3]: https://i.stack.imgur.com/81FUn.png
[4]: https://i.stack.imgur.com/h5MTa.png
[5]: https://i.stack.imgur.com/yDfKF.png

Answer 1

It would help if you could mention your cluster configuration.

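However, since you are broadcasting the small table, you need to adjust your memory configuration: broadcasting 1,000 rows would be fine, but 100,000 rows may not fit as comfortably.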

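Based on your configuration, I assume you have a total of 15 * 7 = 105GB of executor memory.

You can try splitting the same 105GB into fewer, larger executors.

This will give each executor more memory to hold the broadcast variable. Adjust your settings accordingly to use it properly: --num-executors 7 --executor-memory 15G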
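A small sketch comparing the two layouts. It assumes `--executor-cores` stays at 8 in the suggested layout (the answer does not say), and the `Layout` class is a hypothetical helper, not a Spark API. It relies on Spark keeping one copy of a broadcast variable per executor, so each executor's heap must fit the whole small table.

```scala
object ExecutorLayout {
  // Hypothetical description of one executor layout (memory in GiB).
  final case class Layout(executors: Int, memoryGiB: Int, cores: Int) {
    def totalHeapGiB: Int = executors * memoryGiB
    // One broadcast copy is cached per executor, not per task.
    def broadcastCopies: Int = executors
    def heapPerCoreGiB: Double = memoryGiB.toDouble / cores
  }

  val current = Layout(executors = 15, memoryGiB = 7, cores = 8)
  val suggested = Layout(executors = 7, memoryGiB = 15, cores = 8) // cores assumed unchanged

  def main(args: Array[String]): Unit =
    for ((name, l) <- Seq("current" -> current, "suggested" -> suggested))
      println(f"$name%-9s total=${l.totalHeapGiB}%d GiB per-executor=${l.memoryGiB}%d GiB " +
        f"broadcast copies=${l.broadcastCopies}%d heap/core=${l.heapPerCoreGiB}%.2f GiB")
}
```

Under this assumption the total heap stays at 105 GiB and the heap per core roughly doubles, but concurrent task slots drop from 120 to 56 (7 * 8), so the trade-off is fewer, larger executors.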

Spark job cannot start executing
