Does Spark optimize network traffic for broadcast variables?

Asked: 2019-01-27 20:01:26

Tags: apache-spark

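Knowing that Spark runs multiple executors on each worker node, and that each executor runs in its own JVM, I am wondering whether (and how) Spark optimizes network traffic for broadcast variables. Ideally it would download the data once per worker node and then hand the already-serialized data to the executors on that node. The alternative would be for every executor to download the broadcast data whenever it needs it, so the same data would be downloaded several times on a given node.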

1 answer:

Answer 0 (score: 1)

Yes, Spark does optimize broadcasting: it uses torrent broadcast. Quoting the source (the doc comment on `TorrentBroadcast`):

* A BitTorrent-like implementation of [[org.apache.spark.broadcast.Broadcast]].
*
* The mechanism is as follows:
*
* The driver divides the serialized object into small chunks and
* stores those chunks in the BlockManager of the driver.
*
* On each executor, the executor first attempts to fetch the object from its BlockManager. If
* it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or
* other executors if available. Once it gets the chunks, it puts the chunks in its own
* BlockManager, ready for other executors to fetch from.
*
* This prevents the driver from being the bottleneck in sending out multiple copies of the
* broadcast data (one per executor).
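
The mechanism quoted above can be illustrated with a small toy model. This is a minimal sketch in plain Python, not Spark's implementation: `Node`, `split_into_chunks`, `fetch_broadcast`, and `simulate` are hypothetical names, a dictionary stands in for each BlockManager, and the 16-byte chunk size is chosen only to keep the demo small (Spark's real chunk size is governed by `spark.broadcast.blockSize`).

```python
import pickle
import random


def split_into_chunks(data, chunk_size):
    """Split serialized bytes into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


class Node:
    """Hypothetical stand-in for a driver or executor with its own block store."""

    def __init__(self, name):
        self.name = name
        self.store = {}   # chunk index -> bytes (plays the role of a BlockManager)
        self.served = 0   # how many chunks this node has sent to other nodes


def fetch_broadcast(executor, nodes, num_chunks, rng):
    """Fetch every missing chunk from any node that already holds it."""
    for idx in range(num_chunks):
        if idx in executor.store:
            continue  # already cached locally, no network transfer needed
        holders = [n for n in nodes if n is not executor and idx in n.store]
        source = rng.choice(holders)  # the driver or any executor that has it
        executor.store[idx] = source.store[idx]  # now available to later fetchers
        source.served += 1
    # Reassemble the chunks in order and deserialize the original object.
    return pickle.loads(b"".join(executor.store[i] for i in range(num_chunks)))


def simulate(value, num_executors=4, chunk_size=16, seed=0):
    """Broadcast `value` from a driver to several executors, one after another."""
    rng = random.Random(seed)
    driver = Node("driver")
    chunks = split_into_chunks(pickle.dumps(value), chunk_size)
    driver.store = dict(enumerate(chunks))
    executors = [Node(f"executor-{i}") for i in range(num_executors)]
    nodes = [driver] + executors
    results = [fetch_broadcast(e, nodes, len(chunks), rng) for e in executors]
    return driver, executors, results


if __name__ == "__main__":
    driver, executors, _ = simulate({"lookup": list(range(50))})
    for node in [driver] + executors:
        print(node.name, "served", node.served, "chunks")
```

In this model the first executor has to pull every chunk from the driver, while later executors can spread their fetches across the driver and the executors that already hold the chunks. Note that the unit of caching here, as in the quoted comment, is the executor rather than the worker node.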

There used to be another broadcast implementation (HTTP broadcast), but it was removed completely in 2.0.
