Question

I'm trying to build a recommender with Spark and I'm running out of memory:

Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space

I'd like to increase the memory available to Spark by changing the spark.executor.memory property from within PySpark, at runtime.

Is that possible? If so, how?

Update

Inspired by the link in @zero323's comment, I tried to delete and recreate the context in PySpark:

del sc
from pyspark import SparkConf, SparkContext
conf = (SparkConf().setMaster("http://hadoop01.woolford.io:7077").setAppName("recommender").set("spark.executor.memory", "2g"))
sc = SparkContext(conf = conf)

This returns:

ValueError: Cannot run multiple SparkContexts at once;

Which is odd, because:

>>> sc
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sc' is not defined

Answer 1

You can set spark.executor.memory when you start the pyspark shell:

pyspark --num-executors 5 --driver-memory 2g --executor-memory 2g
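When the job is started as a plain Python script rather than through the pyspark shell, the same flags can usually be passed through the PYSPARK_SUBMIT_ARGS environment variable. The following is a minimal sketch, not a definitive recipe: the helper name build_submit_args is made up for illustration, and it assumes the variable is set before the first SparkContext is created in the process.

```python
import os


def build_submit_args(executor_memory="2g", driver_memory="2g"):
    """Build a PYSPARK_SUBMIT_ARGS value mirroring the CLI flags above.

    Hypothetical helper for illustration; not part of PySpark.
    """
    flags = [
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
    ]
    # PySpark hands this string to spark-submit, which expects the
    # trailing "pyspark-shell" token as the primary resource.
    return " ".join(flags + ["pyspark-shell"])


# Has to happen before the first SparkContext is created in this process;
# an already running JVM will not pick the new value up.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args()
```

--num-executors is left out of the sketch on purpose, since it only applies on some cluster managers (YARN, for example).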

Answer 2

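I'm not sure why the answer above was chosen when it requires restarting the shell and opening it with a different command! It works and is useful, but there is an inline solution, which is what was actually asked for. It is essentially what @zero323 referenced in the comments above, but that link leads to a post describing the implementation in Scala. Below is a working implementation specifically for PySpark.

Note: the SparkContext whose settings you want to modify must not have been started yet; otherwise you need to stop it, modify the settings, and start it again.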

from pyspark import SparkContext
SparkContext.setSystemProperty('spark.executor.memory', '2g')
sc = SparkContext("local", "App Name")

Source: https://spark.apache.org/docs/0.8.1/python-programming-guide.html

P.S. If you need to stop the SparkContext, just use:

SparkContext.stop(sc)

And to double-check the current settings, you can use:

sc._conf.getAll()
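Putting the stop, reconfigure, and restart steps together might look like the sketch below. It also suggests why `del sc` in the question did not help: deleting the Python name does not stop the underlying context, so creating a second one still fails. The function name recreate_context and its default arguments are assumptions for illustration, and SparkContext.getOrCreate() may not be available in very old PySpark releases.

```python
def recreate_context(executor_memory="2g", master="local[*]",
                     app_name="recommender"):
    """Stop whatever SparkContext is active and start one with new settings.

    Hypothetical helper; master and app_name defaults are placeholders.
    """
    # Imported inside the function so the sketch can sit in a module that
    # is loaded before pyspark is on the path.
    from pyspark import SparkConf, SparkContext

    # `del sc` only removes the Python variable. getOrCreate() hands back
    # the context that is still active (or a fresh one) so it can be stopped.
    SparkContext.getOrCreate().stop()

    conf = (SparkConf()
            .setMaster(master)
            .setAppName(app_name)
            .set("spark.executor.memory", executor_memory))
    return SparkContext(conf=conf)
```

Afterwards, `sc = recreate_context("4g")` followed by `sc.getConf().get("spark.executor.memory")` should report the new value. Any RDDs or broadcast variables tied to the old context are lost and need to be rebuilt.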

Answer 3

As far as I know, spark.executor.memory cannot be changed at runtime. The containers on the datanodes are created even before the SparkContext is initialized.

Answer 4

Citing this, as of 2.0.0 you don't have to use SparkContext; you can use SparkSession and its conf method, as below:

spark.conf.set("spark.executor.memory", "2g")
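Because executor memory is read when the executors are launched, calling spark.conf.set on a session that is already running may have no effect on them, and some Spark versions may reject the call. A more dependable variant is to pass the value to the builder before the session exists. This is a sketch under that assumption; the function name and the app name are placeholders.

```python
def session_with_executor_memory(executor_memory="2g", app_name="recommender"):
    """Create a SparkSession with executor memory set up front.

    Hypothetical helper; app_name is a placeholder.
    """
    # Imported inside the function so the sketch can sit in a module that
    # is loaded before pyspark is on the path.
    from pyspark.sql import SparkSession

    # If a session is already active, getOrCreate() returns it instead of
    # launching new executors, so stop the old one first (spark.stop()).
    return (SparkSession.builder
            .appName(app_name)
            .config("spark.executor.memory", executor_memory)
            .getOrCreate())
```

Usage would be `spark = session_with_executor_memory("4g")`; the underlying context is then available as `spark.sparkContext`.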

Increase the memory available to PySpark at runtime
