Why does the Spark executor memory keep growing?

Time: 2016-05-10 08:15:34

Tags: python apache-spark

I submit my code to a standalone cluster running on a single node. The node has 16 GB of memory and an 8-core CPU.

I have noticed that the executor memory keeps growing. It grows slowly, but it really does grow. Eventually the executor memory exceeds the limit I specified and the program hangs, with errors such as:

16/05/09 19:32:32 ERROR ContextCleaner: Error cleaning broadcast 12242
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
    at org.apache.spark.rpc.RpcTime

The submit command I use is:

./bin/spark-submit   --master spark://ES01:7077 --executor-memory 4G --num-executors 1 --total-executor-cores 1 --conf "spark.storage.memoryFraction=0.2" ./mycode.py   1>a.log 2>b.log
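The stack trace points at spark.rpc.askTimeout, which falls back to spark.network.timeout (120 s by default). Raising both is only a stop-gap while the growth is investigated, not a fix for it; a hedged variant of the same command, with illustrative values, would be:

./bin/spark-submit --master spark://ES01:7077 \
  --executor-memory 4G --num-executors 1 --total-executor-cores 1 \
  --conf "spark.storage.memoryFraction=0.2" \
  --conf "spark.network.timeout=300s" \
  --conf "spark.rpc.askTimeout=300s" \
  ./mycode.py 1>a.log 2>b.log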

The workload is not heavy: only a few KB of data are processed in each batch interval, and the executor memory is set to 4 GB.
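To quantify "slowly but really growing", the storage memory tracked by the block manager can be polled from the driver UI's REST API. A minimal sketch, assuming the driver UI is reachable on the default port 4040 at the hostname below (a placeholder) and that the requests library is available:

import time
import requests

DRIVER_UI = "http://ES01:4040"  # assumed driver host/port; adjust to your deployment

def executor_memory_snapshot():
    # The REST API lists the running application, then per-executor summaries.
    apps = requests.get(DRIVER_UI + "/api/v1/applications").json()
    app_id = apps[0]["id"]
    executors = requests.get(DRIVER_UI + "/api/v1/applications/%s/executors" % app_id).json()
    # memoryUsed / maxMemory are the block manager's storage memory, in bytes.
    return [(e["id"], e["memoryUsed"], e["maxMemory"]) for e in executors]

while True:
    print(executor_memory_snapshot())
    time.sleep(60)  # one sample per minute is enough to see a slow trend

Note that this only shows storage memory; the executor process RSS that eventually exceeds the 4 GB setting still has to be watched with OS tools such as top or ps.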

The code is very simple, so I am posting it here in case something in it is the culprit.

from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext, Row, HiveContext
import pyspark.sql.functions as func
from pyspark.sql.window import Window


def getSqlContextInstance(sparkContext):
    # Lazily instantiated singleton HiveContext, reused across batches.
    if ('sqlContextSingletonInstance' not in globals()):
        globals()['sqlContextSingletonInstance'] = HiveContext(sparkContext)
    return globals()['sqlContextSingletonInstance']


def top_n(df, groupby_column, agg_column):
    # Aggregate per (tag, interface, groupby_column), then show the top 2 rows
    # per (tag, interface) ranked by agg_column (dense_rank over a window).
    df_agg = df. \
        groupBy(['tag', 'interface', groupby_column]). \
        agg(func.sum('pkts').alias('pkts'), func.sum('bits').alias('bits'))

    windowSpec = Window. \
        partitionBy(df_agg['tag'], df_agg['interface']). \
        orderBy(df_agg[agg_column].desc())

    rank = func.dense_rank().over(windowSpec)

    top_n_agg = df_agg. \
        select(df_agg['tag'], df_agg['interface'], df_agg[groupby_column], df_agg['pkts'], rank.alias('rank')). \
        filter("rank<=2")
    top_n_agg.show()


def process(time, rdd):
    # Called on the driver once per batch; skip empty batches.
    # `sc` is the global SparkContext created in __main__.
    if rdd.isEmpty():
        return sc.emptyRDD()

    sqlContext = getSqlContextInstance(rdd.context)

    # Convert RDD[String] to RDD[Row] to DataFrame
    parts = rdd.map(lambda l: l.split(","))
    rowRdd = parts.map(lambda p: Row(
        tag=p[0], in_iface=int(p[1]), out_iface=int(p[2]), src_ip=p[3], dst_ip=p[4], src_port=int(p[5]),
        dst_port=int(p[6]), protocol=p[7], ip_dscp=p[8], flow_direction=p[9], pkts=int(p[10]), bits=int(p[11])))

    df = sqlContext.createDataFrame(rowRdd)

    dataframe_ingress = \
        df.filter(df['flow_direction'] == 0). \
            select(df['tag'], df['in_iface'].alias('interface'), df['src_ip'], df['dst_ip'], df['src_port'],
                   df['dst_port'], df['protocol'], df['flow_direction'], df['pkts'], df['bits'])
    dataframe_ingress.cache()

    #########################################
    # top N
    #########################################
    top_n(df=dataframe_ingress, groupby_column='protocol', agg_column='pkts')
    top_n(df=dataframe_ingress, groupby_column='protocol', agg_column='bits')
    top_n(df=dataframe_ingress, groupby_column='src_ip', agg_column='pkts')
    top_n(df=dataframe_ingress, groupby_column='src_ip', agg_column='bits')
    top_n(df=dataframe_ingress, groupby_column='dst_ip', agg_column='pkts')
    top_n(df=dataframe_ingress, groupby_column='dst_ip', agg_column='bits')

    dataframe_ingress.unpersist()


if __name__ == "__main__":
    dataDirectory = '/stream/raw'

    sc = SparkContext(appName="Netflow")
    ssc = StreamingContext(sc, 15)

    lines = ssc.textFileStream(dataDirectory)
    lines.foreachRDD(process)

    ssc.start()
    ssc.awaitTermination()
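For reference, two hedged tweaks to the pattern above, without claiming either is the cause of the growth: call dataframe_ingress.unpersist(blocking=True) inside process() so the cached blocks are actually dropped before the batch returns, and, if the Spark version in use supports it, lower spark.cleaner.periodicGC.interval so the driver GCs more often and the ContextCleaner can release stale broadcasts sooner. A minimal sketch of the driver-side part, with an illustrative 5min value:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Driver-side setting (illustrative value): trigger a periodic driver GC every
# 5 minutes instead of the 30-minute default, so the ContextCleaner can send
# broadcast/RDD cleanup messages to the executor sooner.
conf = (SparkConf()
        .setAppName("Netflow")
        .set("spark.cleaner.periodicGC.interval", "5min"))
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 15)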

0 Answers:

There are no answers yet.