Spark Streaming job processing time gradually increasing

Time: 2016-12-21 09:44:24

Tags: scala apache-spark-sql spark-streaming

I have a streaming job that does the following:

1. Load the data from hdfs file and register it as a temp table.
2. Then join the temp table with tables present in the hive database.
3. Then send the resulting records to Kafka (a sketch of how such a job is typically driven follows this list).
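
For reference, the streaming part itself is not shown in the sample code further down; this is a minimal sketch of how such per-batch work could be driven, assuming a textFileStream over the HDFS directory and an arbitrary batch interval (both are assumptions, not details from the actual job):

import org.apache.spark.SparkConf
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical skeleton: a stream over an HDFS directory, with the per-batch
// work (temp table, Hive join, Kafka writes) running once per micro-batch.
val conf = new SparkConf().setAppName("hdfs-to-kafka")
val ssc = new StreamingContext(conf, Seconds(60)) // 60s batch interval is a guess
val hContext = new HiveContext(ssc.sparkContext)

// Picks up new files landing in the directory on each batch.
val incoming = ssc.textFileStream("hdfs:///path/to/incoming") // placeholder path

incoming.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // The sample code below (load/filter/join/send to Kafka) would run here
    // for the data that arrived in this batch.
  }
}

ssc.start()
ssc.awaitTermination()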

Initially it takes 12 seconds to complete the first cycle, but after 10 hours this grows to 50 seconds, and I don't understand why. I have also noticed that after 10 hours the shuffle write on each node keeps increasing and has reached 200GB+.

The sample code is:

import org.apache.kafka.clients.producer.ProducerRecord

// Load the CSV data from HDFS and transform each row.
val rowRDD = hContext.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", delimiter)
  .load(path)
  .map(col => dosomething)
// Filter the RDD down to records within the desired time range.
val filteredRDD = rowRDD.filter { col => dosomething }
// Build a DataFrame from the filtered RDD and the schema.
val tblDF = hContext.createDataFrame(filteredRDD, tblSchema).where("crud_status IN ('U','D','I')")
// Register the batch as a temporary table.
tblDF.registerTempTable("name_changed")
// Join the temp table with the tables in the Hive database.
val userDF = hContext.sql(
  "SELECT id, name, account FROM name_changed " +
  "JOIN account ON (name_changed.id = account.id) " +
  "JOIN question ON (account.question = question.question)")
// Send the records to Kafka, creating one producer per partition.
userDF.foreachPartition { records =>
  val producer = getKafkaProducer(kafkaBootstrap)
  records.foreach { rowData =>
    // rowData is assumed to already be an Array[Byte] payload at this point.
    producer.send(new ProducerRecord[String, Array[Byte]](topicName, rowData))
  }
  producer.close()
}
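
getKafkaProducer is not shown above; a minimal sketch of what such a helper could look like, assuming the standard kafka-clients producer with String keys and byte-array values (the serializer choices are assumptions based on the ProducerRecord[String, Array[Byte]] used in the loop):

import java.util.Properties
import org.apache.kafka.clients.producer.KafkaProducer

// Hypothetical helper: one producer per partition, keyed by String with
// Array[Byte] values to match the ProducerRecord type above.
def getKafkaProducer(kafkaBootstrap: String): KafkaProducer[String, Array[Byte]] = {
  val props = new Properties()
  props.put("bootstrap.servers", kafkaBootstrap)
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
  new KafkaProducer[String, Array[Byte]](props)
}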

0 Answers:

No answers yet.