Spark - how to stop retrying and ignore the exception

Asked: 2015-10-12 05:34:15

Tags: apache-spark spark-streaming

I am running Spark locally to understand how countByValueAndWindow works:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val Array(brokers, topics) = Array("192.xx.xx.x", "test1")

// Create context with a 2 second batch interval
val sparkConf = new SparkConf().setAppName("ReduceByWindowExample").setMaster("local[1,1]")
val ssc = new StreamingContext(sparkConf, Seconds(2)) // batch interval of 2 seconds
ssc.checkpoint("D:\\SparkCheckPointDirectory")

// Create direct Kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)

// Parse each message value as an Int and count the values over a sliding window
val lines = messages.map(_._2.toInt)
val keyValuelines = lines.map { x => (x, 1) }

val windowedlines = lines.countByValueAndWindow(Seconds(4), Seconds(2)) // (window length, slide interval)
// val windowedlines = lines.reduceByWindow((x, y) => { x + y }, Seconds(4), Seconds(2))
windowedlines.print()

ssc.start()
ssc.awaitTermination()

Everything works fine as long as the Kafka topic carries numeric data, since I am calling toInt. When a blank string "" is written to the topic, it fails with a NumberFormatException, which is fine, but the problem is that it retries the same record relentlessly and keeps throwing the same NumberFormatException. Is there a way to control how many times Spark attempts the conversion to Int, so that Spark tries only a limited number of times and then moves on to the next batch of data?
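For reference, the failure itself is just the standard parse error on a blank string, as you would see in a Scala REPL:

scala> "".toInt
java.lang.NumberFormatException: For input string: ""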

2 Answers:

Answer 0 (score: 0):

While there may be a way to configure a maximum number of retries for a particular record, I think the right approach is to actually handle the exception. I believe the following code will filter out the offending records:

import scala.util.Try
...
val keyValueLines = messages.flatMap { case (e1, e2) =>
  val e2int = Try(e2.toInt)
  if (e2int.isSuccess) Option((e2int.get, 1)) else None
}

The flatMap() transformation removes the Nones from the result, and unwraps the (Int, Int) tuples from the Option for all other records.
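If you do want to cap the retries rather than handle the exception, the only knobs I am aware of are the task-failure settings: in local mode it is the second number in the master URL, local[threads, maxFailures] (already 1 in your code), and on a cluster it is the spark.task.maxFailures property. A rough sketch; note that this only limits how often a failed task is re-attempted, it does not skip the bad record, so the job for that batch still fails:

val sparkConf = new SparkConf()
  .setAppName("ReduceByWindowExample")
  .setMaster("local[1,1]")              // local[threads, maxFailures]
  .set("spark.task.maxFailures", "1")   // cluster-side equivalent; default is 4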

Answer 1 (score: 0):

You should use exception handling, one of the best features of typed languages like Java and Scala, to make sure your program does not fail. Here is how I would edit the code; please verify whether it works for you.

import scala.util.Try

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val Array(brokers, topics) = Array("192.xx.xx.x", "test1")

// Create context with a 2 second batch interval
val sparkConf = new SparkConf().setAppName("ReduceByWindowExample").setMaster("local[1,1]")
val ssc = new StreamingContext(sparkConf, Seconds(2)) // batch interval of 2 seconds
ssc.checkpoint("D:\\SparkCheckPointDirectory")

// Create direct Kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)

// Parse each message value as an Int, falling back to 0 when parsing fails
val lines = messages.map { x =>
  val convertedValue = Try(x._2.toInt)
  if (convertedValue.isSuccess) convertedValue.get else 0
}

val keyValuelines = lines.map { x => (x, 1) }

val windowedlines = lines.countByValueAndWindow(Seconds(4), Seconds(2)) // (window length, slide interval)
// val windowedlines = lines.reduceByWindow((x, y) => { x + y }, Seconds(4), Seconds(2))
windowedlines.print()

ssc.start()
ssc.awaitTermination()
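One caveat with this version: every unparseable message becomes a 0, so zeros show up in the windowed counts. If you would rather drop bad records entirely (the same idea as the flatMap approach in the other answer), a one-line variant would be:

// Drop messages whose value does not parse as an Int instead of mapping them to 0
val lines = messages.flatMap(x => Try(x._2.toInt).toOption)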