Spark streaming group rdd by key and group on Paired RDDs and pick latest from each group

Time: 2017-12-18 06:42:05

Tags: scala apache-spark spark-streaming

I am new to Spark and Scala and am trying to achieve the following. My messages look like this: (key, id, version, dataObject)

val transformedRDD = processedMessages.flatMap(message => {
    message.isProcessed match {
      case true => Some(message.key, message.id, message.version, message)
      case false => None
    }
  }).groupByKey

I want to group the messages by id and keep only the latest version of each message, then group by key, and then call a predefined method that looks like this:

Ingest(key,RDD[dataObject])

1 answer:

Answer 0 (score: 0)

In most cases, you should avoid groupByKey, as it may result in a re-shuffle, which can be very expensive. In your use case, you do not require a groupByKey and can use reduceByKey instead.
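
A minimal sketch of that approach, not a definitive implementation. It assumes a hypothetical `Message` case class carrying the fields named in the question (`key`, `id`, `version`, `isProcessed`) plus a placeholder `data` payload; the helper names `newer` and `latestPerId` are made up for illustration. Each message is keyed by `(key, id)`, and `reduceByKey` keeps the one with the highest version, combining within each partition before anything is shuffled:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical message shape; the real class in the question is not shown.
case class Message(key: String, id: String, version: Long,
                   isProcessed: Boolean, data: String)

object LatestPerId {
  // Keep whichever of two messages has the higher version.
  def newer(a: Message, b: Message): Message =
    if (a.version >= b.version) a else b

  // For one batch: drop unprocessed messages, keep the latest
  // version per (key, id), then re-key the survivors by key alone.
  def latestPerId(messages: RDD[Message]): RDD[(String, Message)] =
    messages
      .filter(_.isProcessed)
      .map(m => ((m.key, m.id), m))
      .reduceByKey(newer)
      .map { case ((key, _), m) => (key, m) }
}
```

If `processedMessages` is a `DStream[Message]`, something like `processedMessages.transform(rdd => LatestPerId.latestPerId(rdd))` would apply this to every batch. How the result is then handed to `Ingest` depends on that method's actual signature, which the question only outlines.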