What is the difference between using DataFrame and RDD in Spark 1.5.2?

Asked: 2015-12-28 05:07:54

Tags: mongodb apache-spark apache-spark-sql

I read data from MongoDB and then map it to InteractionItem.

 val df = filterByParams(startTs, endTs, widgetIds, documents)
    .filter(item => {
      item._2.get("url") != "localhost" && !EXCLUDED_TRIGGERS.contains(item._2.get("trigger"))
    })
    .flatMap(item => {
      var res = Array[InteractionItem]()

      try {
        val widgetId = item._2.get("widgetId").toString
        val timestamp = java.lang.Long.parseLong(item._2.get("time").toString)
        val extra = item._2.get("extra").toString
        val extras = parseExtra(extra)
        val c = parseUserAgent(extras.userAgent.getOrElse(""))
        val os = c.os.family
        val osVersion = c.os.major
        val device = c.device.family
        val browser = c.userAgent.family
        val browserVersion = c.userAgent.major
        val adUnit = extras.adunit.get
        val gUid = extras.guid.get
        val trigger = item._2.get("trigger").toString
        val objectName = item._2.get("object").toString
        val response = item._2.get("response").toString
        val ts: Long = timestamp - timestamp % 3600


        // look up the interaction configuration that matches this trigger/object/response
        val interaction = interactionConfiguration.filter(interaction =>
          interaction.get("trigger") == trigger &&
            interaction.get("object") == objectName &&
            interaction.get("response") == response).head
        val clickThrough = interaction.get("clickThrough").asInstanceOf[Boolean]
        val interactionId = interaction.get("_id").toString

        adUnitPublishers.filter(x => x._2._2.toString == widgetId && x._1.toString == adUnit).foreach(publisher => {
          res = res :+ InteractionItem(widgetId, ts, adUnit, publisher._2._1.toString, os, osVersion, device, browser, browserVersion,
            interactionId, clickThrough, 1L, gUid)
        })
        bdPublishers.filter(x => x._1.toString == widgetId).foreach(publisher => {
          res = res :+ InteractionItem(widgetId, ts, adUnit, publisher._2.toString, os, osVersion, device, browser, browserVersion,
            interactionId, clickThrough, 1L, gUid)
        })
      }
      catch {
        case e: Exception => {
          log.info(e.getMessage)
          res = res :+ InteractionItem.invalid()
        }
      }
      res

    }).filter(i => i.interactionCount > 0)
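InteractionItem itself is not shown in the question. A minimal sketch of what it might look like, with field names inferred from the constructor calls above and the grouping key used below (the types and the osVersion/browserVersion/gUid names are assumptions):

    // Hypothetical sketch only -- the real InteractionItem is not part of the question.
    // Field names are inferred from the constructor calls above and the key built below.
    case class InteractionItem(
      widgetId: String,
      date: Long,
      section: String,
      publisher: String,
      os: String,
      osVersion: String,
      device: String,
      browser: String,
      browserVersion: String,
      id: String,
      clickThrough: Boolean,
      interactionCount: Long,
      gUid: String)

    object InteractionItem {
      // Sentinel that is filtered out later by .filter(i => i.interactionCount > 0)
      def invalid(): InteractionItem =
        InteractionItem("", 0L, "", "", "", "", "", "", "", "", false, 0L, "")
    }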

With the RDD approach, I map again and reduceByKey:

    .map(i => ((i.widgetId, i.date, i.section, i.publisher, i.os, i.device, i.browser, i.clickThrough, i.id), i.interactionCount))
    .reduceByKey((a, b) => a + b)

With the DataFrame approach, I convert:

    .toDF()

    df.registerTempTable("interactions")
    df.cache()
    val v = sqlContext.sql("SELECT id, clickThrough, widgetId, date, section, publisher, os, device, browser, interactionCount" +
      " FROM interactions GROUP BY id, clickThrough, widgetId, date, section, publisher, os, device, browser, interactionCount")

From what I can see in the Spark UI, the DataFrame version needs 210 stages?

(Spark UI screenshot)

For the RDD, it is only 20 stages:

(Spark UI screenshot)

What am I doing wrong here?

1 Answer:

Answer 0 (score: 0):

The operations you run on the RDD and on the DataFrame are not the same.
The DataFrame path takes longer because of the following additional work:

  1. registerTempTable()
  2. cache()
  3. While the RDD path only reduces by the given key expression, the DataFrame path treats the whole dataset as a table and also prepares it for caching, which consumes extra CPU and storage resources (see the sketch below for an equivalent DataFrame aggregation).
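Note also that the GROUP BY in the question lists interactionCount itself and has no aggregate function, so it behaves like a DISTINCT rather than the per-key sum computed by reduceByKey. A minimal sketch of an aggregation that does mirror the reduceByKey, assuming the column names match the InteractionItem fields, and needing neither registerTempTable() nor cache():

    import org.apache.spark.sql.functions.sum

    // Sketch only: df is assumed to be the DataFrame produced by .toDF() above.
    // Sums interactionCount per key, mirroring the RDD reduceByKey, without
    // registering a temp table or caching the DataFrame.
    val aggregated = df
      .groupBy("id", "clickThrough", "widgetId", "date", "section",
               "publisher", "os", "device", "browser")
      .agg(sum("interactionCount").as("interactionCount"))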
