Question

The following snippet takes a long time to process 4 GB of raw data on a cluster:

df.select("type", "user_pk", "item_pk","timestamp")
      .withColumn("date",to_date(from_unixtime($"timestamp")))
      .filter($"date" > "2018-04-14")
      .select("type", "user_pk", "item_pk")
      .map {
        row => {
          val typef = row.get(0).toString
          val user = row.get(1).toString
          val item = row.get(2).toString
          (typef, user, item)
        }
      }

The output should be of type Dataset[(String, String, String)].

My guess is that the map part takes most of the time. Is there a way to optimize this code?

Answer 1

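I seriously doubt that map is the problem, but I would not use it at all. I would cast the columns and use the standard Dataset converter instead: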

import org.apache.spark.sql.functions.{from_unixtime, to_date}
import df.sparkSession.implicits._

df.select("type", "user_pk", "item_pk","timestamp")
  .withColumn("date",to_date(from_unixtime($"timestamp")))
  .filter($"date" > "2018-04-14")
  .select($"type" cast "string", $"user_pk" cast "string", $"item_pk" cast "string")
  .as[(String,String,String)]

Answer 2

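You are creating the date column with a date type and then comparing it to a string? I assume some implicit conversion is happening underneath (for every row during filtering).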

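Instead, I would convert that string to a timestamp and do an integer comparison (since you use from_unixtime, I assume the timestamp is stored as epoch time, e.g. from System.currentTimeMillis or similar; note that from_unixtime itself expects seconds, not milliseconds):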

// some_to_timestamp_func is a placeholder for any string-to-epoch conversion
val timestamp = some_to_timestamp_func("2018-04-14")
df.select("type", "user_pk", "item_pk","timestamp")
  .filter($"timestamp" > timestamp)
... etc
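
One possible way to fill in the some_to_timestamp_func placeholder, as a minimal sketch rather than a definitive implementation. It assumes the timestamp column holds epoch seconds and that dates are interpreted in UTC (Spark's to_date(from_unixtime(...)) uses the session time zone, so adjust the zone if yours differs). The name cutoffEpochSeconds is hypothetical. Because the original filter keeps rows whose date is strictly after 2018-04-14, the cutoff is the start of the following day and the comparison becomes >=:

```scala
import java.time.{LocalDate, ZoneOffset}

// Hypothetical helper: epoch seconds of the first instant AFTER the given day.
// Assumes UTC and a timestamp column stored in seconds (not milliseconds).
def cutoffEpochSeconds(day: String): Long =
  LocalDate.parse(day)            // parses ISO dates such as "2018-04-14"
    .plusDays(1)                  // move to the start of the next day
    .atStartOfDay(ZoneOffset.UTC)
    .toEpochSecond

val cutoff: Long = cutoffEpochSeconds("2018-04-14")

// Sketch of the usage against the DataFrame from the question:
// df.filter($"timestamp" >= cutoff)
//   .select($"type" cast "string", $"user_pk" cast "string", $"item_pk" cast "string")
//   .as[(String, String, String)]
```

Computing the cutoff once on the driver means the filter compares the raw numeric column against a constant, so no date column has to be derived for each row. If the column is actually in milliseconds, multiply the cutoff by 1000.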

Optimizing a piece of code that uses a map operation
