Spark dataframe: change a column value to a timestamp

Asked: 2018-04-10 08:36:32

Tags: scala apache-spark dataframe

I have a jsonl file that I've read in, created a temporary table view from, and filtered down to the records I want to amend.

val df = session.read.json("tiny.jsonl")
df.createOrReplaceTempView("tempTable")
val filter = df.select("*").where("field IS NOT NULL")

Now I'm at the part where I've been trying various things. I want to change a column called "time" to the currentTimestamp before I write it back. Sometimes I'll want to change the currentTimestamp to be timestampNow - 5 days, for example.

val change = filter.withColumn("server_time", date_add(current_timestamp(), -1))

The example above gets me back today minus 1 day, but as a date, not a timestamp.
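Checking the schema confirms why: date_add returns a DateType column, so the time-of-day component is dropped (the column name d here is just for illustration):

import org.apache.spark.sql.functions._

// date_add produces a DateType column, so the time-of-day is lost
df.withColumn("d", date_add(current_timestamp(), -1)).printSchema()
// the schema reports d as date, not timestamp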

Edit: Sample dataframe that mimics my jsonl input:

  val df = Seq(
    (1, "fn", "2018-02-18T22:18:28.645Z"),
    (2, "fu", "2018-02-18T22:18:28.645Z"),
    (3, null, "2018-02-18T22:18:28.645Z")
  ).toDF("id", "field", "time")

Expected output:

+---+-----+------------------------+
| id|field|time                    |
+---+-----+------------------------+
|  1|fn   |2018-04-09T22:18:28.645Z|
|  2|fu   |2018-04-09T22:18:28.645Z|
+---+-----+------------------------+

1 Answer:

Answer (score: 1)

If you want to replace the time column with the current timestamp, you can use the current_timestamp function. To add a number of days you can use SQL INTERVAL:
import org.apache.spark.sql.functions._
import session.implicits._ // session is your SparkSession

val df = Seq(
  (1, "fn", "2018-02-18T22:18:28.645Z"),
  (2, "fu", "2018-02-18T22:18:28.645Z"),
  (3, null, "2018-02-18T22:18:28.645Z")
).toDF("id", "field", "time")
  .na.drop() // drop the rows with a null field, like your IS NOT NULL filter

val ddf = df
  .withColumn("time", current_timestamp())                  // replace time with the current timestamp
  .withColumn("newTime", $"time" + expr("INTERVAL 5 DAYS")) // add 5 days, still a timestamp

Output:

+---+-----+-----------------------+-----------------------+
|id |field|time                   |newTime                |
+---+-----+-----------------------+-----------------------+
|1  |fn   |2018-04-10 15:14:27.501|2018-04-15 15:14:27.501|
|2  |fu   |2018-04-10 15:14:27.501|2018-04-15 15:14:27.501|
+---+-----+-----------------------+-----------------------+
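Since the question also mentions sometimes wanting the current timestamp minus a number of days, subtracting an interval works the same way and keeps the timestamp type (a minimal sketch against the same df):

// shift the current timestamp back 5 days; the result is still a timestamp, not a date
val shifted = df.withColumn("time", current_timestamp() - expr("INTERVAL 5 DAYS"))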