Spark SQL: split or extract words from a string of words

Date: 2019-06-02 17:24:40

Tags: scala apache-spark apache-spark-sql user-defined-functions scala-collections

I have a Spark dataframe that looks like the one below, and I am trying to split the content column into additional columns:

date   time    content

28may  11am    [ssid][customerid,shopid]

Here is what I tried; note that split(col("content"), "\\[") also yields a leading empty token and leaves a trailing "]" on each piece, so the columns do not come out as hoped:

import org.apache.spark.sql.functions._

val personDF2 = personDF.withColumn("temp", split(col("content"), "\\[")).select(
  col("*") +: (0 until 3).map(i => col("temp").getItem(i).as(s"col$i")): _*)

The desired output is:

date   time   content                     col1   col2         col3

28may  11am   [ssid][customerid,shopid]   ssid   customerid   shopid
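
For reference, one direct way to produce exactly that output (a minimal sketch, not from the original post, and assuming every content value has the fixed [word][word,word] shape) is a single regex with three capture groups:

import org.apache.spark.sql.functions._

// Hypothetical sketch: pull the three tokens out with regexp_extract.
val pattern = "\\[(.*?)\\]\\[(.*?),(.*?)\\]"
val extracted = personDF
  .withColumn("col1", regexp_extract(col("content"), pattern, 1))
  .withColumn("col2", regexp_extract(col("content"), pattern, 2))
  .withColumn("col3", regexp_extract(col("content"), pattern, 3))
extracted.show(false)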

2 Answers:

Answer 0 (score: 1)

This assumes the string represents an array of words, as requested. You can also cut down the number of intermediate dataframes to reduce the load on the system. If you end up with more than 9 columns or so, you may want zero-padded names such as c00, c01, etc. so that the generated columns sort correctly, or simply use integers as column names; that is left to you.

import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray

// Set up data
val df = spark.sparkContext.parallelize(Seq(
       ("A", "[foo][customerid,shopid][Donald,Trump,Esq][single]"),
       ("B", "[foo]")
     )).toDF("k", "v")

val df2 =  df.withColumn("words_temp",  regexp_replace($"v", lit("]"), lit("" )))
val df3 = df2.withColumn("words_temp2", regexp_replace($"words_temp", lit(","), lit("[" ))).drop("words_temp") 
val df4 = df3.withColumn("words_temp3", expr("substring(words_temp2, 2, length(words_temp2))")).withColumn("cnt", expr("length(words_temp2)")).drop("words_temp2") 
val df5 = df4.withColumn("words",split(col("words_temp3"),"\\[")).drop("words_temp3") 
val df6 = df5.withColumn("num_words", size($"words"))  
val df7 = df6.withColumn("v2", explode($"words"))
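
// Tracing row "A" through the chain above (illustrative):
//   v           = "[foo][customerid,shopid][Donald,Trump,Esq][single]"
//   words_temp  = "[foo[customerid,shopid[Donald,Trump,Esq[single"      (strip "]")
//   words_temp2 = "[foo[customerid[shopid[Donald[Trump[Esq[single"      (turn "," into "[")
//   words_temp3 = "foo[customerid[shopid[Donald[Trump[Esq[single"       (drop the leading "[")
//   words       = [foo, customerid, shopid, Donald, Trump, Esq, single] (split on "[")
// df7 then has one row per (k, word) pair; the cnt and num_words columns
// are computed along the way but not used later.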

// Convert to Array of sorts via group by
val df8 = df7.groupBy("k")
            .agg(collect_list("v2"))
// Convert to an RDD of tuples, then find each word's position to generate the column names; that is what makes pivot usable here
val rdd = df8.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df9 = rdd3.toDF("k", "v")
val df10 = df9.withColumn("vn", explode($"v"))
val df11 = df10.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")
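
// At this point df11 holds one row per word with a generated column name, e.g.:
//   ("A", "foo", "c0"), ("A", "customerid", "c1"), ..., ("B", "foo", "c0")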

// Final manipulation
val result = df11.groupBy("k")
                 .pivot("c")
                 .agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
 result.show(100,false)

which in this case returns:

+---+---+----------+------+------+-----+----+------+
|k  |c0 |c1        |c2    |c3    |c4   |c5  |c6    |
+---+---+----------+------+------+-----+----+------+
|B  |foo|null      |null  |null  |null |null|null  |
|A  |foo|customerid|shopid|Donald|Trump|Esq |single|
+---+---+----------+------+------+-----+----+------+
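
For reference, on Spark 2.1+ the word/index pairing can be done without the round-trip through RDDs by using posexplode, which emits each element's position alongside it. A minimal sketch under that assumption, not part of the original answer, reusing df7 from above:

// Hypothetical alternative: posexplode yields (pos, word) pairs directly,
// so the zipWithIndex step on the RDD is not needed.
// Caveat (applies to the original code as well): collect_list does not
// guarantee element order after a shuffle.
val alt = df7.groupBy("k").agg(collect_list("v2").as("words"))
  .select($"k", posexplode($"words").as(Seq("pos", "v")))
  .withColumn("c", concat(lit("c"), $"pos"))
  .groupBy("k").pivot("c").agg(first($"v"))
alt.show(100, false)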

Answer 1 (score: 0)

UPDATE: this answer was based on the original title, which stated an array of words. See the other answer for the string case.

A few notes if you are new to this. It could also be done with Datasets and map; here is a solution using DataFrames and RDDs (a rough Dataset sketch follows the output at the end). I may look into a purely Dataset-based approach in the future, but this works for sure and at scale.

// Can amalgamate more steps

import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray

// Set up data
val df = spark.sparkContext.parallelize(Seq(
    ("A", Array(Array("foo", "bar"), Array("Donald", "Trump","Esq"), Array("single"))),
    ("B", Array(Array("foo2", "bar2"), Array("single2"))),
    ("C", Array(Array("foo3", "bar3", "x", "y", "z")))
     )).toDF("k", "v")
// flatten via 2x explode; could be done more elegantly with a def or UDF, but keeping it simple here
val df2 = df.withColumn("v2", explode($"v"))
val df3 = df2.withColumn("v3", explode($"v2"))
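
// After the two explodes, df3 carries one row per word, e.g. for key "A":
//   v3 = "foo", "bar", "Donald", "Trump", "Esq", "single"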
// Convert to Array of sorts via group by
val df4 = df3.groupBy("k")
            .agg(collect_list("v3"))
// Convert to an RDD of tuples, then find each word's position to generate the column names; that is what makes pivot usable here
val rdd = df4.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df5 = rdd3.toDF("k", "v")
val df6 = df5.withColumn("vn", explode($"v"))
val df7 = df6.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")

// Final manipulation
val result = df7.groupBy("k")
               .pivot("c")
               .agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
result.show(100,false)

which returns, in the correct column order:

+---+----+----+-------+-----+----+------+
|k  |c0  |c1  |c2     |c3   |c4  |c5    |
+---+----+----+-------+-----+----+------+
|B  |foo2|bar2|single2|null |null|null  |
|C  |foo3|bar3|x      |y    |z   |null  |
|A  |foo |bar |Donald |Trump|Esq |single|
+---+----+----+-------+-----+----+------+
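
For the Dataset-and-map route mentioned above, here is a minimal sketch (an illustration only, not the author's code; it assumes spark.implicits._ is in scope and reuses the df defined above):

// Hypothetical Dataset version: flatten the nested arrays and index the words
// in plain Scala, then pivot exactly as before.
case class Word(k: String, c: String, v: String)

val wordsDS = df.as[(String, Seq[Seq[String]])].flatMap { case (k, nested) =>
  nested.flatten.zipWithIndex.map { case (w, i) => Word(k, s"c$i", w) }
}
wordsDS.groupBy("k").pivot("c").agg(first($"v")).show(100, false)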