How to explode rows with a hierarchy-type column using Spark SQL

Asked: 2019-01-25 22:36:10

Tags: scala apache-spark apache-spark-sql

Source data table

/Company/Engineering/DataTeam 45

/Company/Engineering/Mobile 50

Output data table

/Company 45

/Company/Engineering 45

/Company/Engineering/DataTeam 45

/Company 50

/Company/Engineering 50

/Company/Engineering/Mobile 50


So my question is essentially: looking at the source and output data tables above, how can this source-to-output transformation be implemented using Spark SQL?

I cannot use a UDF directly, because a UDF cannot return rows. My next idea was to build a DataFrame in memory and append rows to it via a UDF, but the problem with that approach is that the DataFrame would exceed a billion rows, and I am not sure that is feasible.

Any suggestions on how to achieve this with Spark SQL?

1 Answer:

Answer 0 (score: 0)

In a UDF you can return a Seq[String], which can then be exploded to produce multiple rows.

Check this out:

scala> val df = Seq(("/Company/Engineering/DataTeam",45),("/Company/Engineering/Mobile",50)).toDF("a","b")
df: org.apache.spark.sql.DataFrame = [a: string, b: int]

scala> df.show(false)
+-----------------------------+---+
|a                            |b  |
+-----------------------------+---+
|/Company/Engineering/DataTeam|45 |
|/Company/Engineering/Mobile  |50 |
+-----------------------------+---+

scala> val udf_hier_str = udf( (x:String) => x.split('/').drop(1).scanLeft(""){(acc, next) => acc + "/" + next}.drop(1) )
udf_hier_str: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(StringType)))
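The UDF body is just ordinary collection code, so its prefix-building logic can be checked in isolation. A minimal standalone sketch, using only plain Scala (no Spark needed):

```scala
// split('/') on a leading-slash path yields an empty first element, hence
// the first drop(1); scanLeft then accumulates each successive prefix, and
// the second drop(1) discards scanLeft's empty seed value.
val path = "/Company/Engineering/DataTeam"
val parts = path.split('/').drop(1) // Array(Company, Engineering, DataTeam)
val prefixes = parts.scanLeft("")((acc, next) => acc + "/" + next).drop(1).toSeq
// prefixes: /Company, /Company/Engineering, /Company/Engineering/DataTeam
```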

scala> df.withColumn("gen_hier",explode(udf_hier_str('a))).show(false)
+-----------------------------+---+-----------------------------+
|a                            |b  |gen_hier                     |
+-----------------------------+---+-----------------------------+
|/Company/Engineering/DataTeam|45 |/Company                     |
|/Company/Engineering/DataTeam|45 |/Company/Engineering         |
|/Company/Engineering/DataTeam|45 |/Company/Engineering/DataTeam|
|/Company/Engineering/Mobile  |50 |/Company                     |
|/Company/Engineering/Mobile  |50 |/Company/Engineering         |
|/Company/Engineering/Mobile  |50 |/Company/Engineering/Mobile  |
+-----------------------------+---+-----------------------------+
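To produce the exact two-column output table from the question, select the exploded column in place of the original path, e.g. `df.select(explode(udf_hier_str('a)).as("a"), 'b)`. The same transformation can be sketched on plain Scala collections, where `flatMap` plays the role of `explode` (a simulation of the logic only, not Spark execution):

```scala
// Plain-collections simulation of the explode: flatMap pairs every
// generated prefix with its row's value, mirroring the output data table.
val rows = Seq(("/Company/Engineering/DataTeam", 45), ("/Company/Engineering/Mobile", 50))
val hier = (p: String) =>
  p.split('/').drop(1).scanLeft("")((acc, n) => acc + "/" + n).drop(1).toSeq
val exploded = rows.flatMap { case (p, v) => hier(p).map(h => (h, v)) }
// exploded holds six (path, value) pairs, one per hierarchy level per row
```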

