I am trying to create a DataFrame to feed into a function as part of a unit test. Suppose I have the following:
val myDf = sparkSession.sqlContext.createDataFrame(
sparkSession.sparkContext.parallelize(Seq(
Row(Some(Seq(MyObject(1024, 100001D), MyObject(1, -1D)))))),
StructType(List(
StructField("myList", ArrayType[???], true)
)))
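MyObject is a case class.
I don't know what to put for the element type. Any suggestions? I have tried almost every ArrayType combination I can think of.
I'm looking for a DataFrame that looks like this: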
+--------------------+
| myList |
+--------------------+
| [1024, 100001] |
| [1, -1] |
+--------------------+
Answer 0 (score: 2)
Go the other way around: build the data first and let Spark infer the schema.
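// Assumes a SparkSession named `spark` is in scope, as in spark-shell.
import spark.implicits._ // provides toDS() on local collections
val s = Seq(Array(1024, 100001D), Array(1, -1D)).toDS().toDF("myList")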
println(s.schema)
s.printSchema
s.show
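Your schema looks like the one below. The element type is DoubleType because 100001D and -1D are doubles, so the Int literals next to them are widened to Double.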
StructType(StructField(myList,ArrayType(DoubleType,false),true))
This gives the output you need:
root
|-- myList: array (nullable = true)
| |-- element: double (containsNull = false)
+------------------+
| myList|
+------------------+
|[1024.0, 100001.0]|
| [1.0, -1.0]|
+------------------+
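If you would rather keep the createDataFrame call from the question, the inferred schema above can be written out by hand. The following is only a sketch under a few assumptions: it creates its own local SparkSession (the name `spark` and the app name are placeholders), and it stores the values as plain doubles rather than MyObject instances.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, DoubleType, StructField, StructType}

// Assumption: a local session is acceptable for a unit test.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("array-schema-sketch") // placeholder name
  .getOrCreate()

// Same schema that Spark inferred above: one array<double> column.
val schema = StructType(Seq(
  StructField("myList", ArrayType(DoubleType, containsNull = false), nullable = true)
))

// One Row per DataFrame row; the array column is passed as a Seq[Double].
val rows = Seq(
  Row(Seq(1024D, 100001D)),
  Row(Seq(1D, -1D))
)

val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
df.printSchema()
df.show()
```

Note that ArrayType takes its element type as a constructor argument, ArrayType(DoubleType), not as a type parameter in square brackets.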
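Alternatively, you can do this:
import org.apache.spark.sql.functions.struct // struct() combines columns into one struct column
case class MyObject(a:Int , b:Double)
val s = Seq(MyObject(1024, 100001D), MyObject(1, -1D)).toDS()
.select(struct($"a",$"b").as[MyObject] as "myList")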
println(s.schema)
s.printSchema
s.show
Result:
//schema :
StructType(StructField(myList,StructType(StructField(a,IntegerType,false), StructField(b,DoubleType,false)),false))
root
|-- myList: struct (nullable = false)
| |-- a: integer (nullable = false)
| |-- b: double (nullable = false)
+----------------+
| myList|
+----------------+
|[1024, 100001.0]|
| [1, -1.0]|
+----------------+
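The struct variant above yields one struct per row. If what you want is the question's original shape, a single array column whose elements are MyObject values, the element type can be a StructType. This is a hedged sketch, not the answerer's code: it assumes a local SparkSession, derives the element schema with Encoders.product, and passes the nested values as Row objects because createDataFrame with an explicit schema expects Rows rather than case class instances.

```scala
import org.apache.spark.sql.{Encoders, Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, StructField, StructType}

case class MyObject(a: Int, b: Double)

// Assumption: a local session is acceptable for a unit test.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("array-of-struct-sketch") // placeholder name
  .getOrCreate()

// Derive struct<a: int, b: double> from the case class instead of typing it by hand.
val elementSchema: StructType = Encoders.product[MyObject].schema

val schema = StructType(Seq(
  StructField("myList", ArrayType(elementSchema, containsNull = false), nullable = true)
))

// A single row whose array holds two structs; each struct is a nested Row.
val rows = Seq(
  Row(Seq(Row(1024, 100001D), Row(1, -1D)))
)

val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
df.printSchema()
df.show(false)
```

This is the ArrayType(...) argument the question was missing: a StructType that describes the fields of MyObject.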
Answer 1 (score: 0)
Try this:
scala> case class MyObject(prop1:Int, prop2:Double)
defined class MyObject
scala> val df = Seq((1024, 100001D), (1, -1D)).toDF("prop1","prop2").select(struct($"prop1",$"prop2").as[MyObject] as "myList")
df: org.apache.spark.sql.DataFrame = [myList: struct<prop1: int, prop2: double>]
scala> df.show(false)
+----------------+
|myList |
+----------------+
|[1024, 100001.0]|
|[1, -1.0] |
+----------------+
scala> df.printSchema
root
|-- myList: struct (nullable = false)
| |-- prop1: integer (nullable = false)
| |-- prop2: double (nullable = false)
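For a unit-test fixture it may be simpler to skip Row and StructType altogether and let the encoder build an array-of-struct column from the case class. The sketch below is an illustration only: it assumes a local SparkSession and wraps the list in Tuple1 so that the whole Seq becomes one column of one row.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, StructType}

case class MyObject(prop1: Int, prop2: Double)

// Assumption: a local session is acceptable for a unit test.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("fixture-sketch") // placeholder name
  .getOrCreate()
import spark.implicits._

// Tuple1 makes the Seq a single column; toDF names that column.
val df = Seq(
  Tuple1(Seq(MyObject(1024, 100001D), MyObject(1, -1D)))
).toDF("myList")

df.printSchema()
df.show(false)
```

Here the column type is inferred as an array of struct<prop1: int, prop2: double>, so the test data keeps the MyObject field names.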