I need to push some data from a Spark DataFrame into an ElasticSearch index.
My DataFrame:
scala> source.printSchema()
root
|-- dialogue_id: string (nullable = true)
|-- dialogue_number: string (nullable = true)
|-- dialogue_text: string (nullable = true)
scala> df_echanges.show
+----------------------+-----------------------+----------------------------+
| dialogue_id| dialogue_number| dialogue_text|
+----------------------+-----------------------+----------------------------+
| DIAL1| 1| Hello !|
| DIAL1| 2| Hi !|
| DIAL1| 3| How are you ?|
| DIAL1| 4| Fine and you ?|
| DIAL1| 5| Fine !|
| DIAL2| 1| Hello ! How are you ?|
| DIAL2| 2| Fine !|
+----------------------+-----------------------+----------------------------+
My destination is an ES index whose "dialogue" field is nested:
{
  "mappings": {
    "dialogues": {
      "properties": {
        "dialogue_id": {
          "type": "string"
        },
        "dialogue": {
          "type": "nested",
          "properties": {
            "dialogue_number": {
              "type": "string"
            },
            "dialogue_text": {
              "type": "string"
            }
          }
        }
      }
    }
  }
}
So I need to transform my DataFrame into:
scala> dest.printSchema()
root
|-- dialogue_id: string (nullable = true)
|-- dialogue: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dialogue_number: string (nullable = true)
| | |-- dialogue_text: string (nullable = true)
How can I do this?
Thanks!
Geoffrey
Answer 0 (score: 0)
I guess the easiest way is to escape into the world of RDD transformations for a while. Once the transformation is done, you can convert back to a DataFrame and do whatever you want with the data in your destination schema.
I would try the following:
First make sure some case classes representing the destination schema are declared (this is the easiest way, although the snake_case names of the case class members do violate the standard naming conventions of Scala code):
case class DialogueElement(dialogue_id: String, dialogue: Array[InnerDialogueElement])
case class InnerDialogueElement(dialogue_number: String, dialogue_text: String)
Then run the following transformations (mostly using the RDD API):
// Transform to RDD and group by first column (= index 0)
val groupedRdd = source.rdd.groupBy(row => row.getString(0))
// Map the grouped values into a case class that represents
// your inner dialogue elements
val mappedInnerElementsRdd = groupedRdd
.mapValues(group => group.map(r => InnerDialogueElement(r.getString(1), r.getString(2))))
// Map everything into a case class that fully represents your destination schema
val finalRdd = mappedInnerElementsRdd.map({ case (dialogueId, innerElements) => DialogueElement(dialogueId, innerElements.toArray) })
import sqlContext.implicits._ // needed for calling toDF()
val finalDF = finalRdd.toDF()
finalDF.printSchema() // should print your desired schema
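To check the grouping logic without a Spark session, the same pipeline can be mirrored on plain Scala collections, with rows modeled as tuples (the case classes are repeated here so the snippet stands alone):

```scala
// Same case classes as above, repeated so this snippet is self-contained
case class InnerDialogueElement(dialogue_number: String, dialogue_text: String)
case class DialogueElement(dialogue_id: String, dialogue: Array[InnerDialogueElement])

// Rows modeled as (dialogue_id, dialogue_number, dialogue_text) tuples
val rows = Seq(
  ("DIAL1", "1", "Hello !"),
  ("DIAL1", "2", "Hi !"),
  ("DIAL2", "1", "Hello ! How are you ?"),
  ("DIAL2", "2", "Fine !")
)

// groupBy on the first column, then map each group into the destination shape
val result = rows
  .groupBy(_._1)
  .map { case (id, group) =>
    DialogueElement(id, group.map(r => InnerDialogueElement(r._2, r._3)).toArray)
  }
  .toList
  .sortBy(_.dialogue_id)

result.foreach(d => println(s"${d.dialogue_id}: ${d.dialogue.length} inner elements"))
// DIAL1: 2 inner elements
// DIAL2: 2 inner elements
```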
Instead of declaring the above case classes with the exact field names (e.g. "dialogue_id"), you can also name those members differently and convert manually from the RDD back to a DataFrame using:
sqlContext.createDataFrame(yourRDD, yourSchemaContainingTheFieldNamesYouWantToHave)
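A sketch of that manual route, written against the newer SparkSession entry point (Spark 2.x+) rather than the answer's sqlContext; the explicit schema and the sample row below are illustrative, not from the original answer:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[1]").appName("schema-demo").getOrCreate()

// Explicit schema carrying the field names ES expects,
// independent of whatever the case class members are called
val innerStruct = StructType(Seq(
  StructField("dialogue_number", StringType, nullable = true),
  StructField("dialogue_text", StringType, nullable = true)
))
val destSchema = StructType(Seq(
  StructField("dialogue_id", StringType, nullable = true),
  StructField("dialogue", ArrayType(innerStruct, containsNull = true), nullable = true)
))

// One illustrative row: an id plus its array of (number, text) structs
val rdd = spark.sparkContext.parallelize(Seq(
  Row("DIAL1", Seq(Row("1", "Hello !"), Row("2", "Hi !")))
))

val dest = spark.createDataFrame(rdd, destSchema)
dest.printSchema()
```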
Hope this helps :)
PS: using groupBy with RDDs means that each group has to fit entirely into main memory!
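If escaping to RDDs is not required, the same nesting can also be expressed directly in the DataFrame API with groupBy plus collect_list over a struct (shown here with the Spark 2.x+ SparkSession API; each group is still materialized in memory, so the caveat above applies just the same). A self-contained sketch with sample data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

val spark = SparkSession.builder().master("local[1]").appName("nesting-demo").getOrCreate()
import spark.implicits._

val source = Seq(
  ("DIAL1", "1", "Hello !"),
  ("DIAL1", "2", "Hi !"),
  ("DIAL2", "1", "Hello ! How are you ?"),
  ("DIAL2", "2", "Fine !")
).toDF("dialogue_id", "dialogue_number", "dialogue_text")

// Collapse the (dialogue_number, dialogue_text) pairs of each dialogue_id
// into one array-of-structs column named "dialogue"
val dest = source
  .groupBy("dialogue_id")
  .agg(collect_list(struct("dialogue_number", "dialogue_text")).as("dialogue"))

dest.printSchema()
val collected = dest.collect().sortBy(_.getString(0))
```

Note that collect_list gives no ordering guarantee for the inner elements, so if the order of dialogue_number matters, the array may need an explicit sort (e.g. with sort_array).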