Denormalize a DataFrame into nested documents

Date: 2016-05-27 13:49:31

Tags: scala elasticsearch apache-spark

I need to push some data from a Spark DataFrame into an Elasticsearch index.

My DataFrame:

scala> source.printSchema()
root
 |-- dialogue_id: string (nullable = true)
 |-- dialogue_number: string (nullable = true)
 |-- dialogue_text: string (nullable = true)
scala> source.show
+----------------------+-----------------------+----------------------------+
|           dialogue_id|        dialogue_number|               dialogue_text|
+----------------------+-----------------------+----------------------------+
|                 DIAL1|                      1|                     Hello !|
|                 DIAL1|                      2|                        Hi !|
|                 DIAL1|                      3|               How are you ?|
|                 DIAL1|                      4|              Fine and you ?|
|                 DIAL1|                      5|                      Fine !|
|                 DIAL2|                      1|       Hello ! How are you ?|
|                 DIAL2|                      2|                      Fine !|
+----------------------+-----------------------+----------------------------+

My destination is an ES index in which the "dialogue" field is nested:

{
   "mappings": {
      "dialogues": {
         "properties": {
            "dialogue_id": {
               "type": "string"
            },
            "dialogue": {
               "type": "nested",
               "properties": {
                  "dialogue_number": {
                     "type": "string"
                  },
                  "dialogue_text": {
                     "type": "string"
                  }
               }
            }
         }
      }
   }
}

So I need to transform my DataFrame into:

scala> dest.printSchema()
root
 |-- dialogue_id: string (nullable = true)
 |-- dialogue: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- dialogue_number: string (nullable = true)
 |    |    |-- dialogue_text: string (nullable = true)

How can I do that?

Thanks!

Geoffrey

1 Answer:

Answer 0 (score: 0)

I think the easiest way here is to escape the transformation world of DataFrames and do the work with RDDs. Once the transformation is done, you can convert back to a DataFrame and do whatever you want with the data in its destination schema.

I would try the following:

First, make sure you have case classes declared that represent the destination schema (this is the easiest way, although the snake_case member names do violate the standard naming conventions for Scala code):

case class DialogueElement(dialogue_id: String, dialogue: Array[InnerDialogueElement])
case class InnerDialogueElement(dialogue_number: String, dialogue_text: String)

Then run the following transformations (mostly using the RDD API):

// Transform to RDD and group by first column (= index 0)
val groupedRdd = source.rdd.groupBy(row => row.getString(0))

// Map the grouped values into a case class that represents
// your inner dialogue elements
val mappedInnerElementsRdd = groupedRdd
  .mapValues(group => group.map(r => InnerDialogueElement(r.getString(1), r.getString(2))))

// Map everything into a case class that fully represents your destination schema
val finalRdd = mappedInnerElementsRdd.map {
  case (dialogueId, innerElements) => DialogueElement(dialogueId, innerElements.toArray)
}

import sqlContext.implicits._ // needed for calling toDF()

val finalDF = finalRdd.toDF()

finalDF.printSchema() // should print your desired schema
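
From there you can index the result directly with the elasticsearch-hadoop connector. A minimal sketch, assuming the elasticsearch-spark artifact is on the classpath and that the index/type is dialogues/dialogues as in your mapping; the node address is an assumption as well:

// Assumes the elasticsearch-spark connector is on the classpath
import org.elasticsearch.spark.sql._

// Index/type name and node address are assumptions based on the mapping above
finalDF.saveToEs("dialogues/dialogues", Map("es.nodes" -> "localhost:9200"))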

Instead of declaring the case classes above with the exact field names (e.g. "dialogue_id"), you could also name their members differently and manually convert the RDD back to a DataFrame using:

sqlContext.createDataFrame(yourRDD, yourSchemaContainingTheFieldNamesYouWantToHave)
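
As an illustration, a minimal sketch of such a manual conversion, reusing finalRdd from above (the explicit schema simply restates the field names the ES mapping expects):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Destination schema with the exact field names from the ES mapping
val destSchema = StructType(Seq(
  StructField("dialogue_id", StringType, nullable = true),
  StructField("dialogue", ArrayType(StructType(Seq(
    StructField("dialogue_number", StringType, nullable = true),
    StructField("dialogue_text", StringType, nullable = true)
  ))), nullable = true)
))

// Convert the case class instances into Rows matching that structure
val rowRdd = finalRdd.map(e =>
  Row(e.dialogue_id, e.dialogue.map(i => Row(i.dialogue_number, i.dialogue_text)).toSeq))

val manualDF = sqlContext.createDataFrame(rowRdd, destSchema)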

Hope this helps :)

PS: Using groupBy on an RDD means that each group has to fit entirely into main memory!
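
As an aside: on Spark 2.0 or later you can stay entirely within the DataFrame API using the collect_list and struct functions. A sketch (note that each group's collected list still has to fit in memory):

import org.apache.spark.sql.functions.{collect_list, struct}

// Group by dialogue id and collect the inner fields as an array of structs
val dest = source
  .groupBy("dialogue_id")
  .agg(collect_list(struct("dialogue_number", "dialogue_text")).as("dialogue"))

dest.printSchema() // matches the destination schema from the question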