How can I do this?

Time: 2020-06-11 03:30:08

Tags: apache-spark pyspark apache-spark-sql

Environment: Spark 2.4.5

source.json:

{
    "a_key": "1",
    "a_pro": "2",
    "a_con": "3",
    "b_key": "4",
    "b_pro": "5",
    "b_con": "6",
    "c_key": "7",
    "c_pro": "8",
    "c_con": "9",
    ...
}

target.json:

{
    "factors": [
        {
            "name": "a",
            "key": "1",
            "pros": "2",
            "cons": "3"
        },
        {
            "name": "b",
            "key": "4",
            "pros": "5",
            "cons": "6"
        },
        {
            "name": "c",
            "key": "7",
            "pros": "8",
            "cons": "9"
        },
        ...
    ]
}

As you can see, the target "name" is part of the source key. For example, "a" is the "name" for "a_key", "a_pro", and "a_con". I don't really know how to extract that value from the keys and then do some kind of "group by" transformation. Can anyone give me some advice?

3 Answers:

Answer 0 (score: 1)

IIUC, first create a dataframe from the input JSON:

json_data = {
    "a_key": "1",
    "a_pro": "2",
    "a_con": "3",
    "b_key": "4",
    "b_pro": "5",
    "b_con": "6",
    "c_key": "7",
    "c_pro": "8",
    "c_con": "9"
}
# Turn each (key, value) pair of the JSON object into a row of a two-column dataframe
df = spark.createDataFrame(list(map(list, json_data.items())), ['key', 'value'])
df.show()

+-----+-----+
|  key|value|
+-----+-----+
|a_key|    1|
|a_pro|    2|
|a_con|    3|
|b_key|    4|
|b_pro|    5|
|b_con|    6|
|c_key|    7|
|c_pro|    8|
|c_con|    9|
+-----+-----+

Now derive the required columns from the existing key column:

import pyspark.sql.functions as f

# Name = first character of the key; Attributes = the suffix after '_' with an 's' appended
df2 = df.withColumn('Name', f.substring('key', 1, 1)).\
         withColumn('Attributes', f.concat(f.split('key', '_')[1], f.lit('s')))
df2.show()
+-----+-----+----+----------+
|  key|value|Name|Attributes|
+-----+-----+----+----------+
|a_key|    1|   a|      keys|
|a_pro|    2|   a|      pros|
|a_con|    3|   a|      cons|
|b_key|    4|   b|      keys|
|b_pro|    5|   b|      pros|
|b_con|    6|   b|      cons|
|c_key|    7|   c|      keys|
|c_pro|    8|   c|      pros|
|c_con|    9|   c|      cons|
+-----+-----+----+----------+
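
A side note: f.substring('key', 1, 1) only works because every name in the sample is a single character. If names could be longer (for example "abc_key"), splitting on the underscore is a safer sketch:

# A sketch for multi-character names: take everything before the underscore
df2 = df.withColumn('Name', f.split('key', '_')[0]).\
         withColumn('Attributes', f.concat(f.split('key', '_')[1], f.lit('s')))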

Now pivot the dataframe and collect the result as a JSON object:

output_json = df2.groupBy('Name').\
                  pivot('Attributes').\
                  agg(f.min('value')).\
                  select(f.collect_list(f.struct('Name','keys','cons','pros')).alias('factors')).\
                  toJSON().collect()

import json
print(json.dumps(json.loads(output_json[0]),indent=4))

{
    "factors": [
        {
            "Name": "c",
            "keys": "7",
            "cons": "9",
            "pros": "8"
        },
        {
            "Name": "b",
            "keys": "4",
            "cons": "6",
            "pros": "5"
        },
        {
            "Name": "a",
            "keys": "1",
            "cons": "3",
            "pros": "2"
        }
    ]
}
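
Note that the field names above ("Name", "keys") differ slightly from the desired target.json ("name", "key"). If an exact match matters, one possible sketch (assuming the same df2 as above) is to alias the struct fields before collecting:

# A sketch, assuming df2 from above: alias the struct fields to match target.json exactly
output_json = df2.groupBy('Name').\
                  pivot('Attributes').\
                  agg(f.min('value')).\
                  select(f.collect_list(f.struct(
                      f.col('Name').alias('name'),
                      f.col('keys').alias('key'),
                      f.col('pros'),
                      f.col('cons'))).alias('factors')).\
                  toJSON().collect()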

Answer 1 (score: 0)

There is no need to involve dataframes for this; some simple string and dictionary manipulation will do:

import json

source = {
    "a_key": "1",
    "a_pro": "2",
    "a_con": "3",
    "b_key": "4",
    "b_pro": "5",
    "b_con": "6",
    "c_key": "7",
    "c_pro": "8",
    "c_con": "9",
}

factors = {}

# Prepare each factor dictionary
for k, v in source.items():
    factor, item = k.split('_')
    d = factors.get(factor, {})
    d[item] = v
    factors[factor] = d

# Prepare result dictionary
target = {
    'factors': []
}

# Move name attribute into dictionary & append
for k, v in factors.items():
    d = v
    d['name'] = k
    target['factors'].append(d)

result = json.dumps(target)
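
Note that this produces item keys "key", "pro", "con" taken straight from the source suffixes, rather than "pros"/"cons" as in target.json. If the exact target field names are needed, a minimal sketch with an explicit suffix-to-field mapping could look like this:

# A sketch: map the source suffixes onto the exact target field names
FIELD_MAP = {'key': 'key', 'pro': 'pros', 'con': 'cons'}

factors = {}
for k, v in source.items():
    factor, item = k.split('_')
    factors.setdefault(factor, {})[FIELD_MAP.get(item, item)] = v

target = {'factors': [dict(attrs, name=name) for name, attrs in factors.items()]}
result = json.dumps(target, indent=4)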

Answer 2 (score: 0)

Your data is a bit unusual, but the following code can help you solve the problem:

source.json:

{
    "a_key": "1",
    "a_pro": "2",
    "a_con": "3",
    "b_key": "4",
    "b_pro": "5",
    "b_con": "6",
    "c_key": "7",
    "c_pro": "8",
    "c_con": "9"
}

Code:

import java.util

import com.google.gson.GsonBuilder
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

// Field names follow target.json, so Gson serializes them as-is
case class Person(name: String, key: String, pros: String, cons: String)
case class Factors(factors: util.List[Person])

val sparkSession = SparkSession.builder()
  .appName("readAndWriteJsonTest")
  .master("local[*]").getOrCreate()

// Without the multiLine option Spark parses each line separately; since the object
// spans multiple lines, every row ends up as a single raw-string column
val dataFrame = sparkSession.read.format("json").load("R:\\data\\source.json")

// println(dataFrame.rdd.count())

// Keep only the "key": "value" lines and split them into (name, (attribute, value))
val mapRdd: RDD[(String, (String, String))] = dataFrame.rdd.map(_.getString(0))
  .filter(_.split("\\:").length == 2)
  .map(line => {
    val Array(key1, value1) = line.split("\\:")
    val Array(name, key2) = key1.replace("\"", "").trim.split("\\_")
    val value2 = value1.replace("\"", "").replace(",", "").trim
    (name, (key2, value2))
  })

// mapRdd.collect().foreach(println)

// Group the (attribute, value) pairs per name
val initValue = new ArrayBuffer[(String, String)]

val function1 = (buffer1: ArrayBuffer[(String, String)], t1: (String, String)) => buffer1.+=(t1)
val function2 = (buffer1: ArrayBuffer[(String, String)], buffer2: ArrayBuffer[(String, String)]) => buffer1.++(buffer2)

val aggRdd: RDD[(String, ArrayBuffer[(String, String)])] = mapRdd.aggregateByKey(initValue)(function1, function2)

// aggRdd.collect().foreach(println)

// Build one Person per name (assumes the key/pro/con lines appear in source order)
val persons: util.List[Person] = aggRdd.map(line => {
  val name = line._1
  val keyValue = line._2(0)._2
  val prosValue = line._2(1)._2
  val consValue = line._2(2)._2

  Person(name, keyValue, prosValue, consValue)
}).collect().toList.asJava

// Serialize to the target JSON structure with Gson
val gson = new GsonBuilder().create

val factors = Factors(persons)

val targetJsonStr = gson.toJson(factors)

println(targetJsonStr)

target.json:

{
  "factors": [
    {
      "name": "a",
      "key": "1",
      "pros": "2",
      "cons": "3"
    },
    {
      "name": "b",
      "key": "4",
      "pros": "5",
      "cons": "6"
    },
    {
      "name": "c",
      "key": "7",
      "pros": "8",
      "cons": "9"
    }
  ]
}

You can put the code above into a test method and run it to see the desired result.

  @Test
  def readAndSaveJsonTest: Unit = {}

Hope it helps.