How do I count the number of rows in a JSON file?

Time: 2018-09-30 20:03:33

Tags: scala apache-spark

My JSON file, shown below, contains six rows:

[
    {"events":[[{"v":"INPUT","n":"type"},{"v":"2016-08-24 14:23:12 EST","n":"est"}]],
     "apps":[],
     "agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},
     "header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"12","n":"cpu"},{"v":"154665","n":"seq"},{"v":"2016-08-24 14:23:17 EST","n":"est"}]
    },
{"events":[[{"v":"INPUT","n":"type"},{"v":"2016-08-24 14:23:14 EST","n":"est"}]],"apps":[],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"5","n":"cpu"},{"v":"154666","n":"seq"},{"v":"2016-08-24 14:23:23 EST","n":"est"}]},
{"events":[[{"v":"LOGOFF","n":"type"},{"v":"2016-08-24 14:24:04 EST","n":"est"}]],"apps":[],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"0","n":"cpu"},{"v":"154667","n":"seq"},{"v":"2016-08-24 14:24:05 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"O","n":"state"},{"v":"5376","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"29","n":"cpu"},{"v":"154668","n":"seq"},{"v":"2016-09-25 16:57:24 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"F","n":"state"},{"v":"5588","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"16","n":"cpu"},{"v":"154669","n":"seq"},{"v":"2016-09-25 16:57:30 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"F","n":"state"},{"v":"5588","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"17","n":"cpu"},{"v":"154670","n":"seq"},{"v":"2016-09-25 16:57:36 EST","n":"est"}]}
]

The JSON corresponds to records like the following:

JSON
0
1
2
3
4
5

Required output:

Count
6

2 answers:

Answer 0 (score: 1)

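Since you are working in Spark, you need to turn the JSON into a Dataset and then run the appropriate operation on it. Below I describe the workflow from JSON to Dataset, with an example for each step. I think answering this way is more useful, because you can see the steps and then decide what to do with the information.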

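  1. Input data: You have the JSON, and that is the data to start from. Next, decide which fields matter. In most cases the count is only a small part of the job, and you do not want to load fields you will never use.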

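  2. Create case classes: Case classes let you map the input data onto typed objects. To keep things simple, suppose I have a Doctor who belongs to a Department, and the data arrives as JSON. I could use the following case classes: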

    case class Department(name: String, address: String)
    case class Doctor(name: String, department: Department)
    

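    As the code above shows, I modeled the data from the bottom up: the nested type first, then the type that contains it. Your JSON has many fields, such as v, whose meaning I do not understand, so take care not to mix them up.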

  3. Get a Dataset: The code below reads the JSON and maps each record to the case class we defined:

    spark.read.json("doctorsData.json").as[Doctor]
    

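    Two notes. spark is a SparkSession instance that you need to create yourself; it is named spark here, but any name works. You also need import spark.implicits._ so that as[Doctor] can find an encoder for the case class.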

  4. You are in business!: You now have a Dataset and are in the Spark world. Just call count() on the Dataset. The following method shows how:

    def recordsCount(myDataset: Dataset[Doctor]): Long = myDataset.count()
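
Putting the four steps together, a minimal sketch might look like the following. It is not tailored to the JSON in the question: the Doctor and Department classes, the file name doctorsData.json, the object name CountDoctors, and the local[*] master are all assumptions carried over from the example or chosen for illustration. It also assumes the file is a single JSON array (as in the question), which is why multiLine is enabled; for a file with one JSON object per line, that option would be dropped.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case classes from the example above; they must be defined
// at the top level (not inside a method) so Spark can derive encoders.
case class Department(name: String, address: String)
case class Doctor(name: String, department: Department)

object CountDoctors {
  // Step 4: counting is a single action on the Dataset.
  def recordsCount(myDataset: Dataset[Doctor]): Long = myDataset.count()

  def main(args: Array[String]): Unit = {
    // Assumption: local mode for experimentation; on a cluster the master
    // is normally supplied by spark-submit instead.
    val spark = SparkSession.builder()
      .appName("count-json-records")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._ // provides the Encoder needed by as[Doctor]

    // Step 3: read the JSON array and map each element to a Doctor.
    // multiLine lets a record (and the enclosing array) span several lines.
    val doctors: Dataset[Doctor] = spark.read
      .option("multiLine", true)
      .json("doctorsData.json") // hypothetical path
      .as[Doctor]

    println(recordsCount(doctors))
    spark.stop()
  }
}
```

If only the count matters, the as[Doctor] step can be skipped and count() called on the DataFrame returned by json(...) directly.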
    

Answer 1 (score: 0)

Using a correctly formatted file of mine with three records, on Spark 2.x, read into a DataFrame/Dataset:

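import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

val df = spark.read
        // multiLine lets one JSON record (or a whole JSON array) span several lines
        .option("multiLine", true)
        // PERMISSIVE (the default mode) keeps malformed records instead of failing the read
        .option("mode", "PERMISSIVE")
        .option("inferSchema", true)
        .json("/FileStore/tables/json_01.txt")

df.select("*").show(false)  // print all rows without truncating long values
df.printSchema()            // print the inferred schema
df.count()                  // number of records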

If you only need the total count, the last line is enough.

res15: Long = 3
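
To get the result in the shape the question asks for (a single column named Count), one option is to aggregate rather than call count() directly. The sketch below is an illustration under assumptions: the helper name countAsColumn and the object name CountAsColumn are hypothetical, and the path is the one used above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, lit}

object CountAsColumn {
  // Hypothetical helper: returns a one-row DataFrame with a single
  // column "Count" holding the number of rows in df.
  def countAsColumn(df: DataFrame): DataFrame =
    df.agg(count(lit(1)).as("Count"))

  def main(args: Array[String]): Unit = {
    // Assumption: local mode; in a notebook, spark already exists.
    val spark = SparkSession.builder()
      .appName("count-as-column")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read
      .option("multiLine", true) // the file is one JSON array spread over lines
      .json("/FileStore/tables/json_01.txt")

    countAsColumn(df).show()
    spark.stop()
  }
}
```

Read the same way, the six-element array from the question should produce a single row with Count equal to 6.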