Question

I have a PySpark DataFrame with n rows, and each row has a single column named result.

The result column contains JSON.


df.show():

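Now I want to count how many records (rows) have the attributes element and how many do not.

I tried using the array_contains, filter, and explode functions in Spark, but did not get the result I wanted.

Any suggestions?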

Answer 1

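import org.apache.spark.sql.functions._

// Extract the "attributes" field from the JSON string; rows without it yield null.
// col("result") is used instead of $"result" so that spark.implicits._ is not required.
df.select(get_json_object(col("result"), "$.attributes").alias("attributes"))
  .filter(col("attributes").isNotNull)
  .count()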

This logic gives the count of records in which attributes is present.

For reference, see https://docs.databricks.com/spark/latest/dataframes-datasets/complex-nested-data.html

Another option, if your input is a JSON file:

val df = spark.read.json("path of json file")
df.filter(col("attributes").isNotNull).count()

A similar API is available in Python.

Answer 2

After a lot of effort, the simple logic below worked.

total_count = old_df.count()
new_df = old_df.filter(old_df.result.contains("attributes"))
success_count = new_df.count()
failure_count = total_count - success_count
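One caveat worth keeping in mind: contains("attributes") is a plain substring match on the raw string, so it would also count a row where the word only appears inside some value. The sketch below uses plain Python (no Spark) and made-up sample strings purely to show the difference between a substring check and a check for a top-level key; has_attributes_key is a hypothetical helper, not part of any library.

```python
import json

# Hypothetical sample values for the result column (not from the original question).
samples = [
    '{"id": 1, "attributes": {"color": "red"}}',  # has the key
    '{"id": 2, "note": "no attributes here"}',    # word appears only inside a value
    '{"id": 3}',                                   # key absent
]

def has_attributes_key(raw):
    """Return True only if raw parses to a JSON object with a top-level "attributes" key."""
    try:
        parsed = json.loads(raw)
    except ValueError:
        return False
    return isinstance(parsed, dict) and "attributes" in parsed

# Substring match, which is what Column.contains does on the raw string.
substring_hits = sum("attributes" in s for s in samples)

# Key-based match, closer to what the get_json_object approach in Answer 1 checks.
key_hits = sum(has_attributes_key(s) for s in samples)

print(substring_hits, key_hits)
```

If the data can never contain the word outside the key name, the substring filter is fine and cheaper; otherwise the get_json_object filter from Answer 1 is the safer choice.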
