Question

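I am using Spark Structured Streaming to consume streaming data from Kafka. I need to aggregate various metrics (say, 6 metrics) and write them out as Parquet files. I see a huge delay between metric 1 and metric 2. For example, if metric 1 was updated recently, metric 2 is an hour old. How can I improve performance so that these run in parallel?

Also, the Parquet files I write are meant to be read by another application. How can I continuously purge old Parquet data? Should I use a separate application for that?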

Dataset<String> lines_topic = spark.readStream().format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
Dataset<Row> data = lines_topic.select(functions.from_json(lines_topic.col("value"), schema).alias(topics));
data.withWatermark(---).groupBy(----).count();
query = data.writeStream().format("parquet")
    .option("path", ---)
    .option("truncate", "false")
    .outputMode("append")
    .option("checkpointLocation", checkpointFile)
    .start();

Answer 1

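Since each query runs independently of the others, you need to make sure each query is given enough resources to execute. What may be happening is that, if you are using the default FIFO scheduler, all triggers run sequentially rather than in parallel.

As described in the linked reference, you should set the FAIR scheduler on the SparkContext and then define a new pool for each query.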

// Run streaming query1 in scheduler pool1
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df.writeStream.queryName("query1").format("parquet").start(path1)

// Run streaming query2 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query2").format("orc").start(path2)
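
The pool assignments above only take effect if the application itself runs in FAIR mode. A minimal configuration sketch, assuming the settings go into spark-defaults.conf (they can equally be passed with --conf on spark-submit); the allocation file path is a placeholder, not something from the original post:

```properties
# Switch the in-application scheduler from the default FIFO to FAIR
spark.scheduler.mode             FAIR

# Optional: per-pool settings (weight, minShare, schedulingMode).
# Placeholder path; pools not declared in this file get default settings.
# spark.scheduler.allocation.file  /path/to/fairscheduler.xml
```

Without an allocation file, pool1 and pool2 are still created on first use with default settings, which is usually enough to stop one query's triggers from queuing behind another's.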

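As for purging old Parquet files, you may want to partition the data and then periodically delete old partitions as needed. Otherwise, if all data is written to the same output path, you cannot delete just some of the rows.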
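
The periodic partition cleanup could be sketched as a small standalone job like the one below. It is only an illustration under several assumptions that are not in the original post: the stream is written with partitionBy("date") so the output directory contains subdirectories named date=yyyy-MM-dd, the output lives on a local or mounted filesystem (HDFS or S3 would need the Hadoop FileSystem API instead), and the class name PartitionPurger is made up.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: removes date partitions older than a cutoff date.
final class PartitionPurger {
    // Assumed partition column; partitionBy("date") yields directories named date=<value>.
    private static final String PREFIX = "date=";

    /** Deletes every date=yyyy-MM-dd directory under root whose date is before cutoff. */
    static List<Path> purgeOlderThan(Path root, LocalDate cutoff) throws IOException {
        // Collect candidates first so nothing is deleted while the directory is being listed.
        List<Path> expired = new ArrayList<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(root, PREFIX + "*")) {
            for (Path dir : dirs) {
                if (!Files.isDirectory(dir)) {
                    continue;
                }
                String value = dir.getFileName().toString().substring(PREFIX.length());
                try {
                    if (LocalDate.parse(value).isBefore(cutoff)) {
                        expired.add(dir);
                    }
                } catch (DateTimeParseException e) {
                    // Not a yyyy-MM-dd partition; leave it untouched.
                }
            }
        }
        for (Path dir : expired) {
            deleteRecursively(dir);
        }
        return expired;
    }

    private static void deleteRecursively(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts files before the directories that contain them.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PartitionPurger <outputRoot> <retentionDays>");
            return;
        }
        LocalDate cutoff = LocalDate.now().minusDays(Long.parseLong(args[1]));
        for (Path p : purgeOlderThan(Paths.get(args[0]), cutoff)) {
            System.out.println("deleted " + p);
        }
    }
}
```

Run on a schedule (cron or similar) as a separate process, this keeps the cleanup out of the streaming application. Directories that do not match the date= pattern, such as the sink's _spark_metadata, are left alone; it is worth checking how the downstream reader behaves once files listed in that metadata log no longer exist.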

Structured Streaming performance and purging Parquet files