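I ran some Scala code that aggregates data and prints the result to the console. Unfortunately, after the groupBy operation I get null values. Current output:
| Id | Date | Count |
| null | null | 35471 |
I have narrowed the problem down to the grouping key: whenever I group by a non-numeric column, the output contains nulls. Any suggestions are welcome; I have already spent a lot of time looking for a solution.
My code: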
import org.apache.spark.sql.functions.{to_date, unix_timestamp}
import org.apache.spark.sql.types._

// create schema
val sensorsSchema = new StructType()
  .add("SensorId", IntegerType)
  .add("Timestamp", TimestampType)
  .add("Value", DoubleType)
  .add("State", StringType)
// read streaming data from csv...
// aggregate streaming data
val streamAgg = streamIn
  .withColumn("Date", to_date(unix_timestamp($"Timestamp", "dd/MM/yyyy").cast(TimestampType)))
  .groupBy("SensorId", "Date")
  .count()
// write streaming data...
Answer 0 (score: 0)
I changed the code, and it now works:
/****************************************
 * STREAMING APP
 * 1.0 beta
 *****************************************
 * read data from csv (local)
 * and save as parquet (local)
 ****************************************/
package tk.streaming

import org.apache.spark.SparkConf
import org.apache.spark.sql._
// import org.apache.spark.sql.functions._

// Timestamp is read as a String here (it was TimestampType in the question)
case class SensorsSchema(SensorId: Int, Timestamp: String, Value: Double, State: String, OperatorId: Int)

object Runner {
  def main(args: Array[String]): Unit = {
    // Configuration parameters (to create spark session and contexts)
    val appName = "StreamingApp" // app name
    val master = "local[*]" // master configuration
    val dataDir = "/home/usr_spark/Projects/SparkStreaming/data"
    val refreshInterval = 30 // seconds

    // initialize context
    val conf = new SparkConf().setMaster(master).setAppName(appName)
    val spark = SparkSession.builder.config(conf).getOrCreate()
    import spark.implicits._

    // TODO change file source to Kafka (must)
    // read streaming data
    val sensorsSchema = Encoders.product[SensorsSchema].schema
    val streamIn = spark.readStream
      .format("csv")
      .schema(sensorsSchema)
      .load(dataDir + "/input")
      .select("SensorId", "Timestamp", "State", "Value") // remove "OperatorId" column

    // TODO save result in S3 (nice to have)
    // write streaming data
    import org.apache.spark.sql.streaming.Trigger
    val streamOut = streamIn.writeStream
      .queryName("streamingOutput")
      .format("parquet")
      .option("checkpointLocation", dataDir + "/output/checkpoint")
      .option("path", dataDir + "/output")
      .start() // start the streaming query
    streamOut.awaitTermination() // block until the streaming query terminates
  }
}
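The code above only copies the stream to parquet and no longer contains the aggregation from the question. As a rough sketch of how the per-sensor, per-day count might look once Timestamp is read as a String: this assumes the values are date strings in the dd/MM/yyyy pattern used in the question, and it runs on a small batch DataFrame with made-up rows rather than on the stream. The names AggregationSketch and countPerSensorAndDay are hypothetical and not part of the original code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, to_date}

object AggregationSketch {
  // Hypothetical helper: parses the String column "Timestamp" into a date
  // (assumed pattern dd/MM/yyyy) and counts rows per sensor and day.
  def countPerSensorAndDay(sensors: DataFrame): DataFrame =
    sensors
      .withColumn("Date", to_date(col("Timestamp"), "dd/MM/yyyy"))
      .groupBy("SensorId", "Date")
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("AggregationSketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for the CSV input
    val sensors = Seq(
      (1, "01/02/2019", 20.5, "OK"),
      (1, "01/02/2019", 21.0, "OK"),
      (2, "02/02/2019", 19.0, "FAIL")
    ).toDF("SensorId", "Timestamp", "Value", "State")

    countPerSensorAndDay(sensors).orderBy("SensorId", "Date").show()
    spark.stop()
  }
}
```

The same transformation could probably be applied to streamIn, but an aggregated stream cannot simply be written to the parquet sink as above: it would need a sink and output mode that support aggregations, for example the console sink with outputMode("complete").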