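Question

How can I write an RDD[String] to a Parquet file using schema inference?

My Spark Streaming job needs to process an RDD[String], where each String is one line of a CSV file. I don't know the schema in advance, so I need to infer it from the RDD and then write the contents to a Parquet file. If I were reading the CSV file from disk, I could load everything into a DataFrame with schema inference and write it to Parquet right away. In my scenario, however, the starting point is an RDD[String] that I get from a stream.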

Answer 1

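This is possible in Spark 1.6.x, because Databricks' CSV library (spark-csv) provides a method that converts an RDD[String] using its CSV parser. In Spark versions >= 2.0, this support was merged into the main project, and that method was removed from the interface. In addition, many of the methods are private, so it is harder to work around, but it may be worth exploring the underlying univocity parsing library.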
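For later releases there may be a route that avoids the private parsers: as far as I know, Spark 2.2 added a `DataFrameReader.csv` overload that accepts a `Dataset[String]`. The sketch below assumes Spark 2.2 or newer; the object and method names are hypothetical, and without a header the columns get Spark's default names.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, assuming Spark >= 2.2.
object CsvRddInference {
  // Parses CSV lines held in an RDD[String] and lets Spark infer the column types.
  def toDataFrame(spark: SparkSession, lines: RDD[String]): DataFrame = {
    import spark.implicits._ // provides the Encoder[String] required by createDataset
    val lineDataset = spark.createDataset(lines)
    spark.read
      .option("inferSchema", "true") // costs an extra pass over the data
      .option("header", "false")     // assumption: the lines carry no header row
      .csv(lineDataset)
  }
}
```

Calling `CsvRddInference.toDataFrame(spark, rdd).write.parquet(path)` would then cover the write step.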

Using Databricks' Spark CSV support on Spark 1.6.1, we can do something like this:

import com.databricks.spark.csv.CsvParser
import org.apache.spark.sql.SQLContext

// sparkContext is an existing SparkContext
val sqlContext = new SQLContext(sparkContext)
val parser = new CsvParser().withInferSchema(true)

val rdd = sparkContext.textFile("/home/maasg/playground/data/sample-no-header.csv")
rdd.take(1) // show a sample record
// Array[String] = Array(2000,JOHN,KINGS,50)

val df = parser.csvRdd(sqlContext, rdd) // parse the RDD[String] and infer the column types
df.schema // let's inspect the inferred schema
// org.apache.spark.sql.types.StructType = StructType(StructField(C0,IntegerType,true), StructField(C1,StringType,true), StructField(C2,StringType,true), StructField(C3,IntegerType,true))
df.write.parquet("/tmp/sample.parquet") // write it to Parquet

Integrating this kind of code into Spark Streaming, inside a foreachRDD { rdd => ... } call, should be straightforward.
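A minimal sketch of that integration, assuming Spark 1.6.x with spark-csv on the classpath. The object name, method name, and `outputDir` parameter are hypothetical, and appending every batch to one directory is just one possible layout.

```scala
import com.databricks.spark.csv.CsvParser
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical wrapper around the foreachRDD call.
object CsvStreamToParquet {
  def writeBatches(lines: DStream[String], sqlContext: SQLContext, outputDir: String): Unit = {
    lines.foreachRDD { rdd =>
      // Skip empty micro-batches: there is nothing to infer a schema from.
      if (!rdd.isEmpty()) {
        val parser = new CsvParser().withInferSchema(true)
        val df = parser.csvRdd(sqlContext, rdd)
        // Append so that each batch adds files instead of replacing earlier ones.
        df.write.mode(SaveMode.Append).parquet(outputDir)
      }
    }
  }
}
```

Note that the schema is inferred separately for each batch, so two batches can disagree on a column type (for example, when a numeric column happens to contain text in one batch). If that matters, it may be safer to infer the schema once and reuse it.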

Answer 2

You need to convert the RDD[String] to an RDD[Row], and then supply a schema to turn the RDD[Row] into a DataFrame.

See:

https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
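As a rough illustration of that approach with typed columns, here is a sketch assuming Spark 2.x and a four-column layout like `2000,JOHN,KINGS,50`. The column names `c0` to `c3` and the object name are hypothetical, and the schema has to be known up front, so this does not infer anything.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical helper: the schema is fixed in code, not inferred.
object ExplicitSchema {
  val schema: StructType = StructType(Seq(
    StructField("c0", IntegerType, nullable = true),
    StructField("c1", StringType, nullable = true),
    StructField("c2", StringType, nullable = true),
    StructField("c3", IntegerType, nullable = true)
  ))

  def toDataFrame(spark: SparkSession, lines: RDD[String]): DataFrame = {
    // Naive split: assumes no quoted fields containing commas.
    val rows = lines.map(_.split(",", -1)).map { fields =>
      // Each value must be converted to the type its StructField declares.
      Row(fields(0).trim.toInt, fields(1), fields(2), fields(3).trim.toInt)
    }
    spark.createDataFrame(rows, schema)
  }
}
```

A line that does not match the layout (too few fields, or a non-numeric value in `c0` or `c3`) makes the job fail, so real input would need validation first.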

Answer 3

When you have an RDD[String], you should also have its schema available as a string (or in any other form that you can parse).

So, for example:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// If we just think of TWO FIELDS
val schemaString = "f1;f2"

// Generate the schema from the schema string
val fields = schemaString.split(";").map(fn => StructField(fn, StringType))
val schema = StructType(fields)

// Convert the records of the RDD[String] to Rows
// Assuming each CSV line uses a comma as the delimiter;
// rdd stands for your RDD[String], and the -1 limit keeps trailing empty fields
val rowRDD = rdd.map(_.split(",", -1)).map(array => Row(array(0), array(1)))

// Apply the schema to the RDD
val df = spark.createDataFrame(rowRDD, schema)

You can now use the df instance to save the data in Parquet format.
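The write step could look like the sketch below, assuming Spark 2.x. The object name and `outputPath` parameter are hypothetical, and `SaveMode.Overwrite` is only one choice: a streaming job that writes every batch to the same directory would more likely want `SaveMode.Append`.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper for the final write.
object ParquetOutput {
  // Writes df to outputPath, replacing whatever an earlier run left there.
  def save(df: DataFrame, outputPath: String): Unit =
    df.write.mode(SaveMode.Overwrite).parquet(outputPath)
}
```

Reading the result back with `spark.read.parquet(outputPath)` is a quick way to check that the schema was stored as expected.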
