Spark dataframe databricks csv appends extra double quotes

Time: 2017-06-07 13:51:46

Tags: apache-spark apache-spark-sql databricks

When I apply CONCAT to a dataframe in Spark SQL and store that dataframe as a csv file in an HDFS location, extra double quotes seem to be added to the concatenated column, but only in the output file.

When I apply show, these double quotes are not added; they appear only when I store the dataframe as a csv file.

It looks like I need to get rid of the extra double quotes that are added when the dataframe is saved as a csv file.

I am using the com.databricks:spark-csv_2.10:1.1.0 jar.

The Spark version is 1.5.0-cdh5.5.1.

Input:

 campaign_file_name_1, campaign_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89,    1
 campaign_file_name_1, campaign_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk,    2

Expected output:

 campaign_file_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89,     campaign_name_1"="1,  2017-06-06 17:09:31
 campaign_file_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk,   campaign_name_1"="2,  2017-06-06 17:09:31

Spark Code:

  // Imports for the classes referenced below (BaseETL, ApplicationUtil, sqlContext and the *Loc values come from the project).
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.sql.DataFrame
  import org.slf4j.LoggerFactory

  object campaignResultsMergerETL extends BaseETL {

    val now  = ApplicationUtil.getCurrentTimeStamp()
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)
    val log  = LoggerFactory.getLogger(this.getClass.getName)

    def main(args: Array[String]): Unit = {
      //---------------------
      // code for sqlContext initialization
      //---------------------
      val campaignResultsDF = sqlContext.read.format("com.databricks.spark.avro").load(campaignResultsLoc)
      campaignResultsDF.registerTempTable("campaign_results")

      val campaignGroupedDF = sqlContext.sql(
        """
          |SELECT campaign_file_name,
          |campaign_name,
          |tracker_id,
          |SUM(campaign_measure) AS campaign_measure
          |FROM campaign_results
          |GROUP BY campaign_file_name, campaign_name, tracker_id
        """.stripMargin)

      campaignGroupedDF.registerTempTable("campaign_results_full")

      val campaignMergedDF = sqlContext.sql(
        s"""
          |SELECT campaign_file_name,
          |tracker_id,
          |CONCAT(campaign_name, '\"=\"', campaign_measure),
          |"$now" AS audit_timestamp
          |FROM campaign_results_full
        """.stripMargin)

      campaignMergedDF.show(20)
      saveAsCSVFiles(campaignMergedDF, campaignResultsExportLoc, numPartitions)
    }

    def saveAsCSVFiles(campaignMeasureDF: DataFrame, hdfs_output_loc: String, numPartitions: Int): Unit = {
      log.info("saveAsCSVFile method started")
      // Overwrite the output directory if it already exists.
      if (fs.exists(new Path(hdfs_output_loc))) {
        fs.delete(new Path(hdfs_output_loc), true)
      }
      campaignMeasureDF.repartition(numPartitions)
        .write.format("com.databricks.spark.csv")
        .save(hdfs_output_loc)
      log.info("saveAsCSVFile method ended")
    }
  }

The result of campaignMergedDF.show(20) is correct and looks fine:

 campaign_file_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89,   campaign_name_1"="1,  2017-06-06 17:09:31
 campaign_file_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk,   campaign_name_1"="2,  2017-06-06 17:09:31

The result of saveAsCSVFiles is incorrect:

 campaign_file_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89,   "campaign_name_1""=""1",  2017-06-06 17:09:31
 campaign_file_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk,   "campaign_name_1""=""2",  2017-06-06 17:09:31

Could someone help me resolve this issue?

1 answer:

Answer 0 (score: 1):

When you use

write.format("com.databricks.spark.csv").save(hdfs_output_loc)

you will run into this problem whenever the text you write contains ", because spark-csv uses " as its default quote character.

"中的默认引号替换为其他内容(例如NULL)应该允许您按原样将"写入文件。

write.format("com.databricks.spark.csv").option("quote", "\u0000").save(hdfs_output_loc)

Explanation:

You are using the spark-csv defaults:

  • escape character: \
  • quote character: "

From the spark-csv docs:

  • quote: by default the quote character is ", but it can be set to any character. Delimiters inside quotes are ignored
  • escape: by default the escape character is \, but it can be set to any character. Escaped quote characters are ignored

This answer suggests the following:

  The way to turn off the default escaping of the double quote character (") with the backslash character (\), i.e. to avoid escaping entirely, is to add an .option() method call with just the right parameters after the .write() method call. The goal of the option() call is to change how the csv writer "finds" instances of the "quote" character as it emits the content. To do this, you have to change the default of what a "quote" actually means, i.e. change the character being looked for from the double quote character (") to the Unicode "\u0000" character (essentially providing the Unicode NUL character, assuming it never occurs within the document).
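For completeness, here is a self-contained sketch that contrasts the two behaviours (it assumes Spark 1.5.x with the com.databricks:spark-csv package on the classpath; the object name, column names, and /tmp output paths are made up for illustration):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  object QuoteOptionDemo {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("QuoteOptionDemo").setMaster("local[*]"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      // One row whose second column contains embedded double quotes, like the CONCAT result above.
      val df = sc.parallelize(Seq(("campaign_file_name_1", "campaign_name_1\"=\"1"))).toDF("file", "kv")

      // Default quoting: the embedded " is doubled and the field is wrapped in quotes,
      // i.e. "campaign_name_1""=""1" ends up in the file.
      df.write.format("com.databricks.spark.csv").save("/tmp/with_default_quote")

      // Quote character replaced by NUL: the field is written exactly as shown by show(),
      // i.e. campaign_name_1"="1.
      df.write.format("com.databricks.spark.csv")
        .option("quote", "\u0000")
        .save("/tmp/with_nul_quote")

      sc.stop()
    }
  }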