How to read a file with a text qualifier in Scala

Time: 2017-07-13 14:12:11

Tags: scala apache-spark dataframe

I want to read a file that contains the following data:

"Name","Surname","Age","Birthdate","Address","PhoneNumber"
"Chaitra","Shenoy","21","1995-08-26","A-123,Spring blossom Area"
"Sapna","Soni","22","1994-04-16","B-56,Ganga Park,Ghorpadi","9022"
"Tanvi","Mutha","48","1969-03-24","A-23,Valencia,Mundhwa","1256","Yes"
"Shivani","Adsar","55","1961-11-09","Saptami-234,Udita,Salt Lake","5485"
"Chaitra","Shenoy","21","1995-08-26","A-123,Spring blossom Area","5555"
"Sapna","Soni","22","1994-04-16","B-56,Ganga Park,Ghorpadi"

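When I read the file with spark.read.option("delimiter", ",").csv(filename), the Address column is read correctly even though it contains ',', which is the delimiter.

The problem with this approach is that for rows with more or fewer columns than expected, the read function either truncates the row or pads it with extra delimiters in the resulting DataFrame. That is not the output I want.

The output I want contains only the rows with the required number of delimiters, which is 5 in this case (counting only the delimiters between fields, not the commas inside quoted values). Records with more or fewer delimiters need to be rejected.

So the good records are: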

"Sapna","Soni","22","1994-04-16","B-56,Ganga Park,Ghorpadi","9022"
"Shivani","Adsar","55","1961-11-09","Saptami-234,Udita,Salt Lake","5485"
"Chaitra","Shenoy","21","1995-08-26","A-123,Spring blossom Area","5555"

And the bad records are:

"Chaitra","Shenoy","21","1995-08-26","A-123,Spring blossom Area"
"Tanvi","Mutha","48","1969-03-24","A-23,Valencia,Mundhwa","1256","Yes"
"Sapna","Soni","22","1994-04-16","B-56,Ganga Park,Ghorpadi"

Reading the file as described above does not let me identify the bad records.

How can I do this?

1 answer:

Answer 0: (score: 0)

Looking at your data

"Name","Surname","Age","Birthdate","Address","PhoneNumber"
"Chaitra","Shenoy","21","1995-08-26","A-123,Spring blossom Area"
"Sapna","Soni","22","1994-04-16","B-56,Ganga Park,Ghorpadi","9022"
"Tanvi","Mutha","48","1969-03-24","A-23,Valencia,Mundhwa","1256","Yes"
"Shivani","Adsar","55","1961-11-09","Saptami-234,Udita,Salt Lake","5485"
"Chaitra","Shenoy","21","1995-08-26","A-123,Spring blossom Area","5555"
"Sapna","Soni","22","1994-04-16","B-56,Ganga Park,Ghorpadi"

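it seems there is a header row that can be used for the column names in the DataFrame. You can use the header option together with the format option, as follows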

spark.read
  .format("com.databricks.spark.csv")
  .option("header", true)
  .csv("path to your csv file")
  .show(false)

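This should give you the following output DataFrame. Note that the rows with a missing PhoneNumber are padded with null, and the extra "Yes" value in the Tanvi row is dropped.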

+-------+-------+---+----------+---------------------------+-----------+
|Name   |Surname|Age|Birthdate |Address                    |PhoneNumber|
+-------+-------+---+----------+---------------------------+-----------+
|Chaitra|Shenoy |21 |1995-08-26|A-123,Spring blossom Area  |null       |
|Sapna  |Soni   |22 |1994-04-16|B-56,Ganga Park,Ghorpadi   |9022       |
|Tanvi  |Mutha  |48 |1969-03-24|A-23,Valencia,Mundhwa      |1256       |
|Shivani|Adsar  |55 |1961-11-09|Saptami-234,Udita,Salt Lake|5485       |
|Chaitra|Shenoy |21 |1995-08-26|A-123,Spring blossom Area  |5555       |
|Sapna  |Soni   |22 |1994-04-16|B-56,Ganga Park,Ghorpadi   |null       |
+-------+-------+---+----------+---------------------------+-----------+
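
The read above pads or truncates the short and long rows rather than rejecting them, so it does not by itself separate good records from bad ones. One possible way to do that, sketched here in plain Scala without Spark, is to count the fields on each raw line while ignoring delimiters inside double quotes, and then compare that count with the header's. The names QuotedCsvFilter, countFields and splitRecords are hypothetical helpers made up for this illustration, not part of any Spark API, and the sketch assumes that every record fits on a single line.

```scala
// Hypothetical helper for illustration only; not part of Spark.
object QuotedCsvFilter {

  // Counts the fields in one CSV line. A delimiter that appears between
  // double quotes is treated as part of the field, not as a separator.
  // An escaped quote written as "" toggles the flag twice, so it has no net effect.
  def countFields(line: String, delimiter: Char = ',', quote: Char = '"'): Int = {
    var inQuotes = false
    var fields = 1
    line.foreach { c =>
      if (c == quote) inQuotes = !inQuotes
      else if (c == delimiter && !inQuotes) fields += 1
    }
    fields
  }

  // Splits lines into (good, bad) by comparing each field count with the expected one.
  def splitRecords(lines: Seq[String], expected: Int): (Seq[String], Seq[String]) =
    lines.partition(line => countFields(line) == expected)

  def main(args: Array[String]): Unit = {
    // Sample data copied from the question.
    val raw =
      """
        |"Name","Surname","Age","Birthdate","Address","PhoneNumber"
        |"Chaitra","Shenoy","21","1995-08-26","A-123,Spring blossom Area"
        |"Sapna","Soni","22","1994-04-16","B-56,Ganga Park,Ghorpadi","9022"
        |"Tanvi","Mutha","48","1969-03-24","A-23,Valencia,Mundhwa","1256","Yes"
        |"Shivani","Adsar","55","1961-11-09","Saptami-234,Udita,Salt Lake","5485"
        |"Chaitra","Shenoy","21","1995-08-26","A-123,Spring blossom Area","5555"
        |"Sapna","Soni","22","1994-04-16","B-56,Ganga Park,Ghorpadi"
        |""".stripMargin

    val lines = raw.split("\n").map(_.trim).filter(_.nonEmpty).toSeq
    val expected = countFields(lines.head) // the header defines the expected field count
    val (good, bad) = splitRecords(lines.tail, expected)

    println("Good records:")
    good.foreach(println)
    println("Bad records:")
    bad.foreach(println)
  }
}
```

On the sample data this separates the three good and three bad records listed in the question. In Spark, the same predicate could presumably be applied to the lines returned by spark.read.textFile(path) before the surviving rows are parsed as CSV; that part is a suggestion and has not been tested here.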

I hope the answer is helpful