Why does reading a CSV file fail in cluster mode (it works locally)?

Asked: 2016-04-01 10:13:15

Tags: scala apache-spark apache-spark-sql

Reading a CSV fails for some CSV files, while the same code works fine on other CSV files. df.printSchema prints the schema as expected, but df.show fails with the stage failure below.

val df = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://myIp:9000/data/time.csv")
df.printSchema
df.show
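
For context, on Spark 1.x the short name "csv" resolves to the external databricks spark-csv package (that is the com.databricks.spark.csv.CsvRelation visible in the stack trace below), so the read above is equivalent to spelling out the full data source name. A minimal sketch, assuming the same file; df2 is just an illustrative name:

// Equivalent read using the fully qualified spark-csv data source name;
// on Spark 1.x both forms go through com.databricks.spark.csv.CsvRelation.
val df2 = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://myIp:9000/data/time.csv")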

time.csv

Date,norm date,lala
1302820211,"Thu, 14 Apr 2011 22:30:11 GMT",2016-03-28
1372820211,"Wed, 03 Jul 2013 02:56:51 GMT",2016-03-28
1304820211,"Sun, 08 May 2011 02:03:31 GMT",2016-03-28

Error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 20, slave03): java.lang.NullPointerException
at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$2.apply(CsvRelation.scala:120)
at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$2.apply(CsvRelation.scala:107)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

1 Answer:

Answer 0 (score: 0)

Adding the Apache commons-csv jar alongside the databricks spark-csv jar worked for me; initially I had included only the databricks spark-csv jar. spark-csv uses commons-csv at runtime, so the cluster's executors fail once it is missing from their classpath, whereas local mode runs everything in the driver JVM, where it was already available.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("TEST APP")
  .setMaster("spark://ip:7077")
  // Ship both jars to the executors, not just spark-csv itself.
  .setJars(Seq(
    "pathto/spark-csv_2.10-1.4.0.jar",
    "pathTo/commons-csv-1.1.jar"))
val sc = SparkContext.getOrCreate(conf)
val sqlContext = new SQLContext(sc)
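
The jars can also be distributed after the context exists via SparkContext.addJar, which makes a jar available to the executors. A minimal sketch, assuming the same placeholder paths as above:

// Alternative: hand the jars to an already-running SparkContext.
// Both jars still have to reach the executors; local mode only worked
// because everything ran in the driver JVM, which had them on its classpath.
sc.addJar("pathto/spark-csv_2.10-1.4.0.jar")
sc.addJar("pathTo/commons-csv-1.1.jar")

Either way, the fix is the same: a local master runs tasks inside the driver JVM, so the driver classpath suffices there, while standalone executors only see what setJars, addJar, or spark-submit --jars ships to them.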