How to break a long RDD lineage to avoid a StackOverflowError

Asked: 2019-03-27 19:59:08

Tags: apache-spark apache-spark-sql hdfs spark-checkpoint

I am trying to merge a large number of small Avro files (in HDFS) into Parquet files. It seems that whenever the directory holds a large number of Avro files, the job fails with ERROR yarn.ApplicationMaster: User class threw exception: java.lang.StackOverflowError.

The error:

19/03/26 15:14:14 INFO avro.AvroRelation: Listing hdfs://rc-hddd701.dev.local:8020/ur/source/Avro/urlog-avro-dm.dmlog/2019/03/21/20190321235700-dm-appd703-9abf19b8-2f6f-4341-87d7-74c0175e980d.avro on driver
19/03/26 15:14:14 INFO avro.AvroRelation: Listing hdfs://rc-hddd701.dev.local:8020/ur/source/Avro/urlog-avro-dm.dmlog/2019/03/21/20190321235800-DM-APPTSTD701-6af176ba-68f8-4420-b1b0-2f2be6abf003.avro on driver
19/03/26 15:14:14 INFO avro.AvroRelation: Listing hdfs://rc-hddd701.dev.local:8020/ur/source/Avro/urlog-avro-dm.dmlog/2019/03/21/20190321235800-dm-appd701-70b0ff1c-1664-4ce7-8321-149e12961627.avro on driver
19/03/26 15:14:14 INFO avro.AvroRelation: Listing hdfs://rc-hddd701.dev.local:8020/ur/source/Avro/urlog-avro-dm.dmlog/2019/03/21/20190321235800-dm-appd702-3dcbe094-14c9-4a4f-b326-57256df78b50.avro on driver
19/03/26 15:14:14 INFO avro.AvroRelation: Listing hdfs://rc-hddd701.dev.local:8020/ur/source/Avro/urlog-avro-dm.dmlog/2019/03/21/20190321235800-dm-appd703-a8a3ef8b-4dc0-41c1-a69a-2ef432fee0af.avro on driver
19/03/26 15:14:56 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.StackOverflowError
java.lang.StackOverflowError
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

The code I am using:

    // One DataFrame per Avro file.
    val df_array = filePaths.map(path =>
      sqlContext.read.format("com.databricks.spark.avro").load(path.toString))
    // Pairwise union of all per-file DataFrames into a single DataFrame.
    val df_mid = df_array.reduce((df1, df2) => df1.unionAll(df2))
    // Derive a yyyy-MM-dd partition column from the event timestamp.
    val df = df_mid
      .withColumn("dt", date_format(df_mid.col("timeStamp"), "yyyy-MM-dd"))
      .filter("dt != 'null'")
    df
      .repartition(df.col("dt"))  // repartition vs coalesce: https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce
      .write.partitionBy("dt")
      .mode(SaveMode.Append)
      .option("compression", "snappy")
      .parquet(avroConsolidator.parquetFilePathSpark.toString)

where filePaths is an Array[Path].

This code works if the number of paths I try to process is small.
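That behavior points at the reduce over unionAll: it builds a deeply nested logical plan with one Union node per input file, and Catalyst's recursive traversal of a plan that deep is most likely what overflows the stack. If the goal is simply to read every file, one option worth trying first, sketched here under the assumption that the spark-avro build in use accepts multiple paths in a single load (the FileFormat-based releases for Spark 2.x do), is to skip the union chain entirely:

    // Sketch: one scan over all files instead of one DataFrame per file.
    // A single load call yields a single relation node, so the logical plan
    // stays flat no matter how many paths are passed in.
    val df_mid = sqlContext.read
      .format("com.databricks.spark.avro")
      .load(filePaths.map(_.toString): _*)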

After some digging around, I found that checkpointing the DataFrame might be one way to mitigate the problem, but I am not sure how to implement it.
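A minimal sketch of what that could look like, assuming Spark 2.1's eager Dataset.checkpoint() and a checkpoint directory the job can write to (the HDFS path and the batch size of 50 below are placeholders, not values from the original job):

    // Checkpointing materializes a DataFrame to the checkpoint directory and
    // returns a new DataFrame whose plan no longer references the unions that
    // produced it, so the lineage is cut at every batch boundary.
    sqlContext.sparkContext.setCheckpointDir("hdfs:///tmp/avro-consolidation-ckpt")  // placeholder path

    val df_mid = df_array
      .grouped(50)                                           // union at most 50 files at a time (placeholder size)
      .map(_.reduce((a, b) => a.unionAll(b)).checkpoint())   // checkpoint() is eager by default in 2.1
      .reduce((a, b) => a.unionAll(b))                       // union the already-truncated batches

With batches of 50, the deepest plan the analyzer ever has to walk is roughly 50 unions plus one union per batch in the final reduce, rather than one union per file.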

Spark version: 2.1

0 Answers
