Memory leak in Apache Spark DataFrame

Time: 2016-04-04 18:44:32

Tags: apache-spark apache-spark-sql

I am receiving the following error while testing a Spark app written in Scala. I am submitting the job in Spark local mode. My intention is to process sensor data with a Spark DataFrame and group it by week of the year. This is just a prototype app.

16/04/04 23:49:06 WARN memory.TaskMemoryManager: leak 16.3 MB memory from org.apache.spark.unsafe.map.BytesToBytesMap@70fbb930
16/04/04 23:49:06 ERROR executor.Executor: Managed memory leak detected; size = 17039360 bytes, TID = 1
16/04/04 23:49:06 ERROR executor.Executor: Managed memory leak detected; size = 17039360 bytes, TID = 0
16/04/04 23:49:06 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NumberFormatException: multiple points
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1890)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at java.text.DigitList.getDouble(DigitList.java:169)
    at java.text.DecimalFormat.parse(DecimalFormat.java:2056)
    at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:1869)
    at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514)
    at java.text.DateFormat.parse(DateFormat.java:364)
    at SensorStreaming$.to_date(SensorsStreaming.scala:24)
...

I am running the following piece of code, written in Scala, on Apache Spark 1.6.0. While the grouping query does not work, a simple select query (with no grouping) on the same temp table works just fine. I am using the org.apache.spark.sql.functions.weekofyear function, which was introduced in Spark 1.5.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext, functions}

import java.text.SimpleDateFormat
import java.sql.Date


case class Sensor(resid: String, 
      date: java.sql.Date, 
      time: String, 
      hz: Double, 
      disp: Double, 
      flo: Double, 
      sedPPM: Double, 
      psi: Double, 
      chlPPM: Double)

object SensorStreaming {

  private val formatter = new SimpleDateFormat("M/d/y")

  def to_date(s:String):java.sql.Date = {
    new java.sql.Date(formatter.parse(s).getTime)
  }

  def parse(splits: Array[String]):Sensor = {

    Sensor(splits(0), 
        to_date(splits(1)), 
        splits(2), 
        splits(3).toDouble, 
        splits(4).toDouble, 
        splits(5).toDouble, 
        splits(6).toDouble, 
        splits(7).toDouble, 
        splits(8).toDouble)
  }

  def main(args: Array[String]){

      val conf = new SparkConf()
                .setAppName("SensorStreamingApp")
                .setMaster("local[*]")
      val sc = new SparkContext(conf)
      val sqlContext = new SQLContext(sc)
      import sqlContext._
      import sqlContext.implicits._

      val file = "/Volumes/SONY/Data/sensor_data/sensordata.csv"
      val rdd = sc.textFile(file).map(_.split(","))

      val df = rdd.map(parse).toDF
      df.registerTempTable("sensors");
      sqlContext.sql("select weekofyear(date) from sensors group by weekofyear(date)").show
  }
}
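
One thing I noticed while looking at the stack trace: it ends inside my to_date helper, and SimpleDateFormat is not thread-safe, while local[*] runs TID 0 and TID 1 concurrently in the same JVM. Below is a minimal sketch of the change I am considering (the SafeDateParser name is only for illustration), assuming the shared formatter is what corrupts the parse:

import java.text.SimpleDateFormat
import java.sql.Date

object SafeDateParser {

  // SimpleDateFormat keeps mutable parsing state, so give each task thread its own instance.
  private val formatter = new ThreadLocal[SimpleDateFormat] {
    override def initialValue(): SimpleDateFormat = new SimpleDateFormat("M/d/y")
  }

  def to_date(s: String): Date =
    new Date(formatter.get().parse(s).getTime)
}

A simpler alternative would be to construct a new SimpleDateFormat inside to_date on every call, at the cost of one allocation per record. I am also not sure whether the "managed memory leak" messages are the real problem or just a side effect of the task dying in the middle of the aggregation.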

0 Answers:

No answers yet