Spark throws a StackOverflowError when training with FPGrowth

Date: 2016-06-29 07:47:24

Tags: scala spark-streaming apache-spark-mllib

I am using FPGrowth from Spark's MLlib to find frequent patterns. Here is my code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth

object FPGrowthExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FPGrowthExample")
    val sc = new SparkContext(conf)
    // Each input line is one transaction; items are separated by spaces
    val data = sc.textFile("/user/text").map(s => s.trim.split(" ")).cache()
    val fpg = new FPGrowth().setMinSupport(0.005).setNumPartitions(10)
    val model = fpg.run(data)
    val output = model.freqItemsets.map(itemset =>
      itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
    output.repartition(1).saveAsTextFile("/user/result")
    sc.stop()
  }
}
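For context on the two knobs used above: setMinSupport decides which items count as frequent (and therefore how much ends up in the FP-trees), and setNumPartitions controls how many partitions the parallel FP-Growth run uses. A minimal variant with less aggressive settings is sketched below; the concrete numbers (0.01 and 100) are assumptions for experimentation, not values from the original job.

// Sketch only: same pipeline with a higher support threshold and more partitions.
// 0.01 and 100 are assumed values to try, not a known fix for this dataset.
val fpgSketch = new FPGrowth()
  .setMinSupport(0.01)    // keep fewer items "frequent", so less tree content
  .setNumPartitions(100)  // spread the conditional trees over more tasks
val modelSketch = fpgSketch.run(data)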

The input text has 800,000 lines, each line treated as one doc, and Spark throws a StackOverflowError. Here is the error:

java.lang.StackOverflowError
at java.lang.Exception.<init>(Exception.java:102)
at java.lang.ReflectiveOperationException.<init>(ReflectiveOperationException.java:89)
at java.lang.reflect.InvocationTargetException.<init>(InvocationTargetException.java:72)
at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at scala.collection.mutable.HashMap$$anonfun$writeObject$1.apply(HashMap.scala:137)
at scala.collection.mutable.HashMap$$anonfun$writeObject$1.apply(HashMap.scala:135)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashTable$class.serializeTo(HashTable.scala:124)
at scala.collection.mutable.HashMap.serializeTo(HashMap.scala:39)
at scala.collection.mutable.HashMap.writeObject(HashMap.scala:135)

Here is my submit script:

/usr/local/webserver/spark-1.5.1-bin-2.6.0/bin/spark-submit --master yarn --deploy-mode cluster \
  --num-executors 30 --driver-memory 30g \
  --executor-memory 30g --executor-cores 10 \
  --conf spark.driver.maxResultSize=10g --class FPGrowthExample project.jar

I don't know how to fix this; the job runs fine when the input has only 1,000 lines.
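One observation from the trace above, offered only as a hedged sketch and not a confirmed fix: the failure happens inside recursive Java serialization of a scala.collection.mutable.HashMap, and that kind of deep writeObject recursion is limited by the JVM thread stack size (-Xss), not by the heap sizes set above. A variant of the submit command with a larger thread stack on driver and executors could look like the following; the -Xss64m value is an assumption, not a tested setting for this job.

# Sketch: same submit command with a larger per-thread JVM stack.
# -Xss64m is an assumed value; it is not taken from the original question.
/usr/local/webserver/spark-1.5.1-bin-2.6.0/bin/spark-submit --master yarn --deploy-mode cluster \
  --num-executors 30 --driver-memory 30g \
  --executor-memory 30g --executor-cores 10 \
  --conf spark.driver.maxResultSize=10g \
  --conf "spark.driver.extraJavaOptions=-Xss64m" \
  --conf "spark.executor.extraJavaOptions=-Xss64m" \
  --class FPGrowthExample project.jar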

0 Answers

No answers yet.