Convert a JavaPairRDD to a list without collect()

Asked: 2018-08-30 14:39:21

Tags: scala apache-spark hadoop groovy

My Spark job crashes at the collect() statement. Below is the error I receive.

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at sun.reflect.GeneratedConstructorAccessor13.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:83)
    at org.codehaus.groovy.runtime.callsite.ConstructorSite$ConstructorSiteNoUnwrapNoCoerce.callConstructor(ConstructorSite.java:105)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:239)
    at org.oclc.wcsync.hadoop.serverdsl.record.InputRecord.<init>(InputRecord.groovy:50)
    at org.oclc.wcsync.hadoop.serverdsl.record.InputRecordConstructorAccess.newInstance(Unknown Source)
    at com.twitter.chill.Instantiators$$anonfun$reflectAsm$1.apply(KryoBase.scala:141)
    at com.twitter.chill.Instantiators$$anon$1.newInstance(KryoBase.scala:125)
    at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1090)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:570)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:546)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
    at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
    at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
    at com.twitter.chill.Tuple1Serializer.read(TupleSerializers.scala:30)
    at com.twitter.chill.Tuple1Serializer.read(TupleSerializers.scala:23)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:246)
    at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:158)
    at org.apache.spark.util.collection.ExternalSorter$SpillReader.org$apache$spark$util$collection$ExternalSorter$SpillReader$$readNextItem(ExternalSorter.scala:558)
18/08/29 15:20:06 INFO storage.DiskBlockManager: Shutdown hook called
18/08/29 15:20:06 INFO util.ShutdownHookManager: Shutdown hook called

Here is the code:

JavaRDD<File> myRecords = sc.parallelize(mapper.myFunction(records.collect())).cache()

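myFunction takes the list of records and iterates over it, so I call records.collect() and pass the result into myFunction. However, collect() pulls all of the data onto the driver, which causes this error. I am looking for any alternative that avoids it. I know count can be used instead of collect in some cases, but here I need a list.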

    List myFunction(List<scala.Tuple2<String, Tuple1<List<Record>>>> data) {
        List<File> list = []

        // Iterate through the List of Tuple2 instances


        list
    }
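One possible direction, sketched here under assumptions rather than taken from the question: if each output File depends only on its own tuple, myFunction could consume an Iterator instead of a List, so Spark can call it once per partition on the executors and nothing has to be collected on the driver. The sketch below is plain Java with no Spark dependency so that it runs on its own. PartitionMapper, process, Record, and OutFile are hypothetical names, and Map.Entry stands in for scala.Tuple2 (the Tuple1 wrapper is dropped for brevity).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the question's code base.
final class PartitionMapper {

    /** Placeholder for the question's Record type. */
    static final class Record {
        final String value;
        Record(String value) { this.value = value; }
    }

    /** Placeholder for the question's File type (not java.io.File). */
    static final class OutFile {
        final String name;
        final int recordCount;
        OutFile(String name, int recordCount) {
            this.name = name;
            this.recordCount = recordCount;
        }
    }

    /**
     * Lazily turns each (key, records) pair into one OutFile.
     * Only the pair currently being converted is touched, so the
     * full input never has to be held as a List.
     */
    static Iterator<OutFile> process(final Iterator<Map.Entry<String, List<Record>>> pairs) {
        return new Iterator<OutFile>() {
            @Override
            public boolean hasNext() {
                return pairs.hasNext();
            }

            @Override
            public OutFile next() {
                Map.Entry<String, List<Record>> pair = pairs.next();
                // Assumed per-tuple logic; the real body of myFunction is not shown in the question.
                return new OutFile(pair.getKey(), pair.getValue().size());
            }
        };
    }

    public static void main(String[] args) {
        // A small in-memory list plays the role of one partition.
        List<Map.Entry<String, List<Record>>> partition = new ArrayList<>();
        partition.add(new SimpleEntry<String, List<Record>>(
                "a", Arrays.asList(new Record("r1"), new Record("r2"))));
        partition.add(new SimpleEntry<String, List<Record>>(
                "b", Arrays.asList(new Record("r3"))));

        Iterator<OutFile> out = process(partition.iterator());
        while (out.hasNext()) {
            OutFile f = out.next();
            System.out.println(f.name + " " + f.recordCount);
        }
    }
}
```

With a method of this shape taking Iterator<Tuple2<...>>, the driver-side line could plausibly become records.mapPartitions(it -> mapper.myFunction(it)).cache(). This is untested against the question's code: it assumes Spark 2.x, where the function passed to mapPartitions returns an Iterator (in Spark 1.x it returns an Iterable), and it assumes mapper is serializable. If myFunction genuinely needs to see all records at once, this approach does not apply as written.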

Thank you very much for your help.

0 answers:

No answers