Saving an RDD as an Avro file

Time: 2015-11-01 16:36:43

Tags: apache-spark avro

I wrote this sample program to save an RDD to an Avro file.

I am using CDH 5.4 and Spark 1.3.

I wrote this avsc file and then generated the code for the class User:
{"namespace": "com.abhi",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "firstname", "type": "string"},
     {"name": "lastname",  "type": "string"} ]
}

Then I generated the code for User:
java -jar ~/Downloads/avro-tools-1.7.7.jar compile schema User.avsc .

Then I wrote my example:

package com.abhi

import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkConf
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroKeyOutputFormat, AvroJob, AvroKeyInputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext

object MySpark {
  def main(args : Array[String]) : Unit = {
    val sf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("MySpark")
    val sc = new SparkContext(sf)

    val user1 = new User();
    user1.setFirstname("Test1");
    user1.setLastname("Test2");

    val user2 = new User("Test3", "Test4");

    // Construct via builder
    val user3 = User.newBuilder()
      .setFirstname("Test5")
      .setLastname("Test6")
      .build()

    val list = Array(user1, user2, user3)
    val userRdd = sc.parallelize(list)

    val job: Job = Job.getInstance()
    AvroJob.setOutputKeySchema(job, user1.getSchema)

    val output = "/user/cloudera/users.avro"
    userRdd.map(row => (new AvroKey(row), NullWritable.get()))
      .saveAsNewAPIHadoopFile(
        output,
        classOf[AvroKey[User]],
        classOf[NullWritable],
        classOf[AvroKeyOutputFormat[User]],
        job.getConfiguration)
  }
}

I have two concerns with this code:

Some of the imports are from the old mapreduce API, and I wonder why Spark code needs them:

import org.apache.hadoop.mapreduce.Job
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroKeyOutputFormat, AvroJob, AvroKeyInputFormat}

When I submit the code to the Hadoop cluster, it throws the exception below. It does create an empty directory called /user/cloudera/users.avro in HDFS.
15/11/01 08:20:42 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/11/01 08:20:42 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/11/01 08:20:42 INFO spark.SparkContext: Starting job: saveAsNewAPIHadoopFile at MySpark.scala:52
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Got job 1 (saveAsNewAPIHadoopFile at MySpark.scala:52) with 2 output partitions (allowLocal=false)
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Final stage: Stage 1(saveAsNewAPIHadoopFile at MySpark.scala:52)
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Missing parents: List()
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[2] at map at MySpark.scala:51), which has no missing parents
15/11/01 08:20:42 INFO storage.MemoryStore: ensureFreeSpace(66904) called with curMem=301745, maxMem=280248975
15/11/01 08:20:42 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 65.3 KB, free 266.9 MB)
15/11/01 08:20:42 INFO storage.MemoryStore: ensureFreeSpace(23066) called with curMem=368649, maxMem=280248975
15/11/01 08:20:42 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 22.5 KB, free 266.9 MB)
15/11/01 08:20:42 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:34630 (size: 22.5 KB, free: 267.2 MB)
15/11/01 08:20:42 INFO storage.BlockManagerMaster: Updated info of block broadcast_2_piece0
15/11/01 08:20:42 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[2] at map at MySpark.scala:51)
15/11/01 08:20:42 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
15/11/01 08:20:42 ERROR scheduler.TaskSetManager: Failed to serialize task 1, not attempting to retry it.
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:240)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:150)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
    at org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:58)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:39)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
    at org.apache.spark.scheduler.Task$.serializeWithDependencies(Task.scala:149)
    at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:464)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:232)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.scheduler.TaskSchedulerImpl.org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:227)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$6.apply(TaskSchedulerImpl.scala:296)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$6.apply(TaskSchedulerImpl.scala:294)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

1 answer:

Answer 0 (score: 0):

The problem is that Spark cannot serialize your User class. Try setting up the Kryo serializer and registering your class with it.
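
A minimal sketch of that suggestion, assuming the generated com.abhi.User class is on the classpath (registerKryoClasses has been available since Spark 1.2, so it works on 1.3):

import org.apache.spark.{SparkConf, SparkContext}

// Switch Spark to Kryo serialization and register the Avro-generated class,
// since the default Java serializer cannot handle it.
val sf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("MySpark")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[com.abhi.User]))
val sc = new SparkContext(sf)

The rest of the job can stay as it is; only the SparkConf setup changes.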
