PySpark: 'bool' object is not callable on a simple .map() call

Asked: 2015-09-28 22:10:28

Tags: python apache-spark yarn pyspark

I'm doing some simple transformations with PySpark and keep hitting a 'bool' object is not callable error. The Spark version is 1.3.0.

I've come across this problem in a few other places (e.g. here and here), but the advice there seems to boil down to verifying that the major Python version matches between the driver and the workers, which I've already done (each is an Anaconda distribution with Python 2.7.10).
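For reference, a minimal way to compare the driver and worker Python versions from the shell (this throwaway-RDD check is just a sketch; any equivalent check works):

import sys

print(sys.version)  # Python version of the driver process

# Ask the executors for their Python version: each task reports the
# version of the worker process it ran in.
worker_versions = (sc.parallelize(range(4), 4)
                     .map(lambda _: sys.version)
                     .distinct()
                     .collect())
print(worker_versions)  # should be a single entry matching the driver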

To debug this, I've been using the iris dataset stored in HDFS:

data = sc.textFile("/path/to/iris.csv")
data.count()  # works fine, returns 150
data.map(lambda x: x[:2])  # just subsets the string, works fine
data.map(lambda x: x.split(','))  # throws error below

These fail (obviously) only when .collect(), .take(), or .count() is called, i.e. when the map is actually evaluated, since the transformations themselves are lazy. So I'm basically looking for any further ideas / things to try to get this configured correctly.
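One way to try to narrow it down (a sketch of my own; the safe_split helper is hypothetical) is to wrap the mapped function so that per-record failures come back as data instead of killing the task:

def safe_split(x):
    # Hypothetical helper: return (ok, payload) so a bad record or the
    # exception text is returned as data rather than aborting the task.
    try:
        return (True, x.split(','))
    except TypeError as e:
        return (False, (x, str(e)))

bad = data.map(safe_split).filter(lambda r: not r[0]).take(5)
print(bad)

The full error output, for reference: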

15/09/28 17:55:08 INFO YarnScheduler: Removed TaskSet 14.0, whose tasks have all completed, from pool
An error occurred while calling o135.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 14.0 failed 4 times, most recent failure: Lost task 1.3 in stage 14.0: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars/spark-assembly-1.3.0-cdh5.4.5-hadoop2.6.0-cdh5.4.5.jar/pyspark/worker.py", line 101, in main
    process()
  File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars/spark-assembly-1.3.0-cdh5.4.5-hadoop2.6.0-cdh5.4.5.jar/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 2253, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 2253, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 2253, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 270, in func
    return f(iterator)
  File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 933, in <lambda>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 933, in <genexpr>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "<stdin>", line 1, in <lambda>
TypeError: 'bool' object is not callable

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
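
For what it's worth, the last Python frame (File "<stdin>", line 1, in <lambda>) shows the exception is raised inside the lambda itself, and the message is generic Python behaviour: it appears whenever a name that should be callable has been rebound to a bool. A minimal, Spark-free sketch of that failure mode (the shadowed name here is purely illustrative):

# Purely illustrative: rebinding a callable name to a bool in the
# interactive session means any later call through that name fails
# with exactly this message.
split = True                 # hypothetical accidental shadowing
try:
    split(',')               # raises TypeError
except TypeError as e:
    print(e)                 # prints: 'bool' object is not callable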

0 Answers
