pyspark: org.apache.spark.SparkException: Failed to execute user defined function ($anonfun$11: (vector) => vector)

Asked: 2017-10-02 04:40:33

Tags: pyspark-sql

The training data and the test data both have 52 features of the same dimension, and the features are extracted in the same way. The program has no syntax errors. The error occurred after I added randomly sampled negative examples.

17/10/02 10:28:23 ERROR HiveMetaStore: Failed to delete table directory: file:/E:/tianchi_taobao/tianchi2/spark-warehouse/re Got exception: org.apache.hadoop.hive.metastore.api.MetaException Unable to delete directory: file:/E:/tianchi_taobao/tianchi2/spark-warehouse/re
Traceback (most recent call last):
  File "E:/tianchi_taobao/tianchi2/test4.py", line 256, in <module>
    spark.sql("create table re as SELECT user_id,item_id FROM result WHERE prediction>0 ")
  File "D:\Anaconda3\lib\site-packages\pyspark\sql\context.py", line 360, in sql
    return self.sparkSession.sql(sqlQuery)
  File "D:\Anaconda3\lib\site-packages\pyspark\sql\session.py", line 543, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "D:\spark-2.0.2-bin-hadoop2.7\python\lib\py4j-0.10.3-src.zip\py4j\java_gateway.py", line 1133, in __call__
  File "D:\Anaconda3\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "D:\spark-2.0.2-bin-hadoop2.7\python\lib\py4j-0.10.3-src.zip\py4j\protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o24.sql.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 25.0 failed 1 times, most recent failure: Lost task 4.0 in stage 25.0 (TID 3881, localhost): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$11: (vector) => vector)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.foreach(WholeStageCodegenExec.scala:368)
    at org.apache.spark.sql.hive.SparkHiveWriterContainer.writeToFile(hiveWriterContainers.scala:185)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:131)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:131)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException: key not found: 0.006578947368421052
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:59)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:59)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:324)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:323)
    at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:323)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:317)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:362)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:362)
    ... 14 more

1 answer:

Answer 0 (score: 0):

I don't know your features in detail, so I assume you have 52 separate columns and you are combining them with VectorAssembler.

First, check that you can actually assemble all of the features. If you have both vector and raw features, you cannot merge them.

Then check whether there is an indexer (StringIndexer, VectorIndexer) among your feature stages. If so, make sure that every possible value occurs in both the training and test sets: perhaps you have positive/negative labels only in the test set.

This question may be a duplicate of this answer.