PySpark - Reading data from MongoDB with the mongo-spark connector fails with a MongoQueryException about exceeding the document size

Asked: 2016-07-11 03:38:04

Tags: mongodb apache-spark apache-spark-sql spark-dataframe pyspark-sql

I'm trying to read data from MongoDB using MongoDB's new Spark connector. I provide the database and collection details to the Spark conf object when starting the application, and then read into a DataFrame with the following code:

reader = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
df = reader.options(partitioner='MongoSplitVectorPartitioner').load()
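
For reference, this is roughly how the database and collection details are supplied through the Spark conf at startup. This is only a minimal sketch: spark.mongodb.input.uri is the connector's documented configuration key, but the host, database and collection names below are placeholders.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Placeholder host, database and collection; the real values are set at submit time
conf = (SparkConf()
        .setAppName("mongo-to-parquet")
        .set("spark.mongodb.input.uri",
             "mongodb://xx.xxx.xx.xx:27018/mydb.mycollection"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)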

Then I write this DataFrame out to a Parquet file:

df.write.parquet("/destination/path")

It kicks off a job with many tasks, and all of them complete successfully except the last one, which causes the whole write job to fail. I get an exception about the document size exceeding 16 MB:

org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:269)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.mongodb.MongoQueryException: Query failed with error code 16493 and error message 'Tried to create string longer than 16MB' on server xx.xxx.xx.xx:27018
at com.mongodb.connection.ProtocolHelper.getQueryFailureException(ProtocolHelper.java:131)
at com.mongodb.connection.GetMoreProtocol.execute(GetMoreProtocol.java:96)
at com.mongodb.connection.GetMoreProtocol.execute(GetMoreProtocol.java:49)
at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:159)
at com.mongodb.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:286)
at com.mongodb.connection.DefaultServerConnection.getMore(DefaultServerConnection.java:251)
at com.mongodb.operation.QueryBatchCursor.getMore(QueryBatchCursor.java:218)
at com.mongodb.operation.QueryBatchCursor.hasNext(QueryBatchCursor.java:103)
at com.mongodb.MongoBatchCursorAdapter.hasNext(MongoBatchCursorAdapter.java:46)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:118)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:110)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:110)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1801)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)

I don't maintain the Mongo database, so I'm not sure why documents larger than 16 MB exist in the first place; perhaps GridFS could have been used. Is there a way I can skip processing the bad records?
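
One way I could sanity-check whether any oversized documents actually exist would be to scan the collection directly with pymongo and measure the encoded BSON size of each document. This is only a rough sketch under the assumption that I have direct read access; the connection string, database and collection names are placeholders, and the same server-side error could in principle surface during this scan as well.

from pymongo import MongoClient
import bson

# Placeholder connection string, database and collection names
client = MongoClient("mongodb://xx.xxx.xx.xx:27018")
collection = client["mydb"]["mycollection"]

# Report documents whose encoded BSON size approaches the 16 MB limit
for doc in collection.find():
    size = len(bson.BSON.encode(doc))
    if size > 15 * 1024 * 1024:
        print(doc.get("_id"), size)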

I tried filtering with a UDF, but it failed with the same error:

import sys
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

# UDF returning the in-memory size of the column value in bytes
size_filter_udf = udf(lambda entry: sys.getsizeof(entry), IntegerType())

filtered_df = df.where(size_filter_udf(col("caseNotes")) < 16000000)

filtered_df.write.parquet("/destination/path")
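
For what it's worth, the built-in length() function on the column would avoid the Python UDF, but since Spark only applies the filter after the rows have already been pulled from MongoDB, I suspect it would still fail with the same getMore error. A sketch of that variant, filtering on the same caseNotes column:

from pyspark.sql.functions import length, col

# Filter on string length with a built-in function instead of a Python UDF;
# the rows still have to be read from MongoDB before this filter can run
filtered_df = df.where(length(col("caseNotes")) < 16000000)
filtered_df.write.parquet("/destination/path")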

0 Answers:

No answers yet