Register a UDF from an external Java jar class in PySpark

Date: 2018-09-06 22:05:56

Tags: java python apache-spark pyspark apache-spark-sql

I have a Java jar containing a function, for example:

package com.test.oneid;

import java.io.IOException;

public class my_class {

    public static void main(String args[]) {

    }

    public static int add(int x) throws IOException {
        try {
            return (x + 2);
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}

In my Spark session, I include this jar with the --jars option.
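
Equivalently, the jar can be attached when the session is built (a minimal sketch; /path/to/my-udfs.jar is a hypothetical path):

from pyspark.sql import SparkSession

# Attach the jar at session creation; /path/to/my-udfs.jar is a
# hypothetical path to the jar containing com.test.oneid.my_class.
spark = SparkSession.builder \
    .appName("java-udf-demo") \
    .config("spark.jars", "/path/to/my-udfs.jar") \
    .getOrCreate()
sc = spark.sparkContext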

from pyspark.sql.types import *
from pyspark.sql.functions import *
from py4j.java_gateway import java_import

# Import the Java class into the JVM view and create an instance of it
java_import(sc._gateway.jvm, "com.test.oneid.my_class")
my_func = sc._gateway.jvm.my_class()

def add_udf(s):
    x = my_func.add(s)
    return x

add_udf(10)
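
Since add is static, the same call also works directly on the imported class through the Py4J gateway, without creating an instance (a minimal sketch in the same session):

# Static methods are reachable on the JavaClass returned by java_import:
result = sc._gateway.jvm.my_class.add(10)
print(result)  # prints 12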

Up to this point everything works, but when I try to register it as a UDF for use in Spark SQL or with DataFrames, I get the following error:

>>> spark.udf.register('my_udf', add_udf, IntegerType())
Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/cloudpickle.py", line 147, in dump
    return Pickler.dump(self, obj)
  File "/data1/anaconda/anaconda2/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/data1/anaconda/anaconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/data1/anaconda/anaconda2/lib/python2.7/pickle.py", line 554, in save_tuple
    save(element)
  File "/data1/anaconda/anaconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/hdp/current/spark2-client/python/pyspark/cloudpickle.py", line 248, in save_function
    self.save_function_tuple(obj)
  File "/usr/hdp/current/spark2-client/python/pyspark/cloudpickle.py", line 296, in save_function_tuple
    save(f_globals)
  File "/data1/anaconda/anaconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/data1/anaconda/anaconda2/lib/python2.7/pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/data1/anaconda/anaconda2/lib/python2.7/pickle.py", line 692, in _batch_setitems
    save(v)
  File "/data1/anaconda/anaconda2/lib/python2.7/pickle.py", line 306, in save
    rv = reduce(self.proto)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
    format(target_id, ".", name, value))
Py4JError: An error occurred while calling o57.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
        at py4j.Gateway.invoke(Gateway.java:272)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:745)

I'm not sure what this error means; any help is appreciated.

Edit: since the error above isn't helpful, I also tried the following.

>>> sqlContext.registerJavaFunction("udf", "com.test.oneid.my_class.add_udf")
18/09/06 18:20:56 ERROR UDFRegistration: Can not load class com.test.oneid.my_class.add_udf, please make sure it is on the classpath
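
For context, as far as I understand registerJavaFunction takes the fully qualified name of a Java class implementing one of the org.apache.spark.sql.api.java.UDF1..UDF22 interfaces rather than a static method, so the call would have to look roughly like this (a sketch; com.test.oneid.MyAddUdf is a hypothetical wrapper class):

from pyspark.sql.types import IntegerType

# Sketch only: com.test.oneid.MyAddUdf would be a hypothetical Java class
# implementing org.apache.spark.sql.api.java.UDF1<Integer, Integer>.
sqlContext.registerJavaFunction("my_udf", "com.test.oneid.MyAddUdf", IntegerType())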

I need to register it so that I can use it in a CASE statement in Spark SQL, but the following didn't help in that regard: Register UDF to SqlContext from Scala to use in PySpark
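
For what it's worth, the intended usage once the function is registered looks roughly like this (a sketch; t and col are hypothetical table and column names):

# Intended usage after registration; t and col are hypothetical names.
spark.sql("""
    SELECT CASE WHEN my_udf(col) > 10 THEN 'big' ELSE 'small' END AS bucket
    FROM t
""").show()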

0 Answers
