How to use HyperLogLogPlus in pyspark

Asked: 2019-10-18 00:14:43

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

I am trying to compute approximate distinct counts in pyspark by leveraging the HyperLogLogPlus implementation. However, when I try the following:

from py4j.java_gateway import java_import
import numpy as np
df = spark.createDataFrame([("4G_band1800", 12.0, 18.0, "TRUE"),
                            ("4G_band1800", 12.0, 18.0, "FALSE"),
                            ("4G_band1801", np.nan, 18.0, "TRUE"),
                            ("4G_band1801", None, 18.0, "TRUE")],
                            ("band", "A3", "A5", "status"),3)
java_import(sc._gateway.jvm, "com.clearspring.analytics.stream.cardinality.HyperLogLogPlus")

hp = sc._gateway.jvm.HyperLogLogPlus(4, 16)


def mapper(partition):
    # offer every value in the partition to the shared HLL sketch
    for x in partition:
        if x:
            hp.offer(x)
            # do something else
    return [hp]

def reducer(hp1, hp2):
    # merge the second sketch into the first
    hp1.addAll(hp2)
    return hp1

a = df.rdd.mapPartitions(mapper).reduce(reducer)
a.cardinality()

I get this error:

py4j.protocol.Py4JError: An error occurred while calling o39.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

After some digging, it seems the error comes from pickle being unable to serialize hp (a py4j JavaObject captured in the mapper/reducer closures). I tried pyspark.sql.functions.approx_count_distinct(col, rsd=None), but that operates on DataFrame columns, whereas I need something that works inside mapPartitions. Is there any way to use a Java or Scala class directly inside a map method? Thanks!
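
For completeness, this is roughly what the built-in alternatives look like: a minimal sketch using approx_count_distinct and RDD.countApproxDistinct (both backed by HyperLogLog++ internally), counting distinct values of the band column from the example above. Neither lets me offer arbitrary values to a sketch inside mapPartitions, which is what I am after:

from pyspark.sql import functions as F

# DataFrame API: approximate distinct count of a column
df.agg(F.approx_count_distinct("band", rsd=0.05).alias("distinct_bands")).show()

# RDD API: approximate distinct count of the values of one column
distinct_bands = df.select("band").rdd.map(lambda row: row[0]).countApproxDistinct(relativeSD=0.05)
print(distinct_bands)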

0 Answers:

There are no answers yet.