I am trying to do an approximate distinct count in PySpark by leveraging the HyperLogLogPlus implementation. However, if I try this:
    import numpy as np
    from py4j.java_gateway import java_import

    df = spark.createDataFrame(
        [("4G_band1800", 12.0, 18.0, "TRUE"),
         ("4G_band1800", 12.0, 18.0, "FALSE"),
         ("4G_band1801", np.nan, 18.0, "TRUE"),
         ("4G_band1801", None, 18.0, "TRUE")],
        ("band", "A3", "A5", "status"), 3)

    java_import(sc._gateway.jvm,
                "com.clearspring.analytics.stream.cardinality.HyperLogLogPlus")
    hp = sc._gateway.jvm.HyperLogLogPlus(4, 16)

    def mapper(partition):
        # mapPartitions passes an iterator of rows and expects an iterable back
        for x in partition:
            if x:
                hp.offer(x)
                # do something else
        yield hp

    def reducer(hp1, hp2):
        hp1.addAll(hp2)
        return hp1

    a = df.rdd.mapPartitions(mapper).reduce(reducer)
    a.cardinality()
I get this error:
py4j.protocol.Py4JError: An error occurred while calling o39.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
After some digging, it seems the error occurs because pickle cannot serialize hp.
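The same class of failure can be reproduced without Spark. A Py4J JavaObject is a thin proxy over a live socket connection to the driver JVM, and pickle refuses any object wrapping a live OS resource. As a minimal stand-in (assumption: a threading.Lock is used here purely to illustrate the failure mode, it is not a Py4J object):

```python
import pickle
import threading

# A lock wraps OS-level state, much like a Py4J proxy wraps a gateway
# socket; neither can be turned into a byte stream by pickle.
lock = threading.Lock()

try:
    pickle.dumps(lock)
except TypeError as exc:
    print(f"pickle failed: {exc}")
```

This is why referencing hp inside mapper fails: Spark pickles the closure (including the captured hp proxy) before shipping it to executors.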
I tried pyspark.sql.functions.approx_count_distinct(col, rsd=None), but that operates on DataFrame columns, while I need something that works inside mapPartitions. Is there any way to use a Java or Scala class directly inside a map function?
Thanks!