Question

Can you help me optimize this code and get it working? Here is the original data:

+--------------------+-------------+
|       original_name|medicine_name|
+--------------------+-------------+
|         Venlafaxine|  Venlafaxine|
|    Lacrifilm 5mg/ml|    Lacrifilm|
|    Lacrifilm 5mg/ml|         null|
|         Venlafaxine|         null|
|Vitamin D10,000IU...|         null|
|         paracetamol|         null|
|            mucolite|         null|

I would like to get data like this:

+--------------------+-------------+
|       original_name|medicine_name|
+--------------------+-------------+
|         Venlafaxine|  Venlafaxine|
|    Lacrifilm 5mg/ml|    Lacrifilm|
|    Lacrifilm 5mg/ml|    Lacrifilm|
|         Venlafaxine|  Venlafaxine|
|Vitamin D10,000IU...|         null|
|         paracetamol|         null|
|            mucolite|         null|

Here is the code:

distinct_df = spark.sql("select distinct medicine_name as medicine_name from medicine where medicine_name is not null")
distinct_df.createOrReplaceTempView("distinctDF")

def getMax(num1, num2):
    pmax = (num1>=num2)*num1+(num2>num1)*num2
    return pmax

def editDistance(s1, s2):
    ed = (getMax(length(s1), length(s2)) - levenshtein(s1,s2))/
          getMax(length(s1), length(s2))
    return ed

editDistanceUdf = udf(lambda x,y: editDistance(x,y), FloatType())

def getSimilarity(str):
    res = spark.sql("select medicine_name, editDistanceUdf('str', medicine_name) from distinctDf where editDistanceUdf('str', medicine_name)>=0.85 order by 2")
    res['medicine_name'].take(1)
    return res

getSimilarityUdf = udf(lambda x: getSimilarity(x), StringType())
res_df = df.withColumn('m_name', when((df.medicine_name.isNull)|(df.medicine_name.=="null")),getSimilarityUdf(df.original_name)
.otherwise(df.medicine_name)).show()

Now I am getting this error:

command_part = REFERENCE_TYPE + parameter._get_object_id()
AttributeError: 'function' object has no attribute '_get_object_id'

Answer 1

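Your code has a number of issues:

You cannot use SparkSession or distributed objects inside a udf, so getSimilarity simply cannot work. If you want to compare objects like this, you have to join.
If length and levenshtein come from pyspark.sql.functions, they cannot be used inside a UserDefinedFunction. They are designed to generate SQL expressions, mapping from *Column to Column.
Column.isNull is a method, not a property, so it should be called:
```
df.medicine_name.isNull()
```
The following
```
df.medicine_name.=="null"
```
is not syntactically valid Python (it looks like a Scala calque) and will raise a SyntaxError.
Even if SparkSession access were allowed inside a UserDefinedFunction, this would not be a valid substitution, because 'str' is passed as a literal string rather than the value of the argument:
```
spark.sql("select medicine_name, editDistanceUdf('str', medicine_name) from distinctDf where editDistanceUdf('str', medicine_name)>=0.85 order by 2")
```
You should use string formatting methods:
```
spark.sql("select medicine_name, editDistanceUdf({str}, medicine_name) from distinctDf where editDistanceUdf({str}, medicine_name)>=0.85 order by 2".format(str=str))
```
There may be other problems as well, but since you didn't provide an MCVE, anything else would be pure guessing.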
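To illustrate the point about length and levenshtein: if the similarity really had to run inside a udf, it would need to be written in plain Python rather than with pyspark.sql.functions. The following is only a minimal sketch of that idea. The names edit_distance and similarity are hypothetical, and treating two empty strings as identical is an assumption; the ratio itself is the one the question computes.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string on the inner loop
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(s1: str, s2: str) -> float:
    """(max_len - edit_distance) / max_len, as in the question's editDistance."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (longest - edit_distance(s1, s2)) / longest
```

A function like this only touches plain Python strings, so it could be wrapped with udf(similarity, FloatType()) and applied to two columns of a joined DataFrame. Note that similarity("Lacrifilm 5mg/ml", "Lacrifilm") evaluates to 9/16, which is below the 0.85 threshold used in the question.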

Once you fix the smaller mistakes, you have two choices:

Use crossJoin:
```
combined = df.alias("left").crossJoin(spark.table("distinctDf").alias("right"))
```
Then apply a udf, filter, and use one of the methods listed in "Find maximum row per group in Spark DataFrame" to keep the closest match in each group.
Use built-in approximate matching tools, as explained in "Efficient string matching in Apache Spark".
