How do I find n-grams in email IDs?

Asked: 2018-11-28 05:23:28

Tags: scala apache-spark user-defined-functions apache-spark-mllib apache-spark-ml

I need to create a custom feature transformer in Spark with Scala. For example, I have the following DataFrame:

+------------------------+
|              email_list|
+------------------------+
|  testmail1115@gmail.com|
|    mavenmaven@mlail.com|
|dnd.7899334622@gmail.com|
+------------------------+

If I use the transformer, which converts an input array of strings into an array of n-grams, the result looks like this:

+------------------------+--------------------+
|              email_list|              ngrams|
+------------------------+--------------------+
|  testmail1115@gmail.com|[t e, e s, s t, t...|
|    mavenmaven@mlail.com|[m a, a v, v e, e...|
|dnd.7899334622@gmail.com|     [d n, n d, d...|
+------------------------+--------------------+
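To make the expected output concrete, here is a minimal plain-Scala sketch (no Spark) of what "character bigrams of the part before the @" means. The helper name `charNgrams` is hypothetical and only for illustration; it joins the characters with a space, which is how the n-grams appear in the table above.

```scala
// Hypothetical helper: character n-grams of the local part of an email address.
def charNgrams(email: String, n: Int): List[String] = {
  val local = email.takeWhile(_ != '@')   // text before the first '@'
  local.map(_.toString)                   // one single-character string per char
    .sliding(n)                           // windows of n consecutive characters
    .filter(_.size == n)                  // drop the short window when local.length < n
    .map(_.mkString(" "))                 // join the characters with a space
    .toList
}

println(charNgrams("testmail1115@gmail.com", 2).distinct)
```

Calling `.distinct` on the result removes repeated n-grams within a single address, such as the bigram "1 1", which occurs twice in "testmail1115".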

How can I display the distinct n-grams with the code below, rather than a pattern or an array?

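import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.functions.{col, split}

// Keep the part before "@", then split it into single characters.
val emailD1F = emailDF
  .withColumn("email_split", split(col("email_list"), "@").getItem(0))
  .withColumn("email_split", split(col("email_split"), ""))
// NGram must read the array column created above ("email_split"), not "col1".
val ngram = new NGram().setN(2).setInputCol("email_split").setOutputCol("ngrams")

val ngramDataFrame = ngram.transform(emailD1F)
ngramDataFrame.show()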
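One possible way to list the distinct n-grams is to explode the `ngrams` array into one row per n-gram and then call `distinct()`. The sketch below is self-contained and rests on several assumptions: a local `SparkSession`, the sample addresses from the question, and the hypothetical column names `local_part` and `ngram`. It splits on the lookaround pattern `(?<=.)(?=.)` instead of `""`, because splitting on an empty pattern can leave a trailing empty string in some Spark versions, which would produce an n-gram ending in a blank.

```scala
import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split}

// Assumption: a local session, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("email-ngrams").getOrCreate()
import spark.implicits._

// Sample data taken from the question.
val emailDF = Seq(
  "testmail1115@gmail.com",
  "mavenmaven@mlail.com",
  "dnd.7899334622@gmail.com"
).toDF("email_list")

// Take the text before "@" and split it between every pair of adjacent characters.
val chars = emailDF
  .withColumn("local_part", split(col("email_list"), "@").getItem(0))
  .withColumn("email_split", split(col("local_part"), "(?<=.)(?=.)"))

val ngram = new NGram().setN(2).setInputCol("email_split").setOutputCol("ngrams")

// One row per n-gram, then drop duplicates across all addresses.
val distinctNgrams = ngram.transform(chars)
  .select(explode(col("ngrams")).as("ngram"))
  .distinct()

distinctNgrams.show(false)
```

To keep the distinct n-grams per address rather than across the whole column, select `col("email_list")` alongside the exploded column before calling `distinct()`.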

0 answers:

No answers yet.