I am trying to test a UDF (a Spark Java function). The code works fine against a Dataset in the application itself, but not in a JUnit test. It appears to be a type error involving the vector struct; the error is:
Caused by: java.lang.ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be cast to org.apache.spark.ml.linalg.Vector
Which Vector classes should I import to go with VectorUDT()? I can't find them.
UDF header:
public class CalculateM implements UDF2<Vector,Vector, Double> {
Test:
@Test
public void udfCalculateMTest() {
    List<Row> data = Arrays.asList(
        RowFactory.create(
            Vectors.dense(new double[]{4.0, 5.0}),
            Vectors.dense(new double[]{4.0, 7.0})
        )
    );
    StructType schema = new StructType(new StructField[]{
        new StructField("v1", new VectorUDT(), false, Metadata.empty()),
        new StructField("v2", new VectorUDT(), false, Metadata.empty())
    });
    spark.createDataFrame(data, schema).createOrReplaceTempView("df");
    spark.sqlContext().udf().registerJava("corr", CalculateM.class.getName(), DataTypes.DoubleType);
    Row result = spark.sql("SELECT corr(v1,v2) from df").head();
    Assert.assertEquals(2, result.getDouble(0), 1.0e-6);
}
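For context: the ClassCastException in the message suggests the test mixes Spark's two vector packages. `org.apache.spark.mllib.linalg` is the older RDD-based API, while `org.apache.spark.ml.linalg` is the DataFrame-based one that `org.apache.spark.ml.linalg.VectorUDT` expects. A sketch of the imports that would keep everything in the `ml` package (assuming the test currently imports `Vectors` from `mllib`):

```java
// Import everything from the ml (DataFrame-based) package,
// not the mllib (RDD-based) package:
import org.apache.spark.ml.linalg.Vector;    // type used in the UDF2 signature
import org.apache.spark.ml.linalg.Vectors;   // Vectors.dense(...) then builds ml DenseVector
import org.apache.spark.ml.linalg.VectorUDT; // matches ml.linalg.Vector in the schema
```

With these imports, `Vectors.dense(...)` produces an `org.apache.spark.ml.linalg.DenseVector`, which is what a column typed as `new VectorUDT()` from the `ml` package can be cast from, so the `mllib.linalg.DenseVector cannot be cast to ml.linalg.Vector` error should not occur.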