kmeans pyspark org.apache.spark.SparkException: Job aborted due to stage failure

Time: 2020-07-17 10:15:03

Tags: apache-spark pyspark k-means

I want to run k-means on my DataFrame base (6.7 million rows and 22 variables).

base.dtypes

 ('anonimisation2', 'double'),
 ('anonimisation3', 'double'),
 ('anonimisation4', 'double'),
 ('anonimisation5', 'double'),
 ('anonimisation6', 'double'),
 ('anonimisation7', 'double'),
 ('anonimisation8', 'double'),
 ('anonimisation9', 'double'),
 ('anonimisation10', 'double'),
 ('anonimisation11', 'double'),
 ('anonimisation12', 'double'),
 ('anonimisation13', 'double'),
 ('anonimisation14', 'double'),
 ('anonimisation15', 'double'),
 ('anonimisation16', 'double'),
 ('anonimisation17', 'double'),
 ('anonimisation18', 'double'),
 ('anonimisation19', 'double'),
 ('anonimisation20', 'double'),
 ('anonimisation21', 'double'),
 ('anonimisation22', 'double')]

I read that I should use this code:

from pyspark.mllib.linalg import Vectors  # note: Vectors imported from the old mllib package

def transData(base):
    # pack every column except the last one into a dense vector column named 'features'
    return base.rdd.map(lambda r: [Vectors.dense(r[:-1])]).toDF(['features'])

transformed = transData(base)
transformed.show(5, False)

Then I wrote this:

from pyspark.ml.clustering import KMeans

kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(transformed)

And I got this error:

IllegalArgumentException: 'requirement failed: Column features must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'

I don't know what to do about this. If you need more information, please ask. Thanks!

I also tried doing it in plain Python with pandas, but ran into problems there as well.

1 Answer:

Answer 0 (score: 0):

Use from pyspark.ml.linalg import Vectors instead of from pyspark.mllib.linalg import Vectors. The DataFrame-based estimators in pyspark.ml (including KMeans) expect the ml vector type; the old mllib vectors serialize to a struct that prints identically in the error message even though it is a different user-defined type, which is why the "must be of type X but was actually of type X" message looks contradictory.
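For completeness, here is a minimal sketch of the corrected pipeline, assuming base is the DataFrame described in the question. It keeps the question's r[:-1] slice (which drops the last of the 22 columns) and only swaps in the pyspark.ml imports:

from pyspark.ml.linalg import Vectors      # ml, not mllib
from pyspark.ml.clustering import KMeans

def transData(base):
    # Same logic as in the question, but building pyspark.ml dense vectors,
    # which is the type the DataFrame-based KMeans estimator accepts.
    return base.rdd.map(lambda r: [Vectors.dense(r[:-1])]).toDF(['features'])

transformed = transData(base)

kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(transformed)

predictions = model.transform(transformed)  # adds a 'prediction' column with the cluster id
predictions.show(5, False)
print(model.clusterCenters())               # the two cluster centres

As a side note, pyspark.ml.feature.VectorAssembler(inputCols=..., outputCol='features') assembles the chosen input columns into a features vector directly on the DataFrame, without the round-trip through the RDD API, which tends to be simpler and faster on 6.7 million rows.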
