LabeledPoint for classification in Spark

Date: 2016-01-05 21:09:37

Tags: scala apache-spark

I'm trying to run several classifiers on this telecom dataset to predict customer churn. So far I've loaded the dataset into a Spark RDD, but I'm not sure how to designate one column as the label — in this case the last column. I'm not asking for code so much as a brief explanation of how RDDs and LabeledPoint work together. I've looked at the examples in the official Spark GitHub repository, but they seem to use the libsvm format.

Question: how does LabeledPoint work, and how do I specify my label?

My code so far, in case it helps:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD, LogisticRegressionWithLBFGS, LogisticRegressionModel, NaiveBayes, NaiveBayesModel}

object Churn {
   def main(args: Array[String]): Unit = {
    //setting spark context
    val conf = new SparkConf().setAppName("Churn")
    val sc = new SparkContext(conf)
    //loading and mapping data into RDD
    val csv = sc.textFile("file:///filename.csv")
    val data = csv.map(line => line.split(",").map(elem => elem.trim))
    /* computer learns which points are features and labels here */
}
}

The dataset looks like this:

State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
KS,128,415,382-4657,no,yes,25,265.100000,110,45.070000,197.400000,99,16.780000,244.700000,91,11.010000,10.000000,3,2.700000,1,False.
OH,107,415,371-7191,no,yes,26,161.600000,123,27.470000,195.500000,103,16.620000,254.400000,103,11.450000,13.700000,3,3.700000,1,False.
NJ,137,415,358-1921,no,no,0,243.400000,114,41.380000,121.200000,110,10.300000,162.600000,104,7.320000,12.200000,5,3.290000,0,False.

1 Answer:

Answer 0 (score: 1):

You need to decide what your features are: for example, the phone number is not a feature, so some columns get dropped. Then you'll want to convert the string columns to numbers. Yes, you could do that with ML transformers, but in this case it would be overkill. I'd do it like this (showing the logic on a single row of data):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val line = "NJ,137,415,358-1921,no,no,0,243.400000,114,41.380000,121.200000,110,10.300000,162.600000,104,7.320000,12.200000,5,3.290000,0,False"
val arrl = line.split(",").map(_.trim)
val mr = Map("no" -> "0.0", "yes" -> "1.0", "False" -> "0.0", "True" -> "1.0")  // categorical strings to numeric strings
val stringvec = Array( arrl(2), mr(arrl(4)), mr(arrl(5)) ) ++ arrl.slice(6, 20)  // Area Code, Int'l Plan, VMail Plan, then the numeric columns

val label = mr(arrl(20)).toDouble
val vec = stringvec.map(_.toDouble)
LabeledPoint( label, Vectors.dense(vec))

So, to answer your question: a labeled point is the target variable (here the last column, as a double — whether the customer churned) together with the vector of numeric (double) features describing the customer (here, vec).