Reading a CSV file using Scala

Date: 2017-12-15 06:02:16

Tags: scala apache-spark dataframe

I am creating the data inside the program (using sc.parallelize) and I am able to read it and apply further transformations and actions to that dataset.

val d = sc.parallelize(Seq(11 -> Seq(21, 51, 61, 111, 112),
                           21 -> Seq(51, 111, 112, 115, 116),
                           31 -> Seq(61, 111, 112, 117, 121),
                           41 -> Seq(31, 111, 112, 117, 122)))

/* d is of type RDD[(Int, Seq[Int])] */

val thes = 2

val r = d
  .flatMapValues(x => x)
  .map(_.swap)
  .groupByKey
  .map(_._2)
  .flatMap(x => expand(x.toSeq))
  .map(_ -> 1)
  .reduceByKey(_ + _)
  .filter(_._2 >= thes)
  .map(_._1)
  .flatMap(x => Seq(x._1 -> x._2, x._2 -> x._1))
  .groupByKey
  .mapValues(_.toArray)

r.toDF().show() /* gives the expected output */
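
For reference, here is a hand-derived trace (not from the original post) of the element types at each step of the pipeline above; ordering inside each step is not guaranteed by Spark:

/* Illustrative type trace, assuming the sample data from sc.parallelize above:

   d                                  RDD[(Int, Seq[Int])]       e.g. (11, Seq(21, 51, 61, 111, 112))
   .flatMapValues(x => x)             RDD[(Int, Int)]            (id, sid), e.g. (11, 21)
   .map(_.swap)                       RDD[(Int, Int)]            (sid, id), e.g. (21, 11)
   .groupByKey                        RDD[(Int, Iterable[Int])]  sid -> every id whose Seq contains it
   .map(_._2)                         RDD[Iterable[Int]]
   .flatMap(x => expand(x.toSeq))     RDD[(Int, Int)]            id pairs that share a sid
   .map(_ -> 1).reduceByKey(_ + _)    RDD[((Int, Int), Int)]     co-occurrence count per pair
   .filter(_._2 >= thes).map(_._1)    RDD[(Int, Int)]            pairs seen in at least `thes` groups
   .flatMap(x => Seq(x._1 -> x._2,
                     x._2 -> x._1))   RDD[(Int, Int)]            both directions of each pair
   .groupByKey.mapValues(_.toArray)   RDD[(Int, Array[Int])]     id -> related ids
*/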

I am now trying to read the same dataset from a file. I am able to read the data, but when applying the transformations and actions (the same ones as above) I am not able to convert from String to Int.

/* input file:
   id,sid
   11,"21,51,61,111,112"
   21,"51,111,112,115,116"
   31,"61,111,112,117,121"
   41,"31,111,112,117,122"
*/

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .load(inputFile)
  .rdd
  .map(x => x.getAs[Int]("id") -> (x.getAs[String]("sid").split(",").toList.toSeq).map(_.toInt))

val thes = 2
/* df is of type RDD[(Int, Seq[Int])] */

val r = df
  .flatMapValues(x => x)
  .map(_.swap)
  .groupByKey
  .map(_._2)
  .flatMap(x => expand(x.toSeq))
  .map(_ -> 1)
  .reduceByKey(_ + _)
  .filter(_._2 >= thes)
  .map(_._1)
  .flatMap(x => Seq(x._1 -> x._2, x._2 -> x._1))
  .groupByKey
  .mapValues(_.toArray)

r.toDF().show()

I am not able to convert from String to Integer during the flatMap operation itself (java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer). I am not sure where the String comes into the picture, since df shows the type RDD[(Int, Seq[Int])].
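
One possible explanation (an assumption on my part, not confirmed in the post): spark-csv with only the header option loads every column as a String, and Row.getAs[Int] is an unchecked cast, so the String only surfaces once the values are actually consumed inside the flatMap. Below is a minimal sketch of reading both fields as strings and parsing them explicitly; the val name parsed is just for illustration, and the column names follow the sample file above.

val parsed = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .load(inputFile)
  .rdd
  .map { row =>
    // Both columns come back as strings, so convert them explicitly.
    row.getAs[String]("id").trim.toInt ->
      row.getAs[String]("sid").split(",").map(_.trim.toInt).toSeq
  }

/* Alternatively, adding .option("inferSchema", "true") would let spark-csv
   infer "id" as an integer column, but "sid" would still arrive as one
   quoted String that has to be split and parsed. */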

def expand(seq: Seq[Int]): Seq[(Int, Int)] =
  if (seq.isEmpty)
    Seq[(Int, Int)]()
  else
    seq.tail.map(x => seq.head -> x) ++ expand(seq.tail)
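
As a quick illustration (not in the original post), expand emits each unordered pair from a sequence exactly once, pairing the head with every later element and then recursing on the tail:

expand(Seq(1, 2, 3))  // Seq((1,2), (1,3), (2,3))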

0 Answers