Why do I get an empty Map?

Date: 2018-01-06 08:10:10

Tags: scala apache-spark

Source data

scala> dataframe.show
+--------------------+--------------------+
|                moid|            features|
+--------------------+--------------------+
|0031222c889642608...|(5,[0,1,2,3,4],[0...|
|0013103228494a7b9...|(5,[0,2,3,4],[0.1...|
|003e1996e51a435e8...|(5,[0,2,3,4],[0.2...|
|0044b270064342ac8...|(5,[0,1,2,3,4],[0...|
|00b36594a2a644f09...|(5,[0,1,2,3,4],[0...|
|00e8387be566492c9...|(5,[0,1,2,3,4],[0...|
|01158f88e19148b39...|(5,[0,1,3,4],[0.1...|
|011952d6c52b43019...|(5,[0,1,2,3,4],[0...|
|0156b479932b449bb...|(5,[0,1,2,3,4],[0...|
|015fb90315cc43b19...|(5,[0,1,2,3,4],[0...|
|0186aa87f3f04d1d8...|(5,[0,1,2,4],[0.2...|
|019bc8d4096e41ad8...|(5,[0,1,3,4],[0.4...|
|0224ed4d3d5d4a3ca...|(5,[0,1,2,3,4],[0...|
|0279fd0bb2f2458ba...|(5,[0,1,2,3,4],[0...|
|02847207432d4de9a...|(5,[0,1,2,4],[0.2...|
|028715c44bac423f8...|(5,[1,2,4],[0.243...|
|02ccf2c118a046e69...|(5,[1,2,4],[0.243...|
|005a55b9a230452b9...|(5,[0,2,3,4],[0.2...|
|02e02d27ce13448db...|(5,[0,1,2,3,4],[0...|
|013150a3c5fc42d88...|(5,[0,1,2,4],[0.1...|
+--------------------+--------------------+

scala> dataframe.printSchema
root
 |-- moid: string (nullable = false)
 |-- features: vector (nullable = true)


The vector type is org.apache.spark.ml.linalg.SparseVector.

I want to compute the cosine similarity between every pair of rows, keep the ten most similar rows for each row, and finally collect the results into top_sim_map.
val top_sim_map = Map[String,Array[(String,Double)]]()

Here is what I did:

def cosineSimilarity(vectorA: org.apache.spark.ml.linalg.SparseVector, vectorB: org.apache.spark.ml.linalg.SparseVector):Double = {
    var dotProduct = 0.0
    var normA = 0.0
    var normB = 0.0
    var index = vectorA.size - 1
    for (i <- 0 to index) {
      dotProduct += vectorA(i) * vectorB(i)
      normA += Math.pow(vectorA(i), 2)
      normB += Math.pow(vectorB(i), 2)
    }
    (dotProduct / (Math.sqrt(normA) * Math.sqrt(normB)))
  }
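
The function above looks up every index with vectorA(i), which for a SparseVector means a search per element, and it returns NaN when either vector is all zeros. As a minimal sketch only (the name sparseCosine and the zero-norm fallback are assumptions, not part of the original code), the same quantity can be computed from the indices and values arrays that SparseVector exposes:

```scala
// Hypothetical helper: cosine similarity over (sorted indices, values) pairs,
// the same layout a SparseVector exposes through .indices and .values.
def sparseCosine(idxA: Array[Int], valA: Array[Double],
                 idxB: Array[Int], valB: Array[Double]): Double = {
  var i = 0
  var j = 0
  var dot = 0.0
  // Walk both sorted index arrays; only shared indices add to the dot product.
  while (i < idxA.length && j < idxB.length) {
    if (idxA(i) == idxB(j)) { dot += valA(i) * valB(j); i += 1; j += 1 }
    else if (idxA(i) < idxB(j)) i += 1
    else j += 1
  }
  val normA = math.sqrt(valA.map(v => v * v).sum)
  val normB = math.sqrt(valB.map(v => v * v).sum)
  // Assumption: return 0.0 rather than NaN when either vector has zero norm.
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}
```

With two SparseVector values it would be called as sparseCosine(vectorA.indices, vectorA.values, vectorB.indices, vectorB.values).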


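// Imports the snippet needs; the mutable Map is an assumption, since += on a val requires it.
import org.apache.spark.sql.Row
import scala.collection.mutable.Map

val rddData = dataframe.rdd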
val rddDataLocal = rddData.collect()
val br_rddDataLocal = spark.sparkContext.broadcast(rddDataLocal)  
val top_sim_map = Map[String,Array[(String,Double)]]()
rddData.foreach((r:Row)=>{
      val moid = r.getString(0)
      val vec_a = r.getAs[org.apache.spark.ml.linalg.SparseVector](1)
      var simArr:Array[(String,Double)] = Array(("0",0.0),("0",0.0),("0",0.0),("0",0.0),("0",0.0),("0",0.0),("0",0.0),("0",0.0),("0",0.0),("0",0.0))
      br_rddDataLocal.value.foreach((row_tg:Row)=>{
        val num_b:String = row_tg.getString(0)
        val vec_b = row_tg.getAs[org.apache.spark.ml.linalg.SparseVector](1)
        val sim:Double = cosineSimilarity(vec_a,vec_b)
        simArr = simArr.map((t)=>{
          if(simArr.min._2>sim) (num_b,sim) else t
        })
      })
      top_sim_map += {moid->simArr}
    })
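
Separately from the empty-map problem, the update step inside the loop probably does not keep the ten best matches: when its condition holds, simArr.map overwrites every slot with the same pair, and simArr.min orders the tuples by the id string first, so ._2 is not necessarily the smallest similarity. A minimal sketch of one way to select the top k pairs, using a hypothetical helper topK and made-up sample values:

```scala
// Hypothetical helper: keep the k highest-scoring (id, similarity) pairs.
def topK(candidates: Seq[(String, Double)], k: Int): Array[(String, Double)] =
  candidates
    .sortBy { case (_, sim) => -sim } // highest similarity first
    .take(k)
    .toArray

// Made-up sample scores, for illustration only.
val sims = Seq(("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7))
val top2 = topK(sims, 2)
```

Sorting the whole candidate list is the simplest option; for very large candidate sets a bounded priority queue would avoid the full sort.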

My question is: why is top_sim_map empty?

scala> top_sim_map.size
res36: Int = 0
scala> top_sim_map.isEmpty
res37: Boolean = true
scala> top_sim_map.take(100).foreach(println)
scala> 
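
A likely explanation, assuming Spark's usual closure semantics: the function passed to rddData.foreach is serialized and run on the executors, so each task updates its own deserialized copy of top_sim_map and the driver's map never changes. The sketch below returns the per-row results from a map transformation and brings them back with collectAsMap instead. The local session, the three-row stand-in for dataframe, and the names cosine and topSimMap are assumptions made for illustration.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[2]").appName("top-sim-sketch").getOrCreate()

// Tiny stand-in for the question's dataframe (ids and values are made up).
val dataframe = spark.createDataFrame(Seq(
  ("m1", Vectors.sparse(3, Array(0, 1), Array(1.0, 1.0))),
  ("m2", Vectors.sparse(3, Array(0, 1), Array(2.0, 2.0))),
  ("m3", Vectors.sparse(3, Array(2), Array(1.0)))
)).toDF("moid", "features")

// Cosine similarity; returns 0.0 when either vector has zero norm (a design choice).
def cosine(a: Vector, b: Vector): Double = {
  val dot = (0 until a.size).map(i => a(i) * b(i)).sum
  val norms = Vectors.norm(a, 2.0) * Vectors.norm(b, 2.0)
  if (norms == 0.0) 0.0 else dot / norms
}

// Collect the (id, vector) pairs once and broadcast them, as in the question.
val local = dataframe.rdd.map(r => (r.getString(0), r.getAs[Vector](1))).collect()
val brLocal = spark.sparkContext.broadcast(local)

// Each task returns its (moid, top matches) pair; collectAsMap gathers them on the driver.
val topSimMap: scala.collection.Map[String, Array[(String, Double)]] =
  dataframe.rdd.map { r =>
    val moid = r.getString(0)
    val vecA = r.getAs[Vector](1)
    val top = brLocal.value
      .map { case (other, vecB) => (other, cosine(vecA, vecB)) }
      .sortBy { case (_, sim) => -sim } // highest similarity first
      .take(10)
    (moid, top)
  }.collectAsMap()
```

As in the question, each row is also compared with itself, so a row normally appears as its own best match. collectAsMap pulls every result to the driver, which is only reasonable when the collected data fits in driver memory, as the question's rddData.collect() already assumes.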

0 answers:

No answers yet.