
Time: 2016-02-01 17:29:17

Tags: scala apache-spark locality-sensitive-hash


D1 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D2 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0
D3 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 1
D4 ... 


    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.IndexedRow
    import org.apache.spark.storage.StorageLevel

    // numPartitions is not defined in the original snippet; 8 is an assumed placeholder
    val numPartitions = 8
    val input = "text.txt"
    val conf = new SparkConf()
    val storageLevel = StorageLevel.MEMORY_AND_DISK
    val sc = new SparkContext(conf)

    // read in an example data set of word embeddings
    val data = sc.textFile(input, numPartitions).map {
      line =>
        val split = line.split(" ")
        val word = split.head
        val features = split.tail.map(_.toDouble)
        (word, features)
    }

    // create a unique id for each word by zipping with the RDD index
    val indexed = data.zipWithIndex.persist(storageLevel)

    // create indexed row matrix where every row represents one word
    val rows = indexed.map {
      case ((word, features), index) =>
        IndexedRow(index, Vectors.dense(features))
    }

What I want to do is use a sparse matrix instead of a dense one. How should I adapt `Vectors.dense(features)`?

1 answer:

Answer 0 (score: 0)

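The equivalent factory method for sparse vectors is Vectors.sparse, which takes the vector size, an array of indices, and a corresponding array of values for the non-zero entries. The method signatures in the cosine-lsh-join-spark library are based on the general Vector class, so it looks like the library will accept either sparse or dense vectors.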
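
As a minimal sketch of how this might look for the 0/1 rows shown in the question: the helper below builds a sparse vector from one input line by keeping only the non-zero positions. The name `parseSparse` is hypothetical and not part of Spark or cosine-lsh-join-spark, and the sketch assumes the first token of each line is the row id and the remaining tokens are numeric.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical helper (not a Spark or cosine-lsh-join-spark API):
// parses one line such as "D1 1 1 0 1 ..." into (id, sparse vector).
def parseSparse(line: String): (String, Vector) = {
  val split = line.trim.split("\\s+")
  val id = split.head
  val values = split.tail.map(_.toDouble)

  // keep only the positions that hold a non-zero value
  val nonZero = values.zipWithIndex.filter { case (v, _) => v != 0.0 }
  val indices = nonZero.map(_._2) // already in increasing order
  val entries = nonZero.map(_._1)

  // Vectors.sparse(size, indices, values): size is the full dimension,
  // including the zero entries that are not stored
  (id, Vectors.sparse(values.length, indices, entries))
}
```

In the question's pipeline, the resulting vector could then be passed to `IndexedRow(index, vector)` in place of `Vectors.dense(features)`, since `IndexedRow` takes a general `Vector`.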