Split key-value pairs in mapPartitions in Scala

Asked: 2016-03-17 16:29:39

Tags: scala apache-spark

I don't know if it is possible, but I would like to split the variable "a" into two lists inside my mapPartitions: one list l that stores all the numbers, and another list, say b, that stores all the words. Something like a.mapPartitions((p, v) => { val l = p.toList; val b = v.toList; ... })

For example, in my for loop, l(i) = 1 and b(i) = "score".

import scala.io.Source
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ListBuffer

val a = sc.parallelize(List(("score",1),("chicken",2),("magnacarta",2)) )

a.mapPartitions(p => {
  val l = p.toList
  val ret = new ListBuffer[Int]
  val words = new ListBuffer[String]
  for (i <- 0 until l.length) {
    words += l(i)._1 // the word of each (word, number) pair
    ret += l(i)._2   // the number of each pair
  }
  ret.toList.iterator
})

1 Answer:

Answer 0 (score: 1)

Spark is a distributed computing engine. You perform operations on partitioned data across the nodes of a cluster. You then need a reduce() step that performs the summary operation.

See this code, which does what you want:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object SimpleApp {

  class MyResponseObj(var numbers: List[Int] = List[Int](), var words: List[String] = List[String]()) extends java.io.Serializable{
    def +=(str: String, int: Int) = {
      numbers = numbers :+ int
      words = words :+ str
      this
    }

    def +=(other: MyResponseObj) = {
      numbers = numbers ++ other.numbers
      words = words ++ other.words
      this
    }

  }


  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val a = sc.parallelize(List(("score", 1), ("chicken", 2), ("magnacarta", 2)))

    // Build one MyResponseObj per partition, then merge them all with reduce.
    val myResponseObj = a.mapPartitions[MyResponseObj](it => {
      val partitionResult = new MyResponseObj()
      it.foreach {
        case (str: String, int: Int) => partitionResult += (str, int)
        case _ => println("unexpected data")
      }
      Iterator(partitionResult)
    }).reduce((myResponseObj1, myResponseObj2) => myResponseObj1 += myResponseObj2)

    println(myResponseObj.words)
    println(myResponseObj.numbers)

  }
}
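The mapPartitions-then-reduce pattern used above can also be sketched locally without a Spark cluster: grouped() stands in for partitions here, and each "partition" is collapsed into a (words, numbers) pair before the pairs are merged, mirroring the two += methods of MyResponseObj (the object name and the partition size of 2 are assumptions for illustration):

```scala
// Local sketch of the mapPartitions + reduce pattern, no Spark needed.
object LocalReduceSketch {
  def main(args: Array[String]): Unit = {
    val data = List(("score", 1), ("chicken", 2), ("magnacarta", 2))
    // Each "partition" of 2 elements becomes one (words, numbers) pair,
    // like building one MyResponseObj per partition.
    val perPartition = data.grouped(2).map { part =>
      (part.map(_._1), part.map(_._2))
    }
    // reduce concatenates the per-partition results,
    // like the += (other: MyResponseObj) merge step.
    val (words, numbers) = perPartition.reduce { (a, b) =>
      (a._1 ++ b._1, a._2 ++ b._2)
    }
    println(words)   // List(score, chicken, magnacarta)
    println(numbers) // List(1, 2, 2)
  }
}
```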