Using two RDDs

Time: 2018-08-17 06:39:37

Tags: apache-spark pyspark rdd

I have two RDDs. The first contains

(pID, Name, Price, Column1)

The second contains

(pID, Seller, Column3)

I want to get Column3 where the pIDs match, while still keeping the format of the first RDD. I can't figure out the logic to produce this output, and I'm also struggling with the functional programming style. Please help.

1 answer:

Answer 0 (score: 1)

    val as = List((101, ("item A", 1.24)),
      (102, ("item B", 2.45)),
      (103, ("item C", 3.54)))
    val rdd1 = sc.parallelize(as) // pair RDD with key = pId, value = (name, price)

    val ls = List((101, "Seller A"),
      (101, "Seller B"),
      (102, "Seller C"),
      (102, "Seller D"),
      (103, "Seller E"))
    val rdd2 = sc.parallelize(ls) // pair RDD with key = pId, value = seller

    // Inner join on the pId key:
    val innerJoinedRdd = rdd1.join(rdd2)
    innerJoinedRdd.collect().foreach(println)

which prints:

    (101,((item A,1.24),Seller A))
    (101,((item A,1.24),Seller B))
    (102,((item B,2.45),Seller C))
    (102,((item B,2.45),Seller D))
    (103,((item C,3.54),Seller E))
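
Since the asker also wanted to keep the flat shape of the first RDD, i.e. one (pID, Name, Price, Seller) record per match, the nested join result can be reshaped with a map. This flattening step is my addition building on the answer above, assuming that 4-tuple layout is what was wanted:

    // Unnest ((name, price), seller) into a flat 4-tuple per matching pId.
    // Hypothetical reshaping step; the original answer stops at the raw join.
    val flattened = innerJoinedRdd.map {
      case (pId, ((name, price), seller)) => (pId, name, price, seller)
    }
    flattened.collect().foreach(println)
    // (101,item A,1.24,Seller A)
    // (101,item A,1.24,Seller B)
    // ...

The question is tagged pyspark; the same approach carries over, since PySpark's rdd1.join(rdd2) also returns (pId, ((name, price), seller)) pairs that can be flattened with a map in the same way.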