Separate an RDD based on membership of Id in another RDD

Time: 2017-09-22 14:30:27

Tags: scala apache-spark

I have two case classes and an RDD of each.

case class Thing1(Id: String, a: String, b: String, c: java.util.Date, d: Double)
case class Thing2(Id: String, e:  java.util.Date, f: Double)

val rdd1 = // Loads an rdd of type RDD[Thing1]
val rdd2 = // Loads an rdd of type RDD[Thing2]

I want to create two new RDD[Thing1]s: one containing the elements of rdd1 whose Id exists in rdd2, and another containing the elements of rdd1 whose Id does not exist in rdd2.

Here is what I have tried (I have looked at this, Scala Spark contains vs. does not contain, and other Stack Overflow posts, but none of them worked):

val rdd2_ids = rdd2.map(r => r.Id)
val rdd1_present = rdd1.filter{case r => rdd2 contains r.Id}

val rdd1_absent = rdd1.filter{case r => !(rdd2 contains r.Id)}

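But this gives me the following error: value contains is not a member of org.apache.spark.rdd.RDD[String]. I have seen many questions on SO asking how to do what I want, but none of them has worked for me. I often get a value _____ is not a member of org.apache.spark.rdd.RDD[String] error.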

Why do these other answers not work for me, and how can I achieve my goal?

2 answers:

Answer 0: (score: 0)

I created two simple RDDs


Now you can join them on the element in which you want to find common values:

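A minimal sketch of the join-based approach this answer describes, using the Thing1 and Thing2 case classes from the question. The object name JoinSketch, the helper splitById, the local SparkContext, and the sample rows are all assumptions made for illustration; they are not from the original answer. The idea is to key rdd1 by Id, join against the distinct Ids of rdd2 to get the matching elements, and use subtractByKey to get the rest.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

case class Thing1(Id: String, a: String, b: String, c: java.util.Date, d: Double)
case class Thing2(Id: String, e: java.util.Date, f: Double)

object JoinSketch {
  // Hypothetical helper: returns (elements of rdd1 whose Id occurs in rdd2,
  // elements of rdd1 whose Id does not occur in rdd2).
  def splitById(rdd1: RDD[Thing1], rdd2: RDD[Thing2]): (RDD[Thing1], RDD[Thing1]) = {
    // Key rdd1 by Id so the pair-RDD operations become available.
    val keyed = rdd1.keyBy(_.Id)
    // One (Id, ()) pair per distinct Id, so the join cannot duplicate rows of rdd1.
    val ids = rdd2.map(t => (t.Id, ())).distinct()
    val present = keyed.join(ids).map { case (_, (thing, _)) => thing }
    val absent = keyed.subtractByKey(ids).values
    (present, absent)
  }

  def main(args: Array[String]): Unit = {
    // Local context and sample data are assumptions for illustration only.
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("join-sketch"))
    val date = new java.util.Date(0L)
    val rdd1 = sc.parallelize(Seq(
      Thing1("a", "x", "y", date, 1.0),
      Thing1("b", "x", "y", date, 2.0),
      Thing1("c", "x", "y", date, 3.0)))
    val rdd2 = sc.parallelize(Seq(Thing2("a", date, 0.5), Thing2("c", date, 0.7)))

    val (present, absent) = splitById(rdd1, rdd2)
    println("present: " + present.map(_.Id).collect().sorted.mkString(","))
    println("absent: " + absent.map(_.Id).collect().sorted.mkString(","))
    sc.stop()
  }
}
```

The attempt in the question fails because an RDD is a distributed dataset, not a local collection, so it has no contains method, and one RDD cannot be used inside another RDD's filter closure. Joining by key keeps the membership test distributed.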

Answer 1: (score: 0)

Try a full outer join:

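// Key both RDDs by Id (the case class field is `Id`, not `id`) and join them.
// Each value is a pair (Option[Thing1], Option[Thing2]).
val joined = rdd1.map(s => (s.Id, s)).fullOuterJoin(rdd2.map(s => (s.Id, s))).cache()

// only in left (rdd1)
joined.filter(s => s._2._2.isEmpty).foreach(println)

// only in right (rdd2)
joined.filter(s => s._2._1.isEmpty).foreach(println)

// in both
joined.filter(s => s._2._1.isDefined && s._2._2.isDefined).foreach(println)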
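The full outer join above yields pairs of Options rather than the two RDD[Thing1] values the question asks for. One way to get from one to the other, sketched under assumptions: pattern-match on the Option pair with RDD.collect (the overload that takes a partial function). The names FullOuterJoinSketch and split, the local SparkContext, and the sample rows are hypothetical. The Ids of rdd2 are deduplicated first, which is an addition to the answer's code, so that a repeated Id in rdd2 does not duplicate rows of rdd1.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

case class Thing1(Id: String, a: String, b: String, c: java.util.Date, d: Double)
case class Thing2(Id: String, e: java.util.Date, f: Double)

object FullOuterJoinSketch {
  // Hypothetical helper: returns (present, absent) as two RDD[Thing1].
  def split(rdd1: RDD[Thing1], rdd2: RDD[Thing2]): (RDD[Thing1], RDD[Thing1]) = {
    // Distinct Ids of rdd2, paired with Unit so they can take part in a join.
    val ids = rdd2.map(t => (t.Id, ())).distinct()
    // Values have type (Option[Thing1], Option[Unit]).
    val joined = rdd1.keyBy(_.Id).fullOuterJoin(ids).cache()
    // Left side defined and right side defined: the Id exists in rdd2.
    val present = joined.collect { case (_, (Some(thing), Some(_))) => thing }
    // Left side defined and right side empty: the Id does not exist in rdd2.
    val absent = joined.collect { case (_, (Some(thing), None)) => thing }
    (present, absent)
  }

  def main(args: Array[String]): Unit = {
    // Local context and sample data are assumptions for illustration only.
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("full-outer-join-sketch"))
    val date = new java.util.Date(0L)
    val rdd1 = sc.parallelize(Seq(
      Thing1("a", "x", "y", date, 1.0),
      Thing1("b", "x", "y", date, 2.0)))
    val rdd2 = sc.parallelize(Seq(Thing2("a", date, 0.5), Thing2("z", date, 0.9)))

    val (present, absent) = split(rdd1, rdd2)
    present.collect().foreach(println)
    absent.collect().foreach(println)
    sc.stop()
  }
}
```

Since only elements of rdd1 are wanted, a leftOuterJoin would likely be enough here; the full outer join is kept to stay close to the answer.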