Left anti join in Spark?

Date: 2017-04-03 14:09:42

Tags: scala apache-spark

I have defined two tables like this:

val tableName = "table1"
val tableName2 = "table2"

val format = new SimpleDateFormat("yyyy-MM-dd")
val data = List(
  List("mike", 26, true),
  List("susan", 26, false),
  List("john", 33, true)
)
val data2 = List(
  List("mike", "grade1", 45, "baseball", new java.sql.Date(format.parse("1957-12-10").getTime)),
  List("john", "grade2", 33, "soccer", new java.sql.Date(format.parse("1978-06-07").getTime)),
  List("john", "grade2", 32, "golf", new java.sql.Date(format.parse("1978-06-07").getTime)),
  List("mike", "grade2", 26, "basketball", new java.sql.Date(format.parse("1978-06-07").getTime)),
  List("lena", "grade2", 23, "baseball", new java.sql.Date(format.parse("1978-06-07").getTime))
)

val rdd = sparkContext.parallelize(data).map(Row.fromSeq(_))
val rdd2 = sparkContext.parallelize(data2).map(Row.fromSeq(_))
val schema = StructType(Array(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true),
  StructField("isBoy", BooleanType, false)
))
val schema2 = StructType(Array(
  StructField("name", StringType, true),
  StructField("grade", StringType, true),
  StructField("howold", IntegerType, true),
  StructField("hobby", StringType, true),
  StructField("birthday", DateType, false)
))

val df = sqlContext.createDataFrame(rdd, schema)
val df2 = sqlContext.createDataFrame(rdd2, schema2)
df.createOrReplaceTempView(tableName)
df2.createOrReplaceTempView(tableName2)

I'm trying to build a query that returns the rows of table1 that have no matching row in table2. I tried to do that with this query:

Select * from table1 LEFT JOIN table2 ON table1.name = table2.name AND table1.age = table2.howold AND table2.name IS NULL AND table2.howold IS NULL

But this just gives me all the rows of table1:

List({"name":"john","age":33,"isBoy":true}, {"name":"susan","age":26,"isBoy":false}, {"name":"mike","age":26,"isBoy":true})

How can I do this type of join efficiently in Spark?

I'm looking for a SQL query, because I need to be able to specify which columns to compare between the two tables, rather than comparing row by row as in the other recommended questions (e.g. using subtract, except, and so on).

5 Answers:

Answer 0 (score: 18):

You can use the "left anti" join type, either with the DataFrame API or with SQL (the DataFrame API supports everything SQL supports, including any join condition you need):

DataFrame API:

df.as("table1").join(
  df2.as("table2"),
  $"table1.name" === $"table2.name" && $"table1.age" === $"table2.howold",
  "leftanti"
)

SQL:

sqlContext.sql(
  """SELECT table1.* FROM table1
    | LEFT ANTI JOIN table2
    | ON table1.name = table2.name AND table1.age = table2.howold
  """.stripMargin)

Note: it's also worth noting that the sample data can be created in a more concise, less verbose way, without specifying the schemas separately, by using tuples and the implicit toDF method and then "fixing" the automatically inferred schema where needed.
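A minimal sketch of that approach, reusing the question's column names (the exact snippet is an assumption; here the birthday strings are cast to dates as the schema "fix"):

// a sketch, assuming the question's sqlContext on Spark 2.x
import sqlContext.implicits._
import org.apache.spark.sql.types.DateType

// tuples let Spark infer the schema, so no StructType is needed
val df = List(
  ("mike", 26, true),
  ("susan", 26, false),
  ("john", 33, true)
).toDF("name", "age", "isBoy")

val df2 = List(
  ("mike", "grade1", 45, "baseball", "1957-12-10"),
  ("john", "grade2", 33, "soccer", "1978-06-07"),
  ("john", "grade2", 32, "golf", "1978-06-07"),
  ("mike", "grade2", 26, "basketball", "1978-06-07"),
  ("lena", "grade2", 23, "baseball", "1978-06-07")
).toDF("name", "grade", "howold", "hobby", "birthday")
  // "fix" the inferred schema: birthday is inferred as a string, so cast it to a date
  .withColumn("birthday", $"birthday".cast(DateType))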

Answer 1 (score: 3):

You can do it with the built-in except function (I would have used the code you provided, but you didn't include the imports, so I couldn't just copy/paste it :()

val a = sc.parallelize(Seq((1,"a",123),(2,"b",456))).toDF("col1","col2","col3")
val b = sc.parallelize(Seq((4,"a",432),(2,"t",431),(2,"b",456))).toDF("col1","col2","col3")

scala> a.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   a| 123|
|   2|   b| 456|
+----+----+----+


scala> b.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   4|   a| 432|
|   2|   t| 431|
|   2|   b| 456|
+----+----+----+

scala> a.except(b).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   a| 123|
+----+----+----+
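
One caveat: except compares entire rows. To compare only specific columns, as the question requires, you can project both sides down to those columns first and then join back to recover the full rows. A sketch, assuming the question's df (table1) and df2 (table2):

// project both sides down to the columns being compared
val unmatchedKeys = df.select("name", "age")
  .except(df2.selectExpr("name", "howold AS age"))

// inner join on the keys to recover the full table1 rows
val result = df.join(unmatchedKeys, Seq("name", "age"))
result.show()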

Answer 2 (score: 0):

You can use left_anti:

dfRcc20.as("a").join(dfClientesDuplicados.as("b"),
  col("a.eteerccdiid") === col("b.eteerccdiid") &&
    col("a.eteerccdinr") === col("b.eteerccdinr"),
  "left_anti")

Answer 3 (score: -1):

In plain SQL you can simply use the query below (not sure whether it works in Spark):

Select * from table1 LEFT JOIN table2 ON table1.name = table2.name AND table1.age = table2.howold where table2.name IS NULL 

This will return all the rows of table1 for which the join found no match. (Note the NULL checks moved into the WHERE clause; in the original query they sat in the ON condition, so no table2 row could ever match and the LEFT JOIN simply kept every table1 row, which is why it returned everything.)
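
As it happens, this pattern does run in Spark SQL as well. A minimal sketch against the temp views registered in the question, selecting only table1's columns:

val result = sqlContext.sql(
  """SELECT table1.* FROM table1
    | LEFT JOIN table2
    | ON table1.name = table2.name AND table1.age = table2.howold
    | WHERE table2.name IS NULL
  """.stripMargin)
result.show()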

Answer 4 (score: -1):

Left anti join on a Dataset in Spark (Java):

A left anti join returns all rows from the first dataset that do not have a match in the second dataset.

Example with code:

/*Read data from Employee.csv */
Dataset<Row> employee = sparkSession.read().option("header", "true")
                .csv("C:\\Users\\Desktop\\Spark\\Employee.csv");
employee.show();

/*Read data from Employee1.csv */
Dataset<Row> employee1 = sparkSession.read().option("header", "true")
                .csv("C:\\Users\\Desktop\\Spark\\Employee1.csv");
employee1.show();

/*Apply left anti join*/
Dataset<Row> leftAntiJoin = employee.join(employee1, employee.col("name").equalTo(employee1.col("name")), "leftanti");

leftAntiJoin.show();

Output:

1) Employee dataset
+-------+--------+-------+
|   name| address| salary|
+-------+--------+-------+
|   Arun|  Indore|    500|
|Shubham|  Indore|   1000|
| Mukesh|Hariyana|  10000|
|  Kanha|  Bhopal| 100000|
| Nandan|Jabalpur|1000000|
|   Raju|  Rohtak|1000000|
+-------+--------+-------+

2) Employee1 dataset
+-------+--------+------+
|   name| address|salary|
+-------+--------+------+
|   Arun|  Indore|   500|
|Shubham|  Indore|  1000|
| Mukesh|Hariyana| 10000|
+-------+--------+------+

3) Applied leftanti join and final data
+------+--------+-------+
|  name| address| salary|
+------+--------+-------+
| Kanha|  Bhopal| 100000|
|Nandan|Jabalpur|1000000|
|  Raju|  Rohtak|1000000|
+------+--------+-------+