Question

When I try to join two DataFrames using

DataFrame joindf = dataFrame.join(df, df.col(joinCol)); //.equalTo(dataFrame.col(joinCol)));

my program throws the following exception:

org.apache.spark.sql.AnalysisException: join condition 'url' of type string is not a boolean.;

Here the value of joinCol is url. I need input on what could be causing this exception.

Answer 1

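The join variant that takes a Column as its second argument expects that column to evaluate as a boolean expression. df.col(joinCol) alone is a string column, which is why the analyzer rejects it.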

If you want a simple equi-join based on a column name, use the version that takes the column name as a String:

String joinCol = "foo";
dataFrame.join(df, joinCol);

Answer 2

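This means the join condition must evaluate to a boolean expression. Suppose we want to join two DataFrames on id; then we can do the following: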

Using Python:

df1.join(df2, df1['id'] == df2['id'], 'left')  # the third parameter is the join type, here a left join

Using Scala:

df1.join(df2, df1("id") === df2("id"))    // inner join on the id column

Answer 3

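You cannot use df.col(joinCol) on its own, because it is not a boolean expression. To join two DataFrames you need to say which column on each side should be compared.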

Suppose you have two DataFrames, empDF and deptDF. Joining them in Scala should look like this:

empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner")
    .show(false)