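I need to create a function that returns True or False depending on whether the values in the first DataFrame match the values in the second DataFrame.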
df1 (lookup DataFrame)
root
|-- Customer_ID: string (nullable = true)
|-- Customer_Name: string (nullable = true)
df2 (incoming DataFrame)
root
|-- CustomerID: string (nullable = true)
|-- CustomerName: string (nullable = true)
|-- Address: string (nullable = true)
|-- ZipCode: double (nullable = true)
|-- State: string (nullable = true)
df1.id [123, 234, 345, 456, 567]
df2.id [123, 567]
def fn_new_function(df1, df2):
    is_same = df2.join(df1,df2('CustomerID') == df1('Customer_ID'), how='inner') \
        .where(df2['CustomerName'] == df1['Customer_Name'])\
        .count() == df2.count()
    if is_same.count() > 0:
        return True
    else:
        return False
I get the following error:
Traceback (most recent call last):
  File "/---/process_files.py", line 234, in <module>
    main()
  File "/---/process_files.py", line 456, in fn_new_function
    is_same = df2.join(df1,df2('id')!=df1('c_id'), how='inner') \
TypeError: 'DataFrame' object is not callable
Answer 0 (score: 3)
If your DataFrames have the same schema, you can subtract one from the other and check that the result is empty:
is_same = df1.subtract(df2).count() == 0
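One caveat worth making explicit: subtract is one-directional, so an empty result only shows that every distinct row of df1 also appears in df2. A minimal sketch of a two-way check, assuming both arguments are PySpark DataFrames with identical schemas (the function name same_rows is only an illustrative choice):

```python
def same_rows(df1, df2):
    """Return True when df1 and df2 contain the same set of distinct rows.

    Assumes df1 and df2 are PySpark DataFrames with identical schemas.
    subtract() behaves like SQL EXCEPT DISTINCT, so duplicate rows are ignored.
    """
    only_in_df1 = df1.subtract(df2).count()  # rows of df1 missing from df2
    only_in_df2 = df2.subtract(df1).count()  # rows of df2 missing from df1
    return only_in_df1 == 0 and only_in_df2 == 0
```

Because duplicates are ignored, this checks set equality of rows, not that each row occurs the same number of times.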
If you only want to check that the id fields of the two DataFrames match (when the schemas differ), you can compare just the ID columns:
is_same = df1.select('id').subtract(df2.select('id')).count() == 0
Note that this compares IDs only; it does not take the names into account.
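Applied to the schemas in the question, where the ID columns are named differently, an ID-only check might look like the sketch below. It assumes the goal is to confirm that every incoming ID exists in the lookup table, so the subtraction runs from df2 to df1; the function and parameter names are hypothetical.

```python
def all_ids_known(lookup_df, incoming_df):
    """Return True when every CustomerID in incoming_df exists in lookup_df.

    Assumes PySpark DataFrames shaped like df1 (lookup) and df2 (incoming)
    in the question. subtract() matches columns by position, so the
    differing column names are fine as long as the types agree.
    """
    incoming_ids = incoming_df.select("CustomerID")
    lookup_ids = lookup_df.select("Customer_ID")
    unknown_ids = incoming_ids.subtract(lookup_ids)  # IDs only in incoming_df
    return unknown_ids.count() == 0
```

With the sample IDs from the question, df2's [123, 567] are all present in df1, so this direction is satisfied, whereas subtracting df2 from df1 would leave 234, 345 and 456.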
To verify that every record in df2 matches the name defined in df1, you can use a join and a filter:
is_same = df2.join(df1, on='id', how='inner')\
    .where(df1['CustomerName'] == df2['CustomerName'])\
    .count() == df2.count()
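This version performs an inner join and filters out any rows whose names do not match across the two DataFrames. If all the names match, the resulting count equals the total number of records in df2.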
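With the column names from the question (Customer_ID and Customer_Name in df1, CustomerID and CustomerName in df2), the same idea could be written as below. This is a sketch rather than a drop-in fix: the count comparison assumes Customer_ID is unique in df1, because duplicate lookup rows would inflate the joined count. The second function is an alternative, not part of the approach above: a left anti join that counts incoming rows with no matching lookup row, and so does not rely on that assumption. Both function names are hypothetical.

```python
def names_match_by_count(df1, df2):
    """Inner join on ID, keep rows whose names agree, then compare counts."""
    matched = df2.join(
        df1, on=df2["CustomerID"] == df1["Customer_ID"], how="inner"
    ).where(df2["CustomerName"] == df1["Customer_Name"])
    return matched.count() == df2.count()


def names_match_by_anti_join(df1, df2):
    """Left anti join: keep df2 rows that have no ID-and-name match in df1."""
    condition = (df2["CustomerID"] == df1["Customer_ID"]) & (
        df2["CustomerName"] == df1["Customer_Name"]
    )
    unmatched = df2.join(df1, on=condition, how="left_anti")
    return unmatched.count() == 0
```

Note the square brackets when selecting columns; writing df2('CustomerID') is what triggers the TypeError shown in the question.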
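Answer 1 (score: 0)
def fn_new_function(df1, df2):
    df1.show()
    df2.show()
    # Select columns with square brackets; df1('Customer_ID') raises
    # "TypeError: 'DataFrame' object is not callable".
    is_same = df1.join(df2, df1['Customer_ID'] == df2['CustomerID'], how='inner') \
        .where(df1['Customer_Name'] == df2['CustomerName'])
    # True if at least one ID matches with the same name.
    return is_same.count() > 0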
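For context, the TypeError in the question comes from writing df2('CustomerID'): parentheses call an object, and a PySpark DataFrame is not callable, while square brackets select a column. The toy class below is a hypothetical stand-in (it is not PySpark) that reproduces the same distinction in plain Python:

```python
class FakeFrame:
    """Hypothetical stand-in for a DataFrame: indexable, but not callable."""

    def __init__(self, columns):
        self._columns = columns

    def __getitem__(self, name):
        # Square-bracket access looks the column up by name.
        return self._columns[name]


frame = FakeFrame({"CustomerID": "column object"})

value = frame["CustomerID"]  # indexing works

try:
    frame("CustomerID")  # calling fails: the class defines no __call__
except TypeError as exc:
    message = str(exc)

print(value)
print(message)
```

The same reasoning applies to the real DataFrame: use df['CustomerID'] (or df.CustomerID) wherever the question's code uses df('CustomerID').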