Compare two DataFrames and return True or False

Time: 2018-04-17 19:11:50

Tags: python-2.7 pyspark

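I need to create a function that returns True or False depending on whether the values in the first DataFrame equal the values in the second DataFrame.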

df1 (lookup DataFrame)
 root
 |-- Customer_ID: string (nullable = true)
 |-- Customer_Name: string (nullable = true)

df2 (incoming DataFrame)
 root
 |-- CustomerID: string (nullable = true)
 |-- CustomerName: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- ZipCode: double (nullable = true)
 |-- State: string (nullable = true)

df1.id [123, 234, 345, 456, 567]
df2.id [123, 567]

def fn_new_function(df1, df2):
   is_same = df2.join(df1,df2('CustomerID') == df1('Customer_ID'), how='inner') \
            .where(df2['CustomerName'] == df1['Customer_Name'])\ 
            .count() == df2.count()

   if is_same.count() > 0:
      return True
   else:
      return False

I get the following error:

Traceback (most recent call last):
  File "/---/process_files.py", line 234, in <module>
    main()
  File "/---/process_files.py", line 456, in fn_new_function
    is_same = df2.join(df1,df2('id')!=df1('c_id'), how='inner') \
TypeError: 'DataFrame' object is not callable

2 answers:

Answer 0 (score: 3)

If your DataFrames have the same schema, you can subtract one from the other and check that the result is empty:

is_same = df1.subtract(df2).count() == 0
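
One caveat with a one-way subtraction: `df1.subtract(df2)` is empty whenever every row of df1 also appears in df2, even if df2 contains extra rows. A minimal sketch of a two-way check follows, assuming both arguments are PySpark DataFrames with the same schema. The helper name `has_same_rows` is hypothetical and not part of PySpark.

```python
def has_same_rows(df1, df2):
    """Return True when df1 and df2 contain the same set of distinct rows.

    Assumes both arguments are pyspark.sql.DataFrame objects with the same
    schema. subtract() behaves like SQL EXCEPT DISTINCT, so the number of
    times a row is repeated is not compared.
    """
    only_in_df1 = df1.subtract(df2)  # rows of df1 that are missing from df2
    only_in_df2 = df2.subtract(df1)  # rows of df2 that are missing from df1
    # Both differences must be empty for the two row sets to match.
    return only_in_df1.count() == 0 and only_in_df2.count() == 0
```

Each `count()` triggers a Spark job, so on large inputs this costs two passes over the data.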

If you only want to check that all id fields match across the two DataFrames (different schemas), you can compare just the ID columns:

is_same = df1.select('id').subtract(df2.select('id')).count() == 0

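Note that this does not account for duplicates: `subtract` behaves like SQL `EXCEPT DISTINCT`, so repeated ids are ignored.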
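
For the schemas in the question, the direction of the subtraction matters, because df1 (the lookup table) holds more ids than df2. The sketch below checks that every incoming id exists in the lookup table. It assumes the column names from the question (`CustomerID` in df2, `Customer_ID` in df1). The function name `all_ids_in_lookup` is hypothetical.

```python
def all_ids_in_lookup(incoming_df, lookup_df):
    """Return True when every CustomerID in incoming_df exists in lookup_df.

    Assumes incoming_df has a 'CustomerID' column and lookup_df has a
    'Customer_ID' column of the same type, as in the question's schemas.
    """
    incoming_ids = incoming_df.select("CustomerID")
    lookup_ids = lookup_df.select("Customer_ID")
    # subtract() pairs columns by position, so the differing column
    # names do not matter here.
    unknown_ids = incoming_ids.subtract(lookup_ids)
    return unknown_ids.count() == 0
```

With the ids listed in the question (df2 holds 123 and 567, both present in df1), this check would be expected to return True.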

To verify that all records in df2 match the names defined in df1, you can use a join and a filter:

is_same = df2.join(df1, on='id', how='inner')\
             .where(df1['CustomerName'] == df2['CustomerName'])\
             .count() == df2.count()

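This version performs an inner join and filters out any rows whose names do not match across the two DataFrames. If every name matches, the resulting count equals the total number of records in df2.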
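
Applied to the column names in the question, the same idea might look like the sketch below. It also avoids the `TypeError` from the traceback: a DataFrame is not callable, so columns are referenced with square brackets (`df2['CustomerID']`) rather than parentheses (`df2('CustomerID')`). The function name comes from the question. The sketch assumes `Customer_ID` is unique in df1, since duplicate lookup rows would inflate the joined count.

```python
def fn_new_function(df1, df2):
    """Return True when every df2 row matches a df1 row on id and name.

    Assumes df1 has 'Customer_ID' and 'Customer_Name' columns, df2 has
    'CustomerID' and 'CustomerName' columns, and Customer_ID is unique
    within df1.
    """
    # Reference columns with df[...]; df(...) raises
    # "TypeError: 'DataFrame' object is not callable".
    join_condition = df2["CustomerID"] == df1["Customer_ID"]
    matched = (
        df2.join(df1, join_condition, how="inner")
           .where(df2["CustomerName"] == df1["Customer_Name"])
    )
    # The inner join and filter drop every df2 row without a match on
    # both id and name, so the counts agree only if all rows matched.
    return matched.count() == df2.count()
```

Rows with a null id or name never satisfy the equality comparisons, so they count as mismatches in this sketch.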

Answer 1 (score: 0)

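def fn_new_function(df1, df2):

   # Print both DataFrames for inspection
   df1.show()
   df2.show()

   # Reference columns with df[...]; a DataFrame is not callable, so df(...) fails
   is_same = df1.join(df2, df1['Customer_ID'] == df2['CustomerID'], how='inner') \
            .where(df1['Customer_Name'] == df2['CustomerName'])

   # True if at least one row matches on both id and name
   return is_same.count() > 0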