Question

I am trying to compare 4 columns in a Pandas DataFrame and populate a 5th column based on the result. In plain SQL it would look like this:

if speciality_new is null and location_new is null then 'No match found'
elif specialty <> specialty_new and location <> location_new then 'both are different'
elif specialty_new is null then 'specialty not found'
elif location_new is null then 'location not found'
else 'true'

I read that this can be done with np.where, but my code fails. Can someone tell me what I am doing wrong? This is what I wrote:

masterDf['Match'] = np.where(
    masterDf[speciality_new].isnull() & masterDf[location_new].isnull(), 'No match found',
    masterDf[speciality] != masterDf[speciality_new] & masterDf[location] != masterDf[location_new], 'Both specialty and location didnt match',
    masterDf[speciality] != masterDf[speciality_new], 'Specialty didnt match',
    masterDf[location] != masterDf[location_new], 'Location didnt match',
    True)

The error message is TypeError: unsupported operand type(s) for &: 'str' and 'str', which makes no sense to me, since '&' is the syntax for 'and'.

dfsample is what I have, and dfFinal is what I want:

dfsample = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
       'speciality': ['doctor', 'nurse', 'patient', 'driver', 'director'],
       'location': ['texas', 'dc', 'georgia', '', 'florida'],
       'speciality_new' : ['doctor', 'nurse', 'director', 'nurse', ''],
       'location_new': ['texas', 'alaska', 'georgia', 'maryland', 'florida']})

dfFinal = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
       'speciality': ['doctor', 'nurse', 'patient', 'driver', 'director'],
       'location': ['texas', 'dc', 'georgia', '', 'florida'],
       'speciality_new' : ['doctor', 'nurse', 'director', 'nurse', ''],
       'location_new': ['texas', 'alaska', 'georgia', 'maryland', 'florida'],
       'match': ['TRUE', 'location didn’t match', 'specialty didn’t match', 'both specialty and location didn’t match', 'specialty didn’t match']})

Answer 1

Here is another way to solve it without np.where. I am using the apply function.

import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
       'speciality': ['doctor', 'nurse', 'patient', 'driver', 'director'],
       'location': ['texas', 'dc', 'georgia', '', 'florida'],
       'speciality_new' : ['doctor', 'nurse', 'director', 'nurse', np.nan],
       'location_new': ['texas', 'alaska', 'georgia', 'maryland', 'florida']})

print (df)

def master_check(x):
    # x is a single row, so its values are scalars and plain `and` applies
    if pd.isnull(x['speciality_new']) and pd.isnull(x['location_new']):
        return 'No match found'
    elif x['speciality'] != x['speciality_new'] and x['location'] != x['location_new']:
        return 'Both specialty and location didnt match'
    elif x['speciality'] != x['speciality_new']:
        return 'Specialty didnt match'
    elif x['location'] != x['location_new']:
        return 'Location didnt match'
    else:
        return True

# axis=1 passes each row to master_check
df['Match'] = df.apply(master_check, axis=1)
print(df)

The output will be:

   ID speciality location speciality_new location_new
0   1     doctor    texas         doctor        texas
1   2      nurse       dc          nurse       alaska
2   3    patient  georgia       director      georgia
3   4     driver                   nurse     maryland
4   5   director  florida            NaN      florida


   ID speciality  ... location_new                                    Match
0   1     doctor  ...        texas                                     True
1   2      nurse  ...       alaska                     Location didnt match
2   3    patient  ...      georgia                    Specialty didnt match
3   4     driver  ...     maryland  Both specialty and location didnt match
4   5   director  ...      florida                    Specialty didnt match

If you really want to use numpy.where(), you have to treat each False branch as a separate, nested numpy.where() call. To implement it with numpy.where(), you would write it like this:

import pandas as pd
import numpy as np

masterDf = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
       'speciality': ['doctor', 'nurse', 'patient', 'driver', 'director'],
       'location': ['texas', 'dc', 'georgia', '', 'florida'],
       'speciality_new' : ['doctor', 'nurse', 'director', 'nurse', ''],
       'location_new': ['texas', 'alaska', 'georgia', 'maryland', 'florida']})


masterDf['Match'] = np.where(
    ((masterDf.speciality_new.isnull()) & (masterDf.location_new.isnull())), 'No match found',
    np.where(((masterDf.speciality != masterDf.speciality_new) & (masterDf.location != masterDf.location_new)), 'Both specialty and location didnt match',
    np.where((masterDf.speciality != masterDf.speciality_new), 'Specialty didnt match',
    np.where((masterDf.location != masterDf.location_new), 'Location didnt match',
    True))))

print (masterDf)

The output will be:

   ID speciality  ... location_new                                    Match
0   1     doctor  ...        texas                                     True
1   2      nurse  ...       alaska                     Location didnt match
2   3    patient  ...      georgia                    Specialty didnt match
3   4     driver  ...     maryland  Both specialty and location didnt match
4   5   director  ...      florida                    Specialty didnt match
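
As for the TypeError in the question, a likely cause is operator precedence: in Python, & binds more tightly than !=, so an unparenthesized expression applies & to two string columns before any comparison happens. Separately, np.where accepts only three arguments (condition, x, y), which is why the branches have to be nested as shown above. The sketch below reproduces the precedence problem on a small made-up frame (the data and the dtype=object setting are illustrative assumptions, not the asker's real data) and shows two ways around it:

```python
import pandas as pd

# Hypothetical two-row frame; dtype=object keeps the values as plain Python strings.
df = pd.DataFrame({
    'speciality': ['doctor', 'nurse'],
    'speciality_new': ['doctor', 'director'],
    'location': ['texas', 'dc'],
    'location_new': ['texas', 'alaska'],
}, dtype=object)

# Without parentheses, Python groups this as
#   df['speciality'] != (df['speciality_new'] & df['location']) != df['location_new']
# so `&` is applied to two string columns first.
try:
    df['speciality'] != df['speciality_new'] & df['location'] != df['location_new']
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Fix 1: parenthesize each comparison before combining with `&`.
both_differ = (df['speciality'] != df['speciality_new']) & (df['location'] != df['location_new'])

# Fix 2: use .ne(); method calls need no extra parentheses.
both_differ_ne = df['speciality'].ne(df['speciality_new']) & df['location'].ne(df['location_new'])

print(raised_type_error)
print(both_differ.tolist())
```

Either fix yields a boolean Series that can be passed to np.where as its condition.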

Answer 2

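To evaluate multiple conditions with numpy, it is better to use numpy.select, where you specify the conditions, the expected output for each condition, and a default output, just like an if-elif-else statement: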

import numpy as np

condlist = [
    dfsample['speciality_new'].isnull() & dfsample['location_new'].isnull(),
    dfsample['speciality'].ne(dfsample['speciality_new']) & 
    dfsample['location'].ne(dfsample['location_new']),
    dfsample['speciality'].ne(dfsample['speciality_new']),
    dfsample['location'].ne(dfsample['location_new']),
]

choicelist = [
    'No match found',
    'Both specialty and location didnt match',
    'Specialty didnt match',
    'Location didnt match'
]

dfsample['match'] = np.select(condlist, choicelist, default=True)
print(dfsample)

Here ne stands for "not equal" (you could simply use != instead, as long as each comparison is wrapped in parentheses).

Output:

   ID speciality location speciality_new location_new                                    match
0   1     doctor    texas         doctor        texas                                     True
1   2      nurse       dc          nurse       alaska                     Location didnt match
2   3    patient  georgia       director      georgia                    Specialty didnt match
3   4     driver                   nurse     maryland  Both specialty and location didnt match
4   5   director  florida                     florida                    Specialty didnt match
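
One thing to watch: the sample data marks missing values with empty strings (''), and isnull() returns False for an empty string, so the 'No match found' condition never fires on that data. A minimal sketch, assuming empty strings should count as missing (the question does not say so explicitly), is to replace them with NaN before building the conditions. The three-row frame here is made up for illustration, and the default is the string 'True' rather than the boolean so that all outputs share one type:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last row has both *_new values empty.
df = pd.DataFrame({
    'speciality': ['doctor', 'director', 'driver'],
    'location': ['texas', 'florida', 'ohio'],
    'speciality_new': ['doctor', '', ''],
    'location_new': ['texas', 'florida', ''],
})

# isnull() does not flag empty strings.
empty_seen_as_null = df['speciality_new'].isnull().any()

# Assumption: treat '' as missing in the two *_new columns.
new_cols = ['speciality_new', 'location_new']
df[new_cols] = df[new_cols].replace('', np.nan)

condlist = [
    df['speciality_new'].isnull() & df['location_new'].isnull(),
    df['speciality'].ne(df['speciality_new']) & df['location'].ne(df['location_new']),
    df['speciality'].ne(df['speciality_new']),
    df['location'].ne(df['location_new']),
]
choicelist = [
    'No match found',
    'Both specialty and location didnt match',
    'Specialty didnt match',
    'Location didnt match',
]

# Conditions are checked in order; the first one that holds wins.
df['match'] = np.select(condlist, choicelist, default='True')
print(df)
```

Because np.select picks the first matching condition, the null check has to stay ahead of the inequality checks; a NaN compares as not equal to everything, so the later conditions would otherwise claim those rows.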
