I'm trying to compare 4 columns in a Pandas DataFrame and fill a 5th column based on the result. In plain SQL it would look something like this:
if speciality_new is null and location_new is null then 'No match found'
elif speciality <> speciality_new and location <> location_new then 'both are different'
elif speciality_new is null then 'specialty not found'
elif location_new is null then 'location not found'
else 'true'
I read that this can be done with np.where, but my code fails. Can someone tell me what I'm doing wrong? Here is what I wrote:
masterDf['Match'] = np.where(
masterDf[speciality_new].isnull() & masterDf[location_new].isnull(), 'No match found',
masterDf[speciality] != masterDf[speciality_new] & masterDf[location] != masterDf[location_new], 'Both specialty and location didnt match',
masterDf[speciality] != masterDf[speciality_new], 'Specialty didnt match',
masterDf[location] != masterDf[location_new], 'Location didnt match',
True)
The error message is TypeError: unsupported operand type(s) for &: 'str' and 'str'
That makes no sense to me, because '&' is the syntax for 'and'.
dfsample is what I have, and dfFinal is what I want:
dfsample = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
'speciality': ['doctor', 'nurse', 'patient', 'driver', 'director'],
'location': ['texas', 'dc', 'georgia', '', 'florida'],
'speciality_new' : ['doctor', 'nurse', 'director', 'nurse', ''],
'location_new': ['texas', 'alaska', 'georgia', 'maryland', 'florida']})
dfFinal = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
'speciality': ['doctor', 'nurse', 'patient', 'driver', 'director'],
'location': ['texas', 'dc', 'georgia', '', 'florida'],
'speciality_new' : ['doctor', 'nurse', 'director', 'nurse', ''],
'location_new': ['texas', 'alaska', 'georgia', 'maryland', 'florida'],
'match': ['TRUE', 'location didn’t match', 'specialty didn’t match', 'both specialty and location didn’t match', 'specialty didn’t match']})
Answer 0 (score: 1)
Here is another way to solve it, without np.where. I'm using the apply function.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
'speciality': ['doctor', 'nurse', 'patient', 'driver', 'director'],
'location': ['texas', 'dc', 'georgia', '', 'florida'],
'speciality_new' : ['doctor', 'nurse', 'director', 'nurse', np.nan],
'location_new': ['texas', 'alaska', 'georgia', 'maryland', 'florida']})
print (df)
def master_check(x):
    # x is a single row of df (a Series), because apply is called with axis=1
    if pd.isnull(x['speciality_new']) and pd.isnull(x['location_new']): return 'No match found'
    elif (x['speciality'] != x['speciality_new']) and (x['location'] != x['location_new']): return 'Both specialty and location didnt match'
    elif x['speciality'] != x['speciality_new']: return 'Specialty didnt match'
    elif x['location'] != x['location_new']: return 'Location didnt match'
    else: return True
df['Match'] = df.apply(master_check, axis=1)
print(df)
The output will be:
  ID speciality location speciality_new location_new
0  1     doctor    texas         doctor        texas
1  2      nurse       dc          nurse       alaska
2  3    patient  georgia       director      georgia
3  4     driver                   nurse     maryland
4  5   director  florida            NaN      florida
  ID speciality  ... location_new                                    Match
0  1     doctor  ...        texas                                     True
1  2      nurse  ...       alaska                     Location didnt match
2  3    patient  ...      georgia                    Specialty didnt match
3  4     driver  ...     maryland  Both specialty and location didnt match
4  5   director  ...      florida                    Specialty didnt match
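If you really want to use numpy.where(), keep in mind that it only takes a condition, a value for True and a value for False, so every further branch has to be a separate numpy.where() nested in the False position. To implement it with numpy.where(), you have to do it like this: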
import pandas as pd
import numpy as np
masterDf = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
'speciality': ['doctor', 'nurse', 'patient', 'driver', 'director'],
'location': ['texas', 'dc', 'georgia', '', 'florida'],
'speciality_new' : ['doctor', 'nurse', 'director', 'nurse', ''],
'location_new': ['texas', 'alaska', 'georgia', 'maryland', 'florida']})
masterDf['Match'] = np.where(
((masterDf.speciality_new.isnull()) & (masterDf.location_new.isnull())), 'No match found',
np.where(((masterDf.speciality != masterDf.speciality_new) & (masterDf.location != masterDf.location_new)), 'Both specialty and location didnt match',
np.where((masterDf.speciality != masterDf.speciality_new), 'Specialty didnt match',
np.where((masterDf.location != masterDf.location_new), 'Location didnt match',
True))))
print (masterDf)
The output will be:
  ID speciality  ... location_new                                    Match
0  1     doctor  ...        texas                                     True
1  2      nurse  ...       alaska                     Location didnt match
2  3    patient  ...      georgia                    Specialty didnt match
3  4     driver  ...     maryland  Both specialty and location didnt match
4  5   director  ...      florida                    Specialty didnt match
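One thing worth checking in the np.where version above: masterDf stores the missing values as empty strings ('') rather than NaN, and isnull() does not treat '' as missing, so the 'No match found' branch can never fire on that data. Below is a minimal sketch of one way to handle this, assuming an empty string should count as missing. The frame demo and the helper is_missing are made up for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame: row 0 uses empty strings, row 1 uses real NaN values.
demo = pd.DataFrame({
    'speciality_new': ['', np.nan],
    'location_new': ['', np.nan],
})

# isnull() only flags real missing values (NaN/None), not empty strings.
both_missing_raw = demo['speciality_new'].isnull() & demo['location_new'].isnull()

def is_missing(col):
    # Assumption: an empty string should be treated the same as NaN.
    return col.isnull() | col.eq('')

both_missing_fixed = is_missing(demo['speciality_new']) & is_missing(demo['location_new'])

print(both_missing_raw.tolist())    # [False, True]
print(both_missing_fixed.tolist())  # [True, True]
```

Whether '' should really mean "missing" depends on your data, so treat the helper as a starting point rather than a rule.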
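Answer 1 (score: 1)
To evaluate multiple conditions with numpy, it is better to use numpy.select, where you specify the conditions, the expected output for each condition, and a default output, just like an if-elif-else statement (the conditions are checked in order and the first one that is True wins):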
import numpy as np
condlist = [
    dfsample['speciality_new'].isnull() & dfsample['location_new'].isnull(),
    dfsample['speciality'].ne(dfsample['speciality_new']) &
        dfsample['location'].ne(dfsample['location_new']),
    dfsample['speciality'].ne(dfsample['speciality_new']),
    dfsample['location'].ne(dfsample['location_new']),
]
choicelist = [
    'No match found',
    'Both specialty and location didnt match',
    'Specialty didnt match',
    'Location didnt match'
]
dfsample['match'] = np.select(condlist, choicelist, default=True)
print(dfsample)
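Here ne stands for "not equal" (you could simply use != instead, as long as each comparison is wrapped in parentheses before combining them with &).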
Output:
  ID speciality location speciality_new location_new                                    match
0  1     doctor    texas         doctor        texas                                     True
1  2      nurse       dc          nurse       alaska                     Location didnt match
2  3    patient  georgia       director      georgia                    Specialty didnt match
3  4     driver                   nurse     maryland  Both specialty and location didnt match
4  5   director  florida                     florida                    Specialty didnt match
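As for why the original attempt raised the TypeError: in Python, & binds more tightly than comparison operators such as !=, so an unparenthesized a != b & c != d is evaluated as a != (b & c) != d. The b & c part is then applied element by element to two string columns, which is most likely where the "unsupported operand type(s) for &: 'str' and 'str'" message came from. Here is a small sketch on a made-up two-row frame (the frame and the variable names are only for illustration):

```python
import pandas as pd

# Hypothetical data; dtype=object keeps the strings as plain Python str objects.
df = pd.DataFrame({
    'speciality': ['doctor', 'nurse'],
    'speciality_new': ['doctor', 'director'],
    'location': ['texas', 'dc'],
    'location_new': ['texas', 'alaska'],
}, dtype=object)

# Without parentheses, `df['speciality_new'] & df['location']` is evaluated first,
# which applies & to pairs of strings and fails.
try:
    df['speciality'] != df['speciality_new'] & df['location'] != df['location_new']
    raised = False
except TypeError:
    raised = True

# Option 1: wrap each comparison in parentheses.
both_diff = (df['speciality'] != df['speciality_new']) & (df['location'] != df['location_new'])

# Option 2: use .ne(), which avoids the precedence problem altogether.
both_diff_ne = df['speciality'].ne(df['speciality_new']) & df['location'].ne(df['location_new'])

print(raised)              # True
print(both_diff.tolist())  # [False, True]
```

Both options give the same boolean mask, which is why the answers above either parenthesize every comparison or use ne.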