Suppose I have a dataset of speed limits. The idea is that each region or city can either apply its own rule or "inherit" the rule of its parent entity.
+-------------+---------------------------+---------------------+-----------+
| country | region | city | max_speed |
+-------------+---------------------------+---------------------+-----------+
| France | | | 50 |
+-------------+---------------------------+---------------------+-----------+
| France | Bretagne | | 70 |
+-------------+---------------------------+---------------------+-----------+
| France | Bretagne | Saint-Grégoire | |
+-------------+---------------------------+---------------------+-----------+
| France | Bretagne | Saint-Malo | 30 |
+-------------+---------------------------+---------------------+-----------+
| France | Île-de-France | | |
+-------------+---------------------------+---------------------+-----------+
| France | Île-de-France | Saint-Cloud | |
+-------------+---------------------------+---------------------+-----------+
| France | Île-de-France | Vélizy-Villacoublay | 50 |
+-------------+---------------------------+---------------------+-----------+
| Germany | | | 70 |
+-------------+---------------------------+---------------------+-----------+
| Germany | Bayern | | |
+-------------+---------------------------+---------------------+-----------+
| Germany | Bayern | Nürnberg | |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | | | 90 |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland | | |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland | Harderwijk | |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | | 70 |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | Haarlem | |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | Hoorn | 30 |
+-------------+---------------------------+---------------------+-----------+
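Whenever the max_speed value is missing, it should be inferred from the parent's value. For example, the speed limit for Saint-Grégoire is that of Bretagne (70), while Harderwijk and Nürnberg take the limit of their respective countries (90 and 70).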
So, given this DataFrame:
import pandas as pd

data = {'country': ['France', 'France', 'France', 'France', 'France', 'France', 'France', 'Germany', 'Germany', 'Germany', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands'],
'region': [None, 'Bretagne', 'Bretagne', 'Bretagne', 'Île-de-France', 'Île-de-France', 'Île-de-France', None, 'Bayern', 'Bayern', None, 'Provincie Gelderland', 'Provincie Gelderland', 'Provincie Noord-Holland', 'Provincie Noord-Holland', 'Provincie Noord-Holland'],
'city': [None, None, 'Saint-Grégoire', 'Saint-Malo', None, 'Saint-Cloud', 'Vélizy-Villacoublay', None, None, 'Nürnberg', None, None, 'Harderwijk', None, 'Haarlem', 'Hoorn'],
'max_speed': [50, 70, None, 30, None, None, 50, 70, None, None, 90, None, None, 70, None, 30]}
speed_limits = pd.DataFrame(data)
How can I fill in the missing values in max_speed to obtain:
+-------------+-------------------------+---------------------+-----------+
| country | region | city | max_speed |
+-------------+-------------------------+---------------------+-----------+
| France | | | 50 |
+-------------+-------------------------+---------------------+-----------+
| France | Bretagne | | 70 |
+-------------+-------------------------+---------------------+-----------+
| France | Bretagne | Saint-Grégoire | 70 |
+-------------+-------------------------+---------------------+-----------+
| France | Bretagne | Saint-Malo | 30 |
+-------------+-------------------------+---------------------+-----------+
| France | Île-de-France | | 50 |
+-------------+-------------------------+---------------------+-----------+
| France | Île-de-France | Saint-Cloud | 50 |
+-------------+-------------------------+---------------------+-----------+
| France | Île-de-France | Vélizy-Villacoublay | 50 |
+-------------+-------------------------+---------------------+-----------+
| Germany | | | 70 |
+-------------+-------------------------+---------------------+-----------+
| Germany | Bayern | | 70 |
+-------------+-------------------------+---------------------+-----------+
| Germany | Bayern | Nürnberg | 70 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | | | 90 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland | | 90 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland | Harderwijk | 90 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | | 70 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | Haarlem | 70 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | Hoorn | 30 |
+-------------+-------------------------+---------------------+-----------+
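I have been trying to write a function to apply to every row where max_speed is NaN, which would look up the row's parent (after determining whether the missing value belongs to a region or a city) and return the parent's max_speed value. However, besides not having had much success with it, I am not even sure this is the smartest approach.
Any ideas?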
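Answer 0 (score: 0)
Here is my attempt. It is commented and has a few print() calls for debugging. The query() statements rely on the fact that pandas/NumPy evaluate np.nan != np.nan as True, and that pandas treats None as np.nan. See one of the notes/warnings on this page: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html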
import pandas as pd
data = {'country': ['France', 'France', 'France', 'France', 'France', 'France', 'France', 'Germany', 'Germany', 'Germany', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands'],
'region': [None, 'Bretagne', 'Bretagne', 'Bretagne', 'Île-de-France', 'Île-de-France', 'Île-de-France', None, 'Bayern', 'Bayern', None, 'Provincie Gelderland', 'Provincie Gelderland', 'Provincie Noord-Holland', 'Provincie Noord-Holland', 'Provincie Noord-Holland'],
'city': [None, None, 'Saint-Grégoire', 'Saint-Malo', None, 'Saint-Cloud', 'Vélizy-Villacoublay', None, None, 'Nürnberg', None, None, 'Harderwijk', None, 'Haarlem', 'Hoorn'],
'max_speed': [50, 70, None, 30, None, None, 50, 70, None, None, 90, None, None, 70, None, 30]}
df = pd.DataFrame(data)
# 1) Split the initial df into multiple dfs, using df.query():
#    - countries - we assume that all of them have max_speed
#    - regions - two categories
#      - max speed set
#      - max speed unset
#    - cities - two categories
#      - max speed set
#      - max speed unset
#
# 2) Use merge/join to update the max speed for the categories with max speed unset
#
# 3) Concatenate all the sets; this is the final result
#    (None/NaN replaced with empty strings for nicer printing)

# Country rows: these will have the speed set.
# The city comparison is superfluous, but kept for consistency.
df_countries_only = df.query("(region != region) and (city != city) ")
print(df_countries_only)
# fix the regions
df_regions_to_fix = df.query("(city != city) and (max_speed != max_speed) and (region == region)")
df_regions_ok = df.query("(city != city) and (max_speed == max_speed) and (region == region)")
df_regions_speed = pd.merge(df_countries_only.drop(['region', 'city'], axis=1),
df_regions_to_fix.drop(['max_speed'], axis=1), how="inner", on=["country"])
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_regions_speed = pd.concat([df_regions_speed, df_regions_ok])
print(df_regions_speed)
df_cities_to_fix = df.query("(city == city) and (max_speed != max_speed)")
df_cities_ok = df.query("(city == city) and (max_speed == max_speed)")
df_cities_speed = pd.merge(df_regions_speed.drop(['city'], axis=1),
df_cities_to_fix.drop(['max_speed'], axis=1), how="inner", on=["country", "region"])
print(df_cities_speed)
# now rebuild final df
df_all_data = pd.concat([df_cities_speed, df_cities_ok, df_regions_speed, df_countries_only])
print("\n\n")
print(df_all_data.sort_values(by=['country', 'region', 'city']).fillna("")[['country', 'region', 'city', 'max_speed']])
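The self-comparison trick that the query() calls above depend on can be checked in isolation. The following is a minimal sketch on a hypothetical three-row frame (the name `demo` and its values are made up for illustration, and the city column is pinned to object dtype to match the None values in the question's data). It also shows the equivalent isna() selection, which some readers may find easier to follow:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one complete row, one with a missing city (None),
# one with a missing speed (NaN).
demo = pd.DataFrame({
    "city": pd.Series(["Hoorn", None, "Haarlem"], dtype=object),
    "max_speed": [30.0, 70.0, np.nan],
})

# A missing value never compares equal to itself, so `col != col`
# keeps exactly the rows where `col` is missing.
missing_city = demo.query("city != city")
missing_speed = demo.query("max_speed != max_speed")

# The same selections written with isna(), which states the intent directly.
assert missing_city.equals(demo[demo["city"].isna()])
assert missing_speed.equals(demo[demo["max_speed"].isna()])

print(missing_city.index.tolist(), missing_speed.index.tolist())
```

Conversely, `col == col` keeps the rows where the value is present, which is how the "ok" subsets above are selected.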
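Answer 1 (score: 0)
Use ffill() to do the job. First propagate the country and region speed limits vertically and set up a city speed limit column. Then propagate the speed limits from left to right to obtain the inherited maximum speed limit.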
Create a working DataFrame:
import numpy as np
import pandas as pd

wf = speed_limits.copy()
Copy and propagate the country speed limits:
# notna() handles both None and NaN; index=wf.index keeps the rows aligned
wf['cntry_spd'] = pd.Series(np.where(wf['region'].notna(), np.nan, wf['max_speed']), index=wf.index).ffill()
Copy the region speed limits and propagate them within each region:
wf['reg_spd'] = np.where(~wf['region'].isna() & wf['city'].isna(), wf['max_speed'], np.nan)
wf['reg_spd'] = wf.groupby(['country','region'])['reg_spd'].ffill()
Create a city-only speed limit column:
wf['city_spd'] = np.where(~wf['city'].isna(), wf['max_speed'], np.nan)
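Set the max_speed column on the speed_limits DF by forward-filling NAs from left to right across the cntry_spd, reg_spd and city_spd columns, so that a speed limit is inherited wherever it is not yet set: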
speed_limits['max_speed'] = wf[['cntry_spd','reg_spd','city_spd']].ffill(axis=1)['city_spd']
Result:
country region city max_speed
0 France None None 50.0
1 France Bretagne None 70.0
2 France Bretagne Saint-Grégoire 70.0
3 France Bretagne Saint-Malo 30.0
4 France Île-de-France None 50.0
5 France Île-de-France Saint-Cloud 50.0
6 France Île-de-France Vélizy-Villacoublay 50.0
7 Germany None None 70.0
8 Germany Bayern None 70.0
9 Germany Bayern Nürnberg 70.0
10 Netherlands None None 90.0
11 Netherlands Provincie Gelderland None 90.0
12 Netherlands Provincie Gelderland Harderwijk 90.0
13 Netherlands Provincie Noord-Holland None 70.0
14 Netherlands Provincie Noord-Holland Haarlem 70.0
15 Netherlands Provincie Noord-Holland Hoorn 30.0
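The steps above can be wrapped into a single function. This is only a sketch under a few assumptions: the name `fill_inherited_speed` is made up, rows are ordered as in the question (each country row before its regions, each region row before its cities), and the country limit is forward-filled per country via groupby, which is an addition to the answer so that a limit cannot leak from one country into the next. The frame used here is a subset of the question's rows.

```python
import pandas as pd


def fill_inherited_speed(df):
    """Fill missing max_speed from the region row, else from the country row.

    Assumes each country row precedes its regions and each region row
    precedes its cities, as in the question's data.
    """
    is_country = df["region"].isna()
    is_region = df["region"].notna() & df["city"].isna()
    is_city = df["city"].notna()

    levels = pd.DataFrame(index=df.index)
    # Country limit, propagated down to the following rows of that country.
    levels["cntry_spd"] = (
        df["max_speed"].where(is_country).groupby(df["country"]).ffill()
    )
    # Region limit, propagated only within the same (country, region) group.
    levels["reg_spd"] = (
        df["max_speed"].where(is_region)
        .groupby([df["country"], df["region"]]).ffill()
    )
    # City limit, never propagated.
    levels["city_spd"] = df["max_speed"].where(is_city)

    # Left-to-right fill: the last column ends up holding the most
    # specific limit that is actually set.
    return levels.ffill(axis=1)["city_spd"]


speed_limits = pd.DataFrame({
    "country": ["France", "France", "France", "France",
                "Germany", "Germany", "Germany"],
    "region": [None, "Bretagne", "Bretagne", "Bretagne",
               None, "Bayern", "Bayern"],
    "city": [None, None, "Saint-Grégoire", "Saint-Malo",
             None, None, "Nürnberg"],
    "max_speed": [50, 70, None, 30, 70, None, None],
})
speed_limits["max_speed"] = fill_inherited_speed(speed_limits)
print(speed_limits)
```

If the rows are not guaranteed to be in parent-before-child order, sorting by country, region and city with missing values first (for example `sort_values(..., na_position="first")`) before calling the function should restore the ordering this sketch relies on.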