Fill blanks in a DataFrame from the parent rows

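Time: 2021-07-02 15:41:19

Tags: python pandas

Say I have a dataset of speed limits. The idea is that each region or city can either apply its own rule or "inherit" the rule of its parent entity.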

+-------------+---------------------------+---------------------+-----------+
| country     | region                    | city                | max_speed |
+-------------+---------------------------+---------------------+-----------+
| France      |                           |                     | 50        |
+-------------+---------------------------+---------------------+-----------+
| France      | Bretagne                  |                     | 70        |
+-------------+---------------------------+---------------------+-----------+
| France      | Bretagne                  | Saint-Grégoire      |           |
+-------------+---------------------------+---------------------+-----------+
| France      | Bretagne                  | Saint-Malo          | 30        |
+-------------+---------------------------+---------------------+-----------+
| France      | Île-de-France             |                     |           |
+-------------+---------------------------+---------------------+-----------+
| France      | Île-de-France             | Saint-Cloud         |           |
+-------------+---------------------------+---------------------+-----------+
| France      | Île-de-France             | Vélizy-Villacoublay | 50        |
+-------------+---------------------------+---------------------+-----------+
| Germany     |                           |                     | 70        |
+-------------+---------------------------+---------------------+-----------+
| Germany     | Bayern                    |                     |           |
+-------------+---------------------------+---------------------+-----------+
| Germany     | Bayern                    | Nürnberg            |           |
+-------------+---------------------------+---------------------+-----------+
| Netherlands |                           |                     | 90        |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland      |                     |           |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland      | Harderwijk          |           |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland   |                     | 70        |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland   | Haarlem             |           |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland   | Hoorn               | 30        |
+-------------+---------------------------+---------------------+-----------+

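Whenever a max_speed value is missing, it should be inferred from the parent. For example, Saint-Grégoire takes the speed limit of Bretagne, while Harderwijk and Nürnberg take the limit of their country (90 and 70 respectively), because their regions do not set one either.

So, given this DataFrame: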

import pandas as pd

data = {'country': ['France', 'France', 'France', 'France', 'France', 'France', 'France', 'Germany', 'Germany', 'Germany', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands'],
'region': [None, 'Bretagne', 'Bretagne', 'Bretagne', 'Île-de-France', 'Île-de-France', 'Île-de-France', None, 'Bayern', 'Bayern', None, 'Provincie Gelderland', 'Provincie Gelderland', 'Provincie Noord-Holland', 'Provincie Noord-Holland', 'Provincie Noord-Holland'],
'city': [None, None, 'Saint-Grégoire', 'Saint-Malo', None, 'Saint-Cloud', 'Vélizy-Villacoublay', None, None, 'Nürnberg', None, None, 'Harderwijk', None, 'Haarlem', 'Hoorn'],
'max_speed': [50, 70, None, 30, None, None, 50, 70, None, None, 90, None, None, 70, None, 30]}

speed_limits = pd.DataFrame(data)

how can I fill in the missing values in max_speed to get:

+-------------+-------------------------+---------------------+-----------+
| country     | region                  | city                | max_speed |
+-------------+-------------------------+---------------------+-----------+
| France      |                         |                     |        50 |
+-------------+-------------------------+---------------------+-----------+
| France      | Bretagne                |                     |        70 |
+-------------+-------------------------+---------------------+-----------+
| France      | Bretagne                | Saint-Grégoire      |        70 |
+-------------+-------------------------+---------------------+-----------+
| France      | Bretagne                | Saint-Malo          |        30 |
+-------------+-------------------------+---------------------+-----------+
| France      | Île-de-France           |                     |        50 |
+-------------+-------------------------+---------------------+-----------+
| France      | Île-de-France           | Saint-Cloud         |        50 |
+-------------+-------------------------+---------------------+-----------+
| France      | Île-de-France           | Vélizy-Villacoublay |        50 |
+-------------+-------------------------+---------------------+-----------+
| Germany     |                         |                     |        70 |
+-------------+-------------------------+---------------------+-----------+
| Germany     | Bayern                  |                     |        70 |
+-------------+-------------------------+---------------------+-----------+
| Germany     | Bayern                  | Nürnberg            |        70 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands |                         |                     |        90 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland    |                     |        90 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland    | Harderwijk          |        90 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland |                     |        70 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | Haarlem             |        70 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | Hoorn               |        30 |
+-------------+-------------------------+---------------------+-----------+

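I have been trying to write a function to apply to every row where max_speed==np.NaN, which retrieves the parent (after working out whether the missing value belongs to a region or a city) and returns its max_speed value. Besides not having much success with this, I am not even sure it is the smartest approach.

Any ideas?

2 answers:

Answer 0: (score: 0)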

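Here is my attempt. It is commented and has a few print() calls for debugging. The query() statements rely on the fact that pandas/NumPy use np.nan != np.nan and treat None like np.nan. See one of the notes/warnings on this page: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html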
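The self-comparison trick can be seen in isolation on a tiny frame. This is only an illustrative sketch: the frame `demo` and its values are made up for the example, and the `city` column is forced to object dtype so that None is stored as-is.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (not the question's data).
demo = pd.DataFrame({
    "city": pd.Series(["Hoorn", None, "Haarlem"], dtype=object),
    "max_speed": [30.0, np.nan, np.nan],
})

# A missing value never compares equal to itself,
# so "col != col" keeps only the rows where col is missing...
missing_speed = demo.query("max_speed != max_speed")
# ...and "col == col" keeps only the rows where col has a value.
has_city = demo.query("city == city")

# Conditions can be combined: cities whose speed limit is still unset.
cities_to_fix = demo.query("(city == city) and (max_speed != max_speed)")

# The same selections spelled with isna()/notna().
assert missing_speed.equals(demo[demo["max_speed"].isna()])
assert has_city.equals(demo[demo["city"].notna()])
```

Writing the masks with isna()/notna() is arguably easier to read; the query() form is simply what the code below uses.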

import pandas as pd
data = {'country': ['France', 'France', 'France', 'France', 'France', 'France', 'France', 'Germany', 'Germany', 'Germany', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands'],
'region': [None, 'Bretagne', 'Bretagne', 'Bretagne', 'Île-de-France', 'Île-de-France', 'Île-de-France', None, 'Bayern', 'Bayern', None, 'Provincie Gelderland', 'Provincie Gelderland', 'Provincie Noord-Holland', 'Provincie Noord-Holland', 'Provincie Noord-Holland'],
'city': [None, None, 'Saint-Grégoire', 'Saint-Malo', None, 'Saint-Cloud', 'Vélizy-Villacoublay', None, None, 'Nürnberg', None, None, 'Harderwijk', None, 'Haarlem', 'Hoorn'],
'max_speed': [50, 70, None, 30, None, None, 50, 70, None, None, 90, None, None, 70, None, 30]}

df = pd.DataFrame(data)

# 1) split the initial df into multiple dfs, using df.query():
# - countries - we assume that all of them have max_speed
# - regions - two categories
#   - max speed set
#   - max speed unset
# - cities - two categories
#   - max speed set
#   - max speed unset
#
# 2) use merge/join to update the max speed for the categories with max speed unset
#
# 3) concatenate all sets; this is the final result
# None/NaN are replaced with an empty string for nice printing

# those will have speed set
# the city comparison is superfluous, but kept for consistency
df_countries_only = df.query("(region != region) and (city != city) ")
print(df_countries_only)

# fix the regions
df_regions_to_fix = df.query("(city != city) and (max_speed != max_speed) and (region == region)")
df_regions_ok = df.query("(city != city) and (max_speed == max_speed) and (region == region)")

df_regions_speed = pd.merge(df_countries_only.drop(['region', 'city'], axis=1), 
        df_regions_to_fix.drop(['max_speed'], axis=1), how="inner", on=["country"])
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_regions_speed = pd.concat([df_regions_speed, df_regions_ok])
print(df_regions_speed)

df_cities_to_fix = df.query("(city == city) and (max_speed != max_speed)")
df_cities_ok = df.query("(city == city) and (max_speed == max_speed)")

df_cities_speed = pd.merge(df_regions_speed.drop(['city'], axis=1), 
        df_cities_to_fix.drop(['max_speed'], axis=1), how="inner", on=["country", "region"])

print(df_cities_speed)

# now rebuild final df
# pd.concat replaces the chained DataFrame.append calls (removed in pandas 2.0)
df_all_data = pd.concat([df_cities_speed, df_cities_ok, df_regions_speed, df_countries_only])
print("\n\n")
print(df_all_data.sort_values(by=['country', 'region', 'city']).fillna("")[['country', 'region', 'city', 'max_speed']])
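The same merge idea can be written more compactly by attaching the parent limits as extra columns instead of splitting the frame into pieces. This is an alternative sketch, not the answerer's code: it runs on a reduced sample of the question's data, the names `country_speed` and `region_speed` are made up for the example, and it assumes every country row has a limit and each country and region appears only once as a parent row.

```python
import pandas as pd

# Reduced sample of the question's data, enough to exercise both levels.
df = pd.DataFrame({
    "country": ["France", "France", "France",
                "Netherlands", "Netherlands", "Netherlands"],
    "region": [None, "Bretagne", "Bretagne",
               None, "Provincie Gelderland", "Provincie Gelderland"],
    "city": [None, None, "Saint-Grégoire", None, None, "Harderwijk"],
    "max_speed": [50, 70, None, 90, None, None],
})

# Lookup of country-level limits: rows that have no region.
country_speed = (df.loc[df["region"].isna(), ["country", "max_speed"]]
                   .rename(columns={"max_speed": "country_speed"}))

# Lookup of region-level limits: rows with a region but no city.
is_region_row = df["region"].notna() & df["city"].isna()
region_speed = (df.loc[is_region_row, ["country", "region", "max_speed"]]
                  .rename(columns={"max_speed": "region_speed"}))

# Left merges keep the original row order and attach both parent limits.
merged = (df.merge(country_speed, on="country", how="left")
            .merge(region_speed, on=["country", "region"], how="left"))

# Own value first, then the region's, then the country's.
df["max_speed"] = (merged["max_speed"]
                   .fillna(merged["region_speed"])
                   .fillna(merged["country_speed"])
                   .to_numpy())

print(df)
```

Because the rows are never split apart, no final sort is needed to restore the original order.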

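Answer 1: (score: 0)

Use ffill() to do the work. First propagate the country and region speed limits vertically and set up a city speed limit column. Then propagate the speed limits from left to right to obtain the inherited maximum speed limit. This relies on each country row coming before its regions, and each region row before its cities, as in the question's data.

Create a working DataFrame: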

import numpy as np

wf = speed_limits.copy()

Copy and propagate the country speed limits:

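# notna() is explicit about missing regions instead of relying on the truthiness of None
wf['cntry_spd'] = pd.Series(np.where(wf['region'].notna(), np.nan, wf['max_speed']), index=wf.index).ffill()

Copy the region speed limits and propagate them within each region: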

wf['reg_spd'] = np.where(~wf['region'].isna() & wf['city'].isna(), wf['max_speed'], np.nan)
wf['reg_spd'] = wf.groupby(['country','region'])['reg_spd'].ffill() 

Create a city-only speed limit column:

wf['city_spd'] = np.where(~wf['city'].isna(), wf['max_speed'], np.nan)

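Set the max_speed column on the speed_limits DF by forward filling NAs from left to right across the cntry_spd, reg_spd and city_spd columns, so that limits are inherited wherever they are not already set: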

speed_limits['max_speed'] = wf[['cntry_spd','reg_spd','city_spd']].ffill(axis=1)['city_spd']

Result:

        country                   region                 city  max_speed
0        France                     None                 None       50.0
1        France                 Bretagne                 None       70.0
2        France                 Bretagne       Saint-Grégoire       70.0
3        France                 Bretagne           Saint-Malo       30.0
4        France            Île-de-France                 None       50.0
5        France            Île-de-France          Saint-Cloud       50.0
6        France            Île-de-France  Vélizy-Villacoublay       50.0
7       Germany                     None                 None       70.0
8       Germany                   Bayern                 None       70.0
9       Germany                   Bayern             Nürnberg       70.0
10  Netherlands                     None                 None       90.0
11  Netherlands     Provincie Gelderland                 None       90.0
12  Netherlands     Provincie Gelderland           Harderwijk       90.0
13  Netherlands  Provincie Noord-Holland                 None       70.0
14  Netherlands  Provincie Noord-Holland              Haarlem       70.0
15  Netherlands  Provincie Noord-Holland                Hoorn       30.0
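
The steps above can also be gathered into one function. The following is a sketch of a variant rather than the code above verbatim: the name `inherit_max_speed` is made up, it uses Series.where() with isna()/notna() masks in place of np.where, and it keeps the assumption that every parent row appears before its children.

```python
import pandas as pd

data = {'country': ['France', 'France', 'France', 'France', 'France', 'France', 'France', 'Germany', 'Germany', 'Germany', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands'],
'region': [None, 'Bretagne', 'Bretagne', 'Bretagne', 'Île-de-France', 'Île-de-France', 'Île-de-France', None, 'Bayern', 'Bayern', None, 'Provincie Gelderland', 'Provincie Gelderland', 'Provincie Noord-Holland', 'Provincie Noord-Holland', 'Provincie Noord-Holland'],
'city': [None, None, 'Saint-Grégoire', 'Saint-Malo', None, 'Saint-Cloud', 'Vélizy-Villacoublay', None, None, 'Nürnberg', None, None, 'Harderwijk', None, 'Haarlem', 'Hoorn'],
'max_speed': [50, 70, None, 30, None, None, 50, 70, None, None, 90, None, None, 70, None, 30]}

speed_limits = pd.DataFrame(data)


def inherit_max_speed(df):
    """Return max_speed with gaps filled from the region, then the country.

    Assumes each country row precedes its regions and each region row
    precedes its cities (hypothetical helper, for illustration only).
    """
    is_country = df["region"].isna()
    is_region = df["region"].notna() & df["city"].isna()
    is_city = df["city"].notna()
    speed = df["max_speed"]

    levels = pd.DataFrame({
        # Country limit, carried down to every row of that country.
        "cntry_spd": speed.where(is_country).groupby(df["country"]).ffill(),
        # Region limit, carried down only within the same country and region.
        "reg_spd": speed.where(is_region)
                        .groupby([df["country"], df["region"]]).ffill(),
        # A city's own limit, if it has one.
        "city_spd": speed.where(is_city),
    })
    # Row-wise forward fill: the last column ends up holding the most
    # specific limit available for each row.
    return levels.ffill(axis=1)["city_spd"]


speed_limits["max_speed"] = inherit_max_speed(speed_limits)
print(speed_limits)
```

Grouping the country fill by country means a country row without a limit would stay empty rather than silently pick up the previous country's value.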