Replace null values based on other rows

Time: 2020-05-27 06:59:29

Tags: python pandas dataframe

I have a dataframe with many columns (to keep the post simple, only col1, col2 and col3 are shown here):

id    col1       col2    col3   source_id
a1    765.3      234     cat    a5
a2    3298.3     none    dog    a4
a3    8762.1     27      rat    a8
a4    none       none    none   none       
a5    none       none    none   a6
a6    none       none    none   none

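I want to fill the none values by following source_id: the row whose id matches another row's source_id should take that row's values. For example, a1 has source_id a5, so the none values in row a5 must be replaced with the values from row a1; a5 in turn has source_id a6, so the none values in row a6 must be replaced with the values from row a5.

Output: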

id    col1       col2    col3   source_id
a1    765.3      234     cat    a5
a2    3298.3     none    dog    a4
a3    8762.1     27      rat    a8
a4    3298.3     none    dog    none       
a5    765.3      234     cat    a6
a6    765.3      234     cat    none

2 answers:

Answer 0 (score: 1)

First, it looks like none is a string, so replace those cells with missing values:

df = df.mask(df.eq('none'), None)
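For anyone who wants to try this, here is a minimal sketch that rebuilds the sample frame from the question and applies the same kind of conversion. It assumes every missing cell is the literal string 'none', as the table suggests. Calling mask without a replacement value is used here as a variant of the line above; it fills the masked cells with NaN.

```python
import pandas as pd

# Sample data from the question; 'none' is assumed to be a literal string.
df = pd.DataFrame({
    'id': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6'],
    'col1': [765.3, 3298.3, 8762.1, 'none', 'none', 'none'],
    'col2': [234, 'none', 27, 'none', 'none', 'none'],
    'col3': ['cat', 'dog', 'rat', 'none', 'none', 'none'],
    'source_id': ['a5', 'a4', 'a8', 'none', 'a6', 'none'],
})

# Cells equal to 'none' become missing (NaN); everything else is kept.
df = df.mask(df.eq('none'))

# Count the missing cells per column as a quick sanity check.
print(df.isna().sum())
```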

Then build a dictionary with connected_components from networkx:

import networkx as nx

# Create the graph from the dataframe
g = nx.Graph()
g.add_edges_from(df[['id','source_id']].dropna().itertuples(index=False))

connected_components = nx.connected_components(g)
# Find the component id of the nodes
node2id = {}
for cid, component in enumerate(connected_components):
    for node in component:
        node2id[node] = cid + 1

print (node2id)
{'a6': 1, 'a5': 1, 'a1': 1, 'a2': 2, 'a4': 2, 'a8': 3, 'a3': 3}
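If networkx is not available, the same node-to-component mapping can probably be built with a small union-find structure. This is only a sketch, not part of the answer: the edge list is typed in by hand from the question's id/source_id pairs, and find and union are hypothetical helper names.

```python
# Edges (id, source_id) from the question, with the 'none' rows dropped.
edges = [('a1', 'a5'), ('a2', 'a4'), ('a3', 'a8'), ('a5', 'a6')]

parent = {}

def find(node):
    """Return the representative of node's component (with path halving)."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node

def union(a, b):
    """Merge the components that contain a and b."""
    parent[find(a)] = find(b)

for a, b in edges:
    union(a, b)

# Number the components 1, 2, 3, ... in order of first appearance.
root2id = {}
node2id = {}
for node in list(parent):
    root = find(node)
    node2id[node] = root2id.setdefault(root, len(root2id) + 1)

print(node2id)
```

The resulting node2id can be passed to df['id'].map(...) in the same way as the networkx version.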

Finally, group by the id column mapped through node2id and replace None by forward and backward filling within each group:

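# group_keys=False keeps the original row index, so assign() can realign source_id
df1 = (df.groupby(df['id'].map(node2id), group_keys=False)
         .apply(lambda x: x.ffill().bfill())
         .assign(source_id = df['source_id']))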
print (df1)
   id    col1  col2 col3 source_id
0  a1   765.3   234  cat        a5
1  a2  3298.3  None  dog        a4
2  a3  8762.1    27  rat        a8
3  a4  3298.3  None  dog      None
4  a5   765.3   234  cat        a6
5  a6   765.3   234  cat      None

Answer 1 (score: 0)

The first thing you should do is set the id column as the index, so that you can look up the row whose cells need filling:

df = df.set_index('id')

Then you can iterate over the columns and fill them:

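for col in df.columns:
    if col == 'source_id':
        continue
    for idx in df.index:
        # Row that should inherit values from the current row
        dst_idx = df.loc[idx, 'source_id']
        if (df.loc[idx, col] != 'none'
                and dst_idx != 'none'
                and dst_idx in df.index
                and df.loc[dst_idx, col] == 'none'):
            # .loc avoids chained assignment (df[col][dst_idx] = ...)
            df.loc[dst_idx, col] = df.loc[idx, col]

print(df)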
      col1  col2 col3 source_id
id
a1   765.3   234  cat        a5
a2  3298.3  none  dog        a4
a3  8762.1    27  rat        a8
a4  3298.3  none  dog      none
a5   765.3   234  cat        a6
a6   765.3   234  cat      none
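Note that the loop above makes a single top-to-bottom pass, so it works here because each source row (a1, then a5) appears before the row it fills. If the rows came in a different order, one possible variant is to repeat the pass until nothing changes. The sketch below uses a hypothetical reordering of a subset of the question's data (a5 listed before a1) and is only an illustration of that idea, not the answerer's code.

```python
import pandas as pd

# Subset of the question's rows, deliberately reordered so that a5 is
# visited before it has been filled from a1 (hypothetical ordering).
df = pd.DataFrame({
    'id': ['a5', 'a1', 'a6'],
    'col1': ['none', 765.3, 'none'],
    'col3': ['none', 'cat', 'none'],
    'source_id': ['a6', 'a5', 'none'],
}).set_index('id')

value_cols = [c for c in df.columns if c != 'source_id']

changed = True
while changed:  # repeat until a full pass copies nothing
    changed = False
    for idx in df.index:
        dst_idx = df.loc[idx, 'source_id']
        if dst_idx == 'none' or dst_idx not in df.index:
            continue
        for col in value_cols:
            if df.loc[idx, col] != 'none' and df.loc[dst_idx, col] == 'none':
                df.loc[dst_idx, col] = df.loc[idx, col]
                changed = True

print(df)
```

Each copy removes one 'none' cell, so the loop stops after a finite number of passes even if the source_id links form a cycle.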