Remove NaN 'cells' without dropping the entire row (pandas, Python 3)

Time: 2014-09-19 20:32:27

Tags: python python-3.x pandas

I currently have a DataFrame like this:

 Word       Word2          Word3
 Hello      NaN            NaN
 My         My Name        NaN
 Yellow     Yellow Bee     Yellow Bee Hive
 Golden     Golden Gates   NaN
 Yellow     NaN            NaN

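I want to remove all NaN cells from my DataFrame, so that in the end it looks like this, with 'Yellow Bee Hive' moved up to row 1 (similar to what happens when you delete cells from a column in Excel):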

   Word       Word2             Word3
1  Hello      My Name        Yellow Bee Hive
2  My         Yellow Bee       
3  Yellow     Golden Gates             
4  Golden       
5  Yellow    

Unfortunately, neither of these works, because they drop entire rows!

 df = df[pd.notnull(df['Word','Word2','Word3'])]

 df = df.dropna() 

Does anyone have any suggestions? Should I reindex the table?
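
A minimal sketch of why `df.dropna()` behaves this way, assuming the sample frame is built by hand (the construction below is an assumption; the question does not show how `df` was created). With its default arguments (`axis=0`, `how='any'`), `dropna` removes every row that contains at least one NaN, whereas calling `dropna` on a single column removes only that column's missing cells:

```python
import numpy as np
import pandas as pd

# Hand-built copy of the frame from the question (assumed construction)
df = pd.DataFrame({
    'Word': ['Hello', 'My', 'Yellow', 'Golden', 'Yellow'],
    'Word2': [np.nan, 'My Name', 'Yellow Bee', 'Golden Gates', np.nan],
    'Word3': [np.nan, np.nan, 'Yellow Bee Hive', np.nan, np.nan],
})

# Row-wise: any row with a NaN anywhere is dropped, so only row 2 survives
rows_kept = df.dropna()
print(rows_kept)

# Column-wise: only the missing cells of this one column are dropped
word3_kept = df['Word3'].dropna()
print(word3_kept)
```

Because each column keeps a different number of values after a per-column `dropna`, the columns have to be compacted one at a time and padded back to a common length, which is what the answers below do.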

3 answers:

Answer 0 (score: 3)

import numpy as np
import pandas as pd
import functools

def drop_and_roll(col, na_position='last', fillvalue=np.nan):
    result = np.full(len(col), fillvalue, dtype=col.dtype)
    mask = col.notnull()
    N = mask.sum()
    if na_position == 'last':
        result[:N] = col.loc[mask]
    elif na_position == 'first':
        result[len(col) - N:] = col.loc[mask]  # explicit start: result[-0:] would select every slot when N == 0
    else:
        raise ValueError('na_position {!r} unrecognized'.format(na_position))
    return result

df = pd.read_table('data', sep=r'\s{2,}', engine='python')

print(df.apply(functools.partial(drop_and_roll, fillvalue='')))

which yields:

     Word         Word2            Word3
0   Hello       My Name  Yellow Bee Hive
1      My    Yellow Bee                 
2  Yellow  Golden Gates                 
3  Golden                               
4  Yellow     
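
As a usage sketch, the same function can also push the values to the bottom of each column with `na_position='first'`. The copy of `drop_and_roll` below is lightly adapted (it uses an explicit start index so that an all-NaN column is handled), and the hand-built frame with `dtype=object` is an assumption made so that `np.full` receives a plain NumPy dtype:

```python
import functools

import numpy as np
import pandas as pd

def drop_and_roll(col, na_position='last', fillvalue=np.nan):
    # Start from a column full of the fill value, then copy the non-null
    # values into the top or the bottom of it
    result = np.full(len(col), fillvalue, dtype=col.dtype)
    mask = col.notnull()
    N = mask.sum()
    if na_position == 'last':
        result[:N] = col.loc[mask]
    elif na_position == 'first':
        result[len(col) - N:] = col.loc[mask]
    else:
        raise ValueError('na_position {!r} unrecognized'.format(na_position))
    return result

# Hand-built copy of the question's frame (assumed construction)
df = pd.DataFrame({
    'Word': ['Hello', 'My', 'Yellow', 'Golden', 'Yellow'],
    'Word2': [np.nan, 'My Name', 'Yellow Bee', 'Golden Gates', np.nan],
    'Word3': [np.nan, np.nan, 'Yellow Bee Hive', np.nan, np.nan],
}, dtype=object)

# Values sink to the bottom; the freed cells at the top become ''
shifted = df.apply(
    functools.partial(drop_and_roll, na_position='first', fillvalue=''))
print(shifted)
```

Each column is processed independently, so the relative order of the surviving values within a column is preserved.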

Answer 1 (score: 1)

Since you want the values to shift up, you have to create a new DataFrame.

Starting with:

     Word         Word2
0   Hello           NaN
1      My       My Name
2  Yellow    Yellow Bee
3  Golden  Golden Gates
4  Yellow           NaN

Use the following approach:

import numpy as np
import pandas as pd

def get_column_array(df, column):
    expected_length = len(df)
    # Keep only the non-NaN values, in their original order
    current_array = df[column].dropna().values
    # Pad with empty strings so every column keeps the original length
    if len(current_array) < expected_length:
        current_array = np.append(current_array, [''] * (expected_length - len(current_array)))
    return current_array

pd.DataFrame({column: get_column_array(df, column) for column in df.columns})

This gives:

     Word         Word2
0   Hello       My Name
1      My    Yellow Bee
2  Yellow  Golden Gates
3  Golden              
4  Yellow              

You can also use the same function to edit the existing df in place:

for column in df.columns:
    df[column] = get_column_array(df, column)

Answer 2 (score: 1)

I think you can use this:

df = df.apply(lambda x: pd.Series(x.dropna().values))

For example:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Word':['Hello', 'My', 'Yellow', 'Golden', 'Yellow'],
    'Word2':[np.nan, 'My Name', 'Yellow Bee', 'Golden Gates', np.nan],
    'Word3':[np.nan, np.nan, 'Yellow Bee Hive', np.nan, np.nan]
})

print(df)

Initial DataFrame:

     Word         Word2            Word3
0   Hello           NaN              NaN
1      My       My Name              NaN
2  Yellow    Yellow Bee  Yellow Bee Hive
3  Golden  Golden Gates              NaN
4  Yellow           NaN              NaN

Then apply this lambda function:

df = df.apply(lambda x: pd.Series(x.dropna().values))

print(df)

This gives:

     Word         Word2            Word3
0   Hello       My Name  Yellow Bee Hive
1      My    Yellow Bee              NaN
2  Yellow  Golden Gates              NaN
3  Golden           NaN              NaN
4  Yellow           NaN              NaN

Then you can fill the NaN values with empty strings:

df = df.fillna('')

print(df)

     Word         Word2            Word3
0   Hello       My Name  Yellow Bee Hive
1      My    Yellow Bee                 
2  Yellow  Golden Gates                 
3  Golden                               
4  Yellow
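
One caveat with the `apply` one-liner, sketched here on a hypothetical smaller frame (not taken from the question): the result has only as many rows as the longest compacted column, so if every column contains at least one NaN the frame shrinks. Assuming the original row count should be kept, reindexing onto `range(len(df))` before filling is one way to restore it:

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which every column has at least one NaN
df = pd.DataFrame({
    'Word': ['Hello', np.nan, 'Yellow'],
    'Word2': [np.nan, 'My Name', np.nan],
})

# Compact each column; the new Series get a fresh 0..n-1 index,
# so the combined frame has only two rows here
compact = df.apply(lambda x: pd.Series(x.dropna().values))
print(compact)

# Restore the original number of rows, then blank out the missing cells
padded = compact.reindex(range(len(df))).fillna('')
print(padded)
```

In the question's data the `Word` column has no NaN, so the row count is already preserved there and the extra `reindex` is a no-op.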