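How to merge consecutive rows where a column is NaN

Time: 2018-12-19 14:14:46

Tags: python pandas dataframe

I have data like this and it is driving me crazy. The source is a PDF file that I read with tabula to extract its tables. The problem is that some table rows span multiple lines in the document, and this is how they show up in the output.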

> sub_df.iloc[85:95]
1      Acronym     Meaning
86      ABC        Aaaaa Bbbbb Ccccc
87      CDE        Ccccc Ddddd Eeeee
88      NaN        Fffff Ggggg 
89      FGH        NaN
90      NaN        Hhhhh
91      IJK        Iiiii Jjjjj Kkkkk
92      LMN        Lllll Mmmmm Nnnnn
93      OPQ        Ooooo Ppppp Qqqqq
94      RST        Rrrrr Sssss Ttttt
95      UVZ        Uuuuu Vvvvv Zzzzz

What I want to get is something like this.

> sub_df.iloc[85:95]
1      Acronym     Meaning
86      ABC        Aaaaa Bbbbb Ccccc
87      CDE        Ccccc Ddddd Eeeee
88      FGH        Fffff Ggggg Hhhhh      
91      IJK        Iiiii Jjjjj Kkkkk
92      LMN        Lllll Mmmmm Nnnnn
93      OPQ        Ooooo Ppppp Qqqqq
94      RST        Rrrrr Sssss Ttttt
95      UVZ        Uuuuu Vvvvv Zzzzz

I have been struggling with combine_first for this:

sub_df.iloc[[88]].combine_first(sub_df.iloc[[87]])

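But the result is not what I expected.

Solutions using groupby are also welcome.

Note: the index does not matter and can be reset. I just want to join the consecutive rows that have NaN in some column and then dump the result to a CSV, so I don't need the index values.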
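To reproduce the excerpt, a frame like the following should be close. It is only a sketch: it assumes the leading column is an ordinary column named '1' (the answers below treat it that way), it drops the trailing space in row 88, and the name `df` is chosen for illustration.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the excerpt above; the column name '1' is an assumption
df = pd.DataFrame({
    "1": [86, 87, 88, 89, 90, 91, 92, 93, 94, 95],
    "Acronym": ["ABC", "CDE", np.nan, "FGH", np.nan,
                "IJK", "LMN", "OPQ", "RST", "UVZ"],
    "Meaning": ["Aaaaa Bbbbb Ccccc", "Ccccc Ddddd Eeeee", "Fffff Ggggg",
                np.nan, "Hhhhh", "Iiiii Jjjjj Kkkkk", "Lllll Mmmmm Nnnnn",
                "Ooooo Ppppp Qqqqq", "Rrrrr Sssss Ttttt", "Uuuuu Vvvvv Zzzzz"],
})

# The three rows that together make up the single logical FGH entry
print(df.iloc[2:5])
```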

3 answers:

Answer 0 (score: 2)

Let's try this:

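# Forward-fill Meaning so a row that only carries an acronym inherits the text above it
df = df.assign(Meaning = df['Meaning'].ffill())
# Drop acronym-less rows whose (filled) text is repeated on a later row
mask = ~((df.Meaning.duplicated(keep='last')) & df.Acronym.isnull())

df = df[mask]

# The remaining acronym-less rows are continuation lines: give them the acronym above
df = df.assign(Acronym = df['Acronym'].ffill())

# Concatenate the words of every fragment within each acronym
df_out = df.groupby('Acronym').apply(lambda x: ' '.join(x['Meaning'].str.split(r'\s').sum())).reset_index()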

Output:

  Acronym                  0
0     ABC  Aaaaa Bbbbb Ccccc
1     CDE  Ccccc Ddddd Eeeee
2     FGH  Fffff Ggggg Hhhhh
3     IJK  Iiiii Jjjjj Kkkkk
4     LMN  Lllll Mmmmm Nnnnn
5     OPQ  Ooooo Ppppp Qqqqq
6     RST  Rrrrr Sssss Ttttt
7     UVZ  Uuuuu Vvvvv Zzzzz

Answer 1 (score: 2)

This is a fairly tricky problem; neither ffill nor bfill alone fits it.

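# True for complete rows, False for rows with a NaN in Acronym or Meaning
s1=(~(df.Acronym.isnull()|df.Meaning.isnull()))
# Label each run of consecutive rows with the same status, so every block of bad rows shares one id
s=s1.astype(int).diff().ne(0).cumsum()
bad=df[~s1] # only the incomplete rows get merged
good=df[s1] # complete rows are kept unchanged

# Collapse each block of bad rows: first non-null '1' and Acronym, Meaning fragments joined with a space
bad=bad.groupby(s.loc[bad.index]).agg({'1':'first','Acronym':'first','Meaning':lambda x : ' '.join(x.dropna().str.strip())})

pd.concat([good,bad]).sort_index()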
Out[107]: 
    1 Acronym            Meaning
0  86     ABC  Aaaaa Bbbbb Ccccc
1  87     CDE  Ccccc Ddddd Eeeee
2  88     FGH  Fffff Ggggg Hhhhh
5  91     IJK  Iiiii Jjjjj Kkkkk
6  92     LMN  Lllll Mmmmm Nnnnn
7  93     OPQ  Ooooo Ppppp Qqqqq
8  94     RST  Rrrrr Sssss Ttttt
9  95     UVZ  Uuuuu Vvvvv Zzzzz
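The run-labelling step is the core of this approach. A minimal sketch on a hand-written boolean Series may make it easier to follow; the values are made up to mirror the sample, and `is_complete`, `run_id` and `run_id_alt` are illustrative names rather than anything from the answer.

```python
import pandas as pd

# True marks a complete row, False a row containing a NaN (values mirror the sample)
is_complete = pd.Series([True, True, False, False, False, True, True])

# diff() is non-zero wherever the status flips (and NaN on the first row,
# which also counts as "not equal to 0"); cumsum() turns the flips into run ids
run_id = is_complete.astype(int).diff().ne(0).cumsum()

# An equivalent idiom: compare each value with its predecessor
run_id_alt = is_complete.ne(is_complete.shift()).cumsum()

print(run_id.tolist())  # [1, 1, 2, 2, 2, 3, 3]
```

Every block of consecutive incomplete rows ends up with its own id (2 here), which is what lets groupby collapse the block into a single row.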

Answer 2 (score: 2)

Here is an approach that uses numpy.where for a conditional fill:

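import numpy as np

# Fill a missing Acronym from the row above when the previous row has no Meaning,
# otherwise from the row below
df['Acronym'] = np.where(df[['Acronym']].assign(Meaning=df.Meaning.shift()).isna().all(axis=1),
                         df.Acronym.ffill(),
                         df.Acronym.bfill())

# Join the Meaning fragments that now share an Acronym
clean_meaning = df.dropna().groupby('Acronym')['Meaning'].apply(lambda x : ' '.join(x)).to_frame()

# Keep the first row of each Acronym and attach the joined Meaning
df_new = (df[['1', 'Acronym']]
          .drop_duplicates(subset=['Acronym'])
          .merge(clean_meaning,
                 left_on='Acronym',
                 right_index=True))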

[out]

    1 Acronym            Meaning
0  86     ABC  Aaaaa Bbbbb Ccccc
1  87     CDE  Ccccc Ddddd Eeeee
2  88     FGH  Fffff Ggggg Hhhhh
5  91     IJK  Iiiii Jjjjj Kkkkk
6  92     LMN  Lllll Mmmmm Nnnnn
7  93     OPQ  Ooooo Ppppp Qqqqq
8  94     RST  Rrrrr Sssss Ttttt
9  95     UVZ  Uuuuu Vvvvv Zzzzz
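The choice of fill direction can be seen in isolation on two short Series. This is only a sketch: the five rows are a cut-down copy of the sample, and `acronym`, `meaning`, `use_previous` and `filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Cut-down copy of the sample: the FGH entry is spread over rows 1 to 3
acronym = pd.Series(["CDE", np.nan, "FGH", np.nan, "IJK"])
meaning = pd.Series(["Ccccc Ddddd Eeeee", "Fffff Ggggg", np.nan,
                     "Hhhhh", "Iiiii Jjjjj Kkkkk"])

# A missing acronym whose previous row has no meaning is a trailing
# continuation line, so it takes the acronym from above
use_previous = acronym.isna() & meaning.shift().isna()

# All other rows take the next available acronym (their own, if present)
filled = pd.Series(np.where(use_previous, acronym.ffill(), acronym.bfill()))

print(filled.tolist())  # ['CDE', 'FGH', 'FGH', 'FGH', 'IJK']
```

Row 1 is back-filled because the row before it is complete, while row 3 is forward-filled because row 2 has no meaning of its own.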