pandas: drop duplicate values column by column

Time: 2019-08-30 12:57:49

Tags: pandas

How can I drop duplicates column by column in a pandas DataFrame, so that each column keeps only its distinct values, moved to the top, with the remaining cells left empty? Given this frame:

set1    set2    set3    set4
apple   apple   orange  orange
apple   orange  banana  orange
orange  banana  pear    
banana  banana  lemon   
pear            lemon   
grape           lemon
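
For experimenting with the answers, the table above can be rebuilt as a DataFrame. This is only a sketch: it assumes the blank cells are NaN and that the frame is called `df`, the name the answers use.

```python
import numpy as np
import pandas as pd

# Rebuild the table from the question; blank cells are assumed to be NaN.
df = pd.DataFrame({
    "set1": ["apple", "apple", "orange", "banana", "pear", "grape"],
    "set2": ["apple", "orange", "banana", "banana", np.nan, np.nan],
    "set3": ["orange", "banana", "pear", "lemon", "lemon", "lemon"],
    "set4": ["orange", "orange", np.nan, np.nan, np.nan, np.nan],
})

# Number of distinct non-missing values per column.
print(df.nunique())
```

`nunique` ignores NaN by default, so it gives the number of rows each column should keep after de-duplication.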

4 answers:

Answer 0: (score: 3)

Use:

m=df.apply(lambda x:dict.fromkeys(x).keys())
pd.DataFrame(m.values.tolist(),index=m.index).T

Or, a neater way, courtesy of @piRSquared (note that a set does not preserve the original order of the values):

pd.DataFrame.from_dict({k: {*df[k].dropna()} for k in df}, orient='index').T

     set1    set2    set3    set4
0   apple   apple  orange  orange
1  orange  orange  banana     NaN
2  banana  banana    pear    None
3    pear     NaN   lemon    None
4   grape    None    None    None

Answer 1: (score: 3)

itertools.zip_longest

from itertools import zip_longest

pd.DataFrame(
    [*zip_longest(*({*df[c].dropna()} for c in df))],
    columns=[*df]
)

     set1    set2    set3    set4
0  banana  orange  banana  orange
1   grape  banana   lemon    None
2    pear   apple    pear    None
3   apple    None  orange    None
4  orange    None    None    None

collections.defaultdict and itertools.count

# %%timeit
from collections import defaultdict
from itertools import count
i = defaultdict(count)

pd.DataFrame({c: {next(i[c]): v for v in {*df[c].dropna()}} for c in df})

     set1    set2    set3    set4
0    pear   apple  orange  orange
1   grape  banana   lemon     NaN
2   apple  orange  banana     NaN
3  banana     NaN    pear     NaN
4  orange     NaN     NaN     NaN

Answer 2: (score: 3)

Here is another way, using pivot:

df.melt().dropna().drop_duplicates(['variable','value']).\
   assign(key=lambda x : x.groupby('variable').cumcount()).pivot(index='key',columns='variable',values='value')
Out[806]: 
variable    set1    set2    set3    set4
key                                     
0          apple   apple  orange  orange
1         orange  orange  banana     NaN
2         banana  banana    pear     NaN
3           pear     NaN   lemon     NaN
4          grape     NaN     NaN     NaN

Answer 3: (score: 1)

You can also use drop_duplicates:

df.apply(lambda x : x.drop_duplicates().reset_index(drop=True))


     set1    set2    set3    set4
0   apple   apple  orange  orange
1  orange  orange  banana     NaN
2  banana  banana    pear     NaN
3    pear     NaN   lemon     NaN
4   grape     NaN     NaN     NaN
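
A related sketch, not taken from any of the answers: if the first-seen order of the values should be kept and NaN should never count as a value, each column can be reduced with `dropna().unique()` and the pieces joined with `pd.concat`. The sample frame is an assumption that rebuilds the question's data with blanks as NaN.

```python
import numpy as np
import pandas as pd

# Assumed sample data: the question's frame, with blank cells as NaN.
df = pd.DataFrame({
    "set1": ["apple", "apple", "orange", "banana", "pear", "grape"],
    "set2": ["apple", "orange", "banana", "banana", np.nan, np.nan],
    "set3": ["orange", "banana", "pear", "lemon", "lemon", "lemon"],
    "set4": ["orange", "orange", np.nan, np.nan, np.nan, np.nan],
})

# For each column: drop NaN, keep unique values in first-seen order,
# and re-index from 0 so shorter columns are padded with NaN on concat.
result = pd.concat(
    {c: pd.Series(df[c].dropna().unique()) for c in df},
    axis=1,
)
print(result)
```

Because `Series.unique` returns values in order of appearance, the rows come out in the same order as the drop_duplicates approach rather than in arbitrary set order.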