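Unpacking lists and tuples from a DataFrame

Time: 2020-03-07 16:28:08

Tags: python pandas

The cells in my DataFrame have an odd format: the data is stored in nested lists and tuples. I want to unpack the values and split them into rows. Currently, I have the following DataFrame: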

import pandas as pd

d={'Filename': {0: 'A', 1: 'B'},
 'RGB': {0: [([(0, 1650), (6, 39)], [(0, 1691), (1, 59)], [(50, 1402), (49, 187)])],
  1: [([(0, 1423), (16, 38)], [(0, 1445), (16, 46)], [(0, 1419), (16, 39)])]},
 'RGB_type': {0: ['r', 'g', 'b'], 1: ['r', 'g', 'b']}}
df=pd.DataFrame(d)

print(df)
    Filename    RGB                                                                             RGB_type
0   A           [([(0, 1650), (6, 39)], [(0, 1691), (1, 59)], [(50, 1402), (49, 187)])]         [r, g, b]
1   B           [([(0, 1423), (16, 38)], [(0, 1445), (16, 46)], [(0, 1419), (16, 39)])]         [r, g, b]

I would like to get it into the following format:

     Filename    Top 1 colour    Top 1 frequency    Top 2 colour    Top 2 frequency  rgb
0    A           0               1650               6               39               r
0    A           0               1691               1               59               g
0    A           50              1402               49              187              b
1    B           0               1423               16              38               r
1    B           0               1445               16              46               g
1    B           0               1419               16              39               b

I have been able to access the first list using df_it.RGB.apply(pd.Series), but I am not sure how to continue from there.

2 answers:

Answer 0: (score: 2)

Here is one approach:

from itertools import chain
import numpy as np
import pandas as pd

# flatten the nested lists into an array and reshape into 4 columns
a = np.array(list(chain.from_iterable(df.RGB.values)))
out = pd.DataFrame(a.reshape(-1,4), 
                   columns=['Top 1 colour','Top 1 frequency',
                            'Top 2 colour','Top 2 frequency'])
# explode the remaining columns and assign back to the new dataframe
out.assign(**df.explode('RGB_type')[['Filename', 'RGB_type']]
               .reset_index(drop=True))

   Top 1 colour  Top 1 frequency  Top 2 colour  Top 2 frequency  Filename  RGB_type
0             0             1650             6               39         A         r
1             0             1691             1               59         A         g
2            50             1402            49              187         A         b
3             0             1423            16               38         B         r
4             0             1445            16               46         B         g
5             0             1419            16               39         B         b
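
To see why reshape(-1, 4) lines up with the desired columns, it can help to inspect the intermediate array. The following is a minimal sketch that rebuilds the DataFrame from the question; the names nested and flat are only illustrative and do not appear in the answer.

```python
from itertools import chain

import numpy as np
import pandas as pd

d = {'Filename': {0: 'A', 1: 'B'},
     'RGB': {0: [([(0, 1650), (6, 39)], [(0, 1691), (1, 59)], [(50, 1402), (49, 187)])],
             1: [([(0, 1423), (16, 38)], [(0, 1445), (16, 46)], [(0, 1419), (16, 39)])]},
     'RGB_type': {0: ['r', 'g', 'b'], 1: ['r', 'g', 'b']}}
df = pd.DataFrame(d)

# Each cell is a one-element list, so chaining the cells yields one tuple per file.
nested = list(chain.from_iterable(df['RGB']))
a = np.array(nested)

# Axes: file, channel, (colour, frequency) pair, element within the pair.
print(a.shape)

# Merging the last two axes gives one row of four numbers per (file, channel).
flat = a.reshape(-1, 4)
print(flat.shape)
print(flat[2])  # file A, channel b
```

Because the array is laid out file by file and channel by channel, the row order after reshaping matches the order produced by exploding RGB_type, which is what allows the two pieces to be put side by side. This only holds when every file has the same number of channels and every channel has the same number of pairs.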

Answer 1: (score: 1)

Another approach: flatten the information in each cell first, then expand it into separate columns.

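# sum(y, ()) concatenates the tuples in y, e.g. [(0, 1650), (6, 39)] -> (0, 1650, 6, 39)
df['RGB'] = df['RGB'].apply(lambda a: [list(sum(y,())) for y in a[0]])
# repeat each row once per entry in RGB_type
df = df.reindex(df.index.repeat(df['RGB_type'].apply(len)))
# within each file, spread the list held in each column over the repeated rows
df = df.groupby('Filename').apply(lambda x:x.apply(lambda y: pd.Series(y.iloc[0])))

Out: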

    Filename    RGB RGB_type
0   A   [0, 1650, 6, 39]    r
1   NaN [0, 1691, 1, 59]    g
2   NaN [50, 1402, 49, 187] b
3   B   [0, 1423, 16, 38]   r
4   NaN [0, 1445, 16, 46]   g
5   NaN [0, 1419, 16, 39]   b



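# expand each 4-element list into its own columns, then forward-fill the missing Filename values
df.join(pd.DataFrame(df['RGB'].tolist(),columns=['Top 1 colour','Top 1 frequency',
                                         'Top 2 colour','Top 2 frequency'],index=df.index)).drop(columns='RGB').ffill()

Out: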

                Filename    RGB_type    Top 1 colour    Top 1 frequency Top 2 colour    Top 2 frequency
    Filename                            
 A  0       A   r   0   1650    6   39
    1       A   g   0   1691    1   59
    2       A   b   50  1402    49  187
 B  0       B   r   0   1423    16  38
    1       B   g   0   1445    16  46
    2       B   b   0   1419    16  39
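
For comparison, the same result can also be built row by row in plain Python. This is only a sketch under the assumption that every RGB cell is a one-element list wrapping a tuple of per-channel lists, as in the question; it is not taken from either answer, and the names value_columns, rows and result are illustrative.

```python
import pandas as pd

d = {'Filename': {0: 'A', 1: 'B'},
     'RGB': {0: [([(0, 1650), (6, 39)], [(0, 1691), (1, 59)], [(50, 1402), (49, 187)])],
             1: [([(0, 1423), (16, 38)], [(0, 1445), (16, 46)], [(0, 1419), (16, 39)])]},
     'RGB_type': {0: ['r', 'g', 'b'], 1: ['r', 'g', 'b']}}
df = pd.DataFrame(d)

value_columns = ['Top 1 colour', 'Top 1 frequency',
                 'Top 2 colour', 'Top 2 frequency']

rows = []
# itertuples(name=None) yields plain tuples: (index, Filename, RGB, RGB_type)
for idx, filename, rgb, rgb_type in df.itertuples(name=None):
    # rgb[0] is the tuple of per-channel lists; pair each with its channel label
    for pairs, channel in zip(rgb[0], rgb_type):
        # flatten [(colour, freq), (colour, freq)] into four values
        values = [v for pair in pairs for v in pair]
        rows.append((idx, filename, *values, channel))

result = (
    pd.DataFrame(rows, columns=['index', 'Filename', *value_columns, 'rgb'])
    .set_index('index')   # keep the original row label, repeated per channel
    .rename_axis(None)
)
print(result)
```

This version keeps the original index (0 for A, 1 for B) repeated once per channel, as in the layout requested in the question. It is slower than the array-based approach on large frames, but it does not depend on the nesting being perfectly rectangular beyond each pair list having two pairs.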