Question

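I have a pandas DataFrame with about 5 million rows and two columns, "top_level_domain" and "category". I want to create a new DataFrame with one row per distinct top_level_domain and a category column that lists that domain's unique categories, separated by commas. Some category values in the DataFrame are already comma-separated. Other domains, such as google.com, have the same category repeated across rows, but I only want it to appear once.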

DataFrame:

df1
    top_level_domain      category
1   google.com            Search Engines
2   service-now.com       Business, Software/Hardware
3   google-analytics.com  Internet Services
4   live.com              None Assigned
5   google.com            Content Server
6   google.com            Search Engines
7   inspectlet.com        Internet Services
8   doubleclick.net       Online Shopping, Web Ads
9   google.com            Search Engines
10  doubleclick.net       Ads

Desired output:

df2
    top_level_domain      category
1   google.com            Search Engines, Content Server
2   service-now.com       Business, Software/Hardware
3   google-analytics.com  Internet Services
4   live.com              None Assigned
7   inspectlet.com        Internet Services
8   doubleclick.net       Online Shopping, Web Ads, Ads

What is the best way to accomplish this?

I tried all the examples in "Pandas groupby multiple columns, list of multiple columns".

I also tried other variations, such as the one below:

distinct_category = distinct_category.groupby('top_level_domain')['category'].agg(lambda x: ', '.join(set(x))).reset_index()

But I still get duplicates in the category column:

1   zoho.com    Online Shopping, Interactive Web Applications, Interactive Web Applications, Interactive Web Applications, Motor Vehicles
1   zohopublic.com  Internet Services, Motor Vehicles, Internet Services, Online Shopping, Internet Services

Answer 1

First, expand your DataFrame so that each row contains only one category:

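import numpy as np
import pandas as pd

# Split each comma-separated string into a list of categories
split = df['category'].str.split(', ')
# Number of categories per row, used to repeat each domain that many times
lens = split.str.len()

df = pd.DataFrame({'top_level_domain': np.repeat(df['top_level_domain'].values, lens),
                   'category': np.concatenate(split.values)})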

Then drop the duplicates and use agg with str.join:

res = df.drop_duplicates()\
        .groupby('top_level_domain')['category'].agg(', '.join)
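As a self-contained sketch of the same expand, deduplicate, and join idea, the version below swaps the np.repeat / np.concatenate step for DataFrame.explode, which assumes pandas 0.25 or later. The sample rows are copied from the question's df1; the names df1 and res are only illustrative, and sort=False is an added choice to keep the domains in first-seen order.

```python
import pandas as pd

# Sample rows copied from the question's df1.
df1 = pd.DataFrame({
    "top_level_domain": [
        "google.com", "service-now.com", "google-analytics.com", "live.com",
        "google.com", "google.com", "inspectlet.com", "doubleclick.net",
        "google.com", "doubleclick.net",
    ],
    "category": [
        "Search Engines", "Business, Software/Hardware", "Internet Services",
        "None Assigned", "Content Server", "Search Engines",
        "Internet Services", "Online Shopping, Web Ads", "Search Engines", "Ads",
    ],
})

res = (
    df1.assign(category=df1["category"].str.split(", "))  # strings -> lists
       .explode("category")                               # one category per row
       .drop_duplicates()                                 # unique (domain, category) pairs
       .groupby("top_level_domain", sort=False)["category"]
       .agg(", ".join)                                    # back to one string per domain
       .reset_index()
)
print(res)
```

Because drop_duplicates keeps the first occurrence of each pair, the categories within a domain stay in the order they first appear in the data.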

Answer 2

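First split the column on commas, then group by the domain column, and use a generator expression to flatten the nested lists, together with set and join: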

df = (distinct_category['category'].str.split(', ')
                    .groupby(distinct_category['top_level_domain'])
                    .agg(lambda x: ', '.join(set(y for z in x for y in z)))
                    .reset_index())
print(df)

       top_level_domain                        category
0       doubleclick.net   Ads, Online Shopping, Web Ads
1  google-analytics.com               Internet Services
2            google.com  Content Server, Search Engines
3        inspectlet.com               Internet Services
4              live.com                   None Assigned
5       service-now.com     Business, Software/Hardware

Another solution is to assign the split values back to the column:

df = (distinct_category.assign(category = distinct_category['category'].str.split(', '))
                       .groupby('top_level_domain')['category']
                       .agg(lambda x: ', '.join(set(y for z in x for y in z)))
                       .reset_index())
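A set does not keep insertion order, so the category order in the joined string can differ from the order in the data. If first-seen order matters, as in the question's desired output, one possible variation is to deduplicate with dict.fromkeys instead of set. This is a sketch, not part of the original answer: the helper name join_unique and the use of sort=False are assumptions made for illustration.

```python
import pandas as pd

# A few rows from the question's data, enough to show the ordering.
distinct_category = pd.DataFrame({
    "top_level_domain": ["google.com", "google.com", "google.com",
                         "doubleclick.net", "doubleclick.net"],
    "category": ["Search Engines", "Content Server", "Search Engines",
                 "Online Shopping, Web Ads", "Ads"],
})

def join_unique(lists):
    """Flatten a group of category lists and join the unique values in first-seen order."""
    flat = (cat for cats in lists for cat in cats)
    # dict.fromkeys drops repeats while keeping insertion order (Python 3.7+)
    return ", ".join(dict.fromkeys(flat))

df2 = (distinct_category["category"].str.split(", ")
       .groupby(distinct_category["top_level_domain"], sort=False)
       .agg(join_unique)
       .reset_index())
print(df2)
```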

Answer 3

The following code worked for me:

df = df.groupby('top_level_domain')['category'].agg([('category', ', '.join)]).reset_index()

See: Pandas groupby first column and add comma-separated entries from the second column
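Note that a plain ', '.join concatenates every row in the group as-is, so repeated categories are kept. The small sketch below illustrates this on a few rows from the question and shows one way to drop exact duplicate rows first. It uses a simplified agg call, and the names joined and deduped are only illustrative; it also assumes each row holds a single category, since comma-separated values would still need splitting.

```python
import pandas as pd

df = pd.DataFrame({
    "top_level_domain": ["google.com", "google.com", "google.com", "live.com"],
    "category": ["Search Engines", "Content Server", "Search Engines", "None Assigned"],
})

# Joining alone keeps the repeated "Search Engines" entry.
joined = df.groupby("top_level_domain")["category"].agg(", ".join).reset_index()
print(joined)

# Dropping duplicate (domain, category) rows first removes the repeat.
deduped = (df.drop_duplicates()
             .groupby("top_level_domain")["category"]
             .agg(", ".join)
             .reset_index())
print(deduped)
```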
