Aggregation on a one-hot-encoded DataFrame

Time: 2018-10-11 15:43:55

Tags: pandas group-by pivot-table

Suppose I have a dataset of 10,000 movies at https://bpaste.net/show/05fa224794e4, and a sample of that dataset is

tt0111161   The Shawshank Redemption (1994) 1994     9.2    619479  142 mins.   Crime|Drama
tt0110912   Pulp Fiction (1994) 1994     9.0    490065  154 mins.   Crime|Thriller
tt0137523   Fight Club (1999)   1999     8.8    458173  139 mins.   Drama|Mystery|Thriller
tt0133093   The Matrix (1999)   1999     8.7    448114  136 mins.   Action|Adventure|Sci-Fi
tt1375666   Inception (2010)    2010     8.9    385149  148 mins.   Action|Adventure|Sci-Fi|Thriller
tt0109830   Forrest Gump (1994) 1994     8.7    368994  142 mins.   Comedy|Drama|Romance
tt0169547   American Beauty (1999)  1999     8.6    338332  122 mins.   Drama
tt0499549   Avatar (2009)   2009     8.1    336855  162 mins.   Action|Adventure|Fantasy|Sci-Fi
tt0108052   Schindler's List (1993) 1993     8.9    325888  195 mins.   Biography|Drama|History|War
tt0080684   Star Wars: Episode V - The Empire Strikes Back (1980)   1980     8.8    320105  124 mins.   Action|Adventure|Family|Sci-Fi
tt0372784   Batman Begins (2005)    2005     8.3    316613  140 mins.   Action|Crime|Drama|Thriller
tt0114814   The Usual Suspects (1995)   1995     8.7    306624  106 mins.   Crime|Mystery|Thriller
tt0102926   The Silence of the Lambs (1991) 1991     8.7    293081  118 mins.   Crime|Thriller
tt0120338   Titanic (1997)  1997     7.4    284245  194 mins.   Adventure|Drama|History|Romance

I have some code that loads my dataset and makes a few changes to it

import pandas as pd
import numpy as np

headers = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres'] 
movies = pd.read_csv("imdb_top_10000.txt", sep="\t", header=None, names=headers, encoding='UTF-8')
movies.head()

one_hot_encoding = movies["genres"].str.get_dummies(sep='|')
movies = pd.concat([movies, one_hot_encoding], axis=1)

movies_top_250 = movies.sort_values('score', ascending=False).head(250)

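Given this

  • We want to find the three best genres among the 250 top-rated movies
  • The mean score of the movies in those genres
  • We also want to find the mean score of the three worst-rated genres in this set of movies

Maybe I should be thinking of a pivot table? Here I use only a subset of the genre columns.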

pd.pivot_table(movies_top_250, values=['votes', 'Action', 'Adult'], index='title', aggfunc=np.sum).sort_values('votes', ascending=False)

                                                    Action  Adult   votes
title                                                                    
The Shawshank Redemption (1994)                          0      0  619479
The Dark Knight (2008)                                   1      0  555122
Pulp Fiction (1994)                                      0      0  490065
The Godfather (1972)                                     0      0  474189
Fight Club (1999)                                        0      0  458173
The Lord of the Rings: The Fellowship of the Ri...       1      0  451263
The Matrix (1999)                                        1      0  448114
The Lord of the Rings: The Return of the King (...       1      0  428791
Inception (2010)                                         1      0  385149
The Lord of the Rings: The Two Towers (2002)             1      0  383113
Forrest Gump (1994)                                      0      0  368994

But this does not tell me which genre has the most votes. Also

movies.groupby('genres').score.mean()

returns something like

genres
Action                                                   5.837500
Action|Adventure                                         6.152381
Action|Adventure|Animation|Comedy|Family|Fantasy         7.500000
Action|Adventure|Animation|Family|Fantasy|Sci-Fi         6.100000
Action|Adventure|Biography|Crime|History|Western         6.300000
Action|Adventure|Biography|Drama|History                 7.700000

So I can't quite get my head around this. For the first question, I am thinking of getting something like

Genre          mean_score  votes_sum
Action           7.837500     103237
Adventure        6.152381     103226
Animation        5.500000     103275

2 Answers:

Answer 0: (score: 0)

Setup

import io

import numpy as np
import pandas as pd

colnames = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres']

data_url = 'https://bpaste.net/raw/05fa224794e4'

movies = pd.read_csv(data_url, sep="\t", header=None, names=colnames, encoding='UTF-8', index_col='imdbID')

And a useful helper function

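def arg_nlargest(x, n, use_index=True):
    # Return the positions of the n largest values of x (in no particular order).
    # Note: use_index is accepted but not used.
    if isinstance(x, pd.Series):
        x = x.values
    # Partition around position n - 1 so that n == len(x) is also valid.
    return np.argpartition(-x, n - 1)[:n]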
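To illustrate what the helper above relies on, here is a minimal sketch on made-up scores (the values and the labels `a` to `e` are hypothetical, not taken from the dataset). It shows that `np.argpartition` returns the positions of the largest values without sorting them, whereas the built-in `Series.nlargest` returns them already sorted.

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores, labelled a..e
scores = pd.Series([7.4, 9.2, 8.1, 9.0, 8.8], index=list("abcde"))

# Positions of the 3 largest values; the order among them is not guaranteed
pos = np.argpartition(-scores.values, 3)[:3]
top_unordered = scores.iloc[pos]

# Built-in alternative that returns the 3 largest values sorted descending
top_sorted = scores.nlargest(3)

print(sorted(top_unordered.index))
print(top_sorted)
```

If the order of the selected rows matters, `nlargest` (or a sort after the partition) is probably the simpler choice.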

The top 3 genres among the top 250 movies

First get the top 250 movies:

top250_iloc = arg_nlargest(movies['score'], 250)
movies250 = movies.iloc[top250_iloc]

Next, as you did, we expand each movie's genres into indicator columns

movies250_genre_inds = movies250["genres"].str.get_dummies(sep='|')

The naive approach is to loop over the indicator columns, collecting the aggregates for each genre.

genre_agg = {}
for genre in movies250_genre_inds.columns:
    mask = movies250_genre_inds[genre].astype(bool)
    aggregates = movies250.loc[mask].agg({'score': 'mean', 'votes': 'sum'}) 
    genre_agg[genre] = aggregates.tolist()

genre_agg = pd.DataFrame.from_dict(genre_agg, orient='index', columns=['score_mean', 'votes_sum'])

genre3_iloc = arg_nlargest(genre_agg['score_mean'], 3)
genre3 = genre_agg.iloc[genre3_iloc].sort_values('score_mean', ascending=False)
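As a possible alternative to the loop above, the same per-genre aggregates can be sketched with matrix products on the indicator columns. This is only an illustration on a hypothetical three-row frame (`toy` and its values are made up); it assumes the indicators are 0/1 integers, as `str.get_dummies` produces.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the question
toy = pd.DataFrame({
    "score": [9.0, 8.0, 7.0],
    "votes": [100, 200, 300],
    "genres": ["Crime|Drama", "Drama", "Action|Crime"],
})

# One 0/1 indicator column per genre
inds = toy["genres"].str.get_dummies(sep="|")

# Sum of votes per genre: (genres x movies) . (movies,)
votes_sum = inds.T.dot(toy["votes"])

# Mean score per genre: sum of scores divided by the number of movies in the genre
score_mean = inds.T.dot(toy["score"]) / inds.sum()

genre_agg = pd.DataFrame({"score_mean": score_mean, "votes_sum": votes_sum})
print(genre_agg)
```

On the real data, `toy` would be replaced by `movies250`; the loop and this sketch should agree, since both count a movie once for every genre it belongs to.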

Answer 1: (score: 0)

You can use the following one-line solution (line breaks added only for readability):

movies = \
    (movies.set_index(movies.columns.drop('genres').tolist())
    .genres.str.split('|',expand=True)
    .stack()
    .reset_index()
    .rename(columns={0:'genre'})
    .loc[:,['genre','score','votes']]
    .groupby('genre').agg({'score':['mean'], 'votes':['sum']})
    )

              score    votes
               mean      sum
genre                      
Action     8.425714  7912508
Adventure  8.430000  7460632
Animation  8.293333  1769806
Biography  8.393750  2112875
Comedy     8.341509  3166269
...

EXPLANATION

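The main problem is that the one_hot_encoding step produces multiple True values per movie across the genre columns. A movie can be assigned to one or more genres, so you cannot apply aggregation methods directly by genre. On the other hand, using the genres field as-is groups by each distinct combination of genres, as you showed in your question: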

                                                           genres
Action                                                   5.837500
Action|Adventure                                         6.152381
Action|Adventure|Animation|Comedy|Family|Fantasy         7.500000
Action|Adventure|Animation|Family|Fantasy|Sci-Fi         6.100000
Action|Adventure|Biography|Crime|History|Western         6.300000
Action|Adventure|Biography|Drama|History                 7.700000

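One workaround is to duplicate a movie's row when it has multiple genres. Calling split with expand set to True produces one column per genre position, and these columns can then be stacked. For example, a movie with 2 genres will appear in 2 of the resulting columns, and after stacking it will have 2 rows, one per genre. Finally, after some parsing, you can aggregate the information by genre with multiple functions. I will explain step by step: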

1. Get the top 250 movies (by score)

Load the data:

import pandas as pd
import numpy as np

headers = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres'] 
movies = pd.read_csv("imdb_top_10000.txt", sep="\t", header=None, names=headers, encoding='UTF-8')

Note that you have a null value in the genres field:

             imdbID             title  year  score  votes   runtime genres
7917  tt0990404  Chop Shop (2007)  2007    7.2   2104  84 mins.    NaN

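Pandas aggregation methods skip null values (and groupby drops rows whose grouping key is null by default). Since only 1 movie has a null value in this field, it can be set manually (checked on IMDb):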

movies.loc[movies.genres.isnull(),"genres"] = "Drama"
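A minimal sketch of why the null value matters, using a hypothetical three-row frame (`toy` and its values are made up): with default settings, `groupby` leaves out the row whose key is NaN, so that movie contributes to no genre until the value is filled in.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the middle row has no genre
toy = pd.DataFrame({
    "genres": ["Drama", np.nan, "Drama"],
    "score": [8.0, 7.2, 9.0],
})

# Count the missing values first
n_missing = toy["genres"].isna().sum()

# Default groupby: the NaN-keyed row is left out of every group
by_default = toy.groupby("genres")["score"].mean()

# After filling the missing genre, all three rows are aggregated
filled = toy.fillna({"genres": "Drama"}).groupby("genres")["score"].mean()

print(n_missing, by_default["Drama"], filled["Drama"])
```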

Now, as you already showed, we need the top 250 movies by score:

movies = movies.sort_values('score', ascending=False).head(250)

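2. Create a genre field from genres using split with expand

2.1. Set the index

Keep only the genres field as a column and move all other fields into the index. This simplifies working with the genres.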

movies = movies.set_index(movies.columns.drop('genres').tolist())
                                                                                     genres
imdbID    title                                 year score votes  runtime                  
tt0111161 The Shawshank Redemption (1994)       1994 9.2   619479 142 mins.     Crime|Drama
tt0068646 The Godfather (1972)                  1972 9.2   474189 175 mins.     Crime|Drama
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0   195238 161 mins.         Western
tt0110912 Pulp Fiction (1994)                   1994 9.0   490065 154 mins.  Crime|Thriller
tt0252487 Outrageous Class (1975)               1975 9.0   9823   87 mins.     Comedy|Drama
(250, 1)

2.2. Split by genre

This creates N columns, one for each position produced by the split.

movies = movies.genres.str.split('|',expand=True)

                                                                                       0  \
imdbID    title                                 year score votes  runtime              
tt0111161 The Shawshank Redemption (1994)       1994 9.2   619479 142 mins.    Crime   
tt0068646 The Godfather (1972)                  1972 9.2   474189 175 mins.    Crime   
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0   195238 161 mins.  Western   
tt0110912 Pulp Fiction (1994)                   1994 9.0   490065 154 mins.    Crime   
tt0252487 Outrageous Class (1975)               1975 9.0   9823   87 mins.    Comedy   

                                                                                    1  \
imdbID    title                                 year score votes  runtime               
tt0111161 The Shawshank Redemption (1994)       1994 9.2   619479 142 mins.     Drama   
tt0068646 The Godfather (1972)                  1972 9.2   474189 175 mins.     Drama   
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0   195238 161 mins.      None   
tt0110912 Pulp Fiction (1994)                   1994 9.0   490065 154 mins.  Thriller   
tt0252487 Outrageous Class (1975)               1975 9.0   9823   87 mins.      Drama   

...

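2.3. Stack

Now that each cell holds a single genre value, you can stack the columns so that a movie gets more than 1 row if it is assigned more than one genre. Note that we now have more than 250 rows (662 rows), but still 250 distinct movies.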

movies = movies.stack()

imdbID     title                                  year  score  votes   runtime     
tt0111161  The Shawshank Redemption (1994)        1994  9.2    619479  142 mins.  0      Crime
                                                                                  1      Drama
tt0068646  The Godfather (1972)                   1972  9.2    474189  175 mins.  0      Crime
                                                                                  1      Drama
tt0060196  The Good, the Bad and the Ugly (1966)  1966  9.0    195238  161 mins.  0    Western
dtype: object
(662,)
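The split and stack steps can be tried on a small scale with the sketch below. The two-row frame `toy` is hypothetical. The explicit `dropna()` is an addition of this sketch, not part of the answer: older pandas versions drop the empty cells during `stack()`, but newer versions may keep them, so dropping them explicitly should give the same result either way.

```python
import pandas as pd

# Hypothetical toy data: movie B has only one genre
toy = pd.DataFrame({
    "title": ["A", "B"],
    "genres": ["Crime|Drama", "Western"],
}).set_index("title")

# One column per genre position; B gets a missing value in column 1
wide = toy["genres"].str.split("|", expand=True)

# One row per (movie, genre position); missing cells are removed explicitly
long = wide.stack().dropna()

print(wide.shape)
print(long)
```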

3. Parse

Get a suitable data structure before aggregating:

# Multiple index to columns
movies = movies.reset_index()

# Name the new column for genre
movies = movies.rename(columns={0:'genre'})

# Only wanted fields to be aggregated
movies = movies.loc[:,['genre','score','votes']]

     genre  score   votes
0    Crime    9.2  619479
1    Drama    9.2  619479
2    Crime    9.2  474189
3    Drama    9.2  474189
4  Western    9.0  195238
(662, 3)
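For comparison, the same long format (one row per movie and genre) can probably be reached more directly with `DataFrame.explode`. This is a sketch under the assumption that pandas 0.25 or later is available, which is newer than the setup used in this answer; `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the question
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "score": [9.0, 8.0, 7.0],
    "votes": [100, 200, 300],
    "genres": ["Crime|Drama", "Drama", "Action|Crime"],
})

# Split each genres string into a list, then emit one row per list element
long = (
    toy.assign(genre=toy["genres"].str.split("|"))
       .explode("genre")
       .loc[:, ["genre", "score", "votes"]]
)

print(long)
```

The score and votes of a movie are repeated on each of its rows, which is exactly what the later groupby by genre needs.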

4. Aggregate

As you requested, score must be aggregated by mean and votes by sum:

movies = movies.groupby('genre').agg({'score':['mean'], 'votes':['sum']})

              score    votes
               mean      sum
genre                      
Action     8.425714  7912508
Adventure  8.430000  7460632
Animation  8.293333  1769806
Biography  8.393750  2112875
Comedy     8.341509  3166269
(21, 2)
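To get from this table to the best and worst three genres asked about in the question, one option is to rank the aggregated frame. The sketch below uses a hypothetical long-format frame (values made up) and named aggregation, which assumes pandas 0.25 or later and yields flat column names instead of the two-level header shown above.

```python
import pandas as pd

# Hypothetical long-format data: one row per (movie, genre)
long = pd.DataFrame({
    "genre": ["Crime", "Drama", "Crime", "Drama", "Western", "Action", "Action"],
    "score": [9.2, 9.2, 9.0, 8.4, 9.0, 8.0, 8.2],
    "votes": [10, 10, 20, 30, 5, 40, 50],
})

# Named aggregation gives flat columns: mean_score and votes_sum
genre_agg = long.groupby("genre").agg(
    mean_score=("score", "mean"),
    votes_sum=("votes", "sum"),
)

# Three best and three worst genres by mean score
best3 = genre_agg.nlargest(3, "mean_score")
worst3 = genre_agg.nsmallest(3, "mean_score")

print(best3)
print(worst3)
```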