Summary calculations on a Pandas DataFrame

Time: 2013-11-01 22:03:50

Tags: python pandas

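I have a DF that looks like the one at the bottom (an excerpt; there are 4 regions, and the dates continue quarterly).

I want to create a df (by region) holding just the difference between the latest date and the previous quarter, and between the latest date and the previous year (same quarter).

At this point both region and Quradate are indexes.

So I want something like this (not really close):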

(['region'] ['Quradate'][-1:-1])-(['region'] ['Quradate'][-2:-2]) 
& (['region']  ['Quradate'][-1:-1])-(['region'] ['Quradate'][-5:-5])  

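So I end up with two rows per region: the first is the difference in the scores (there are actually 5 scores) from the previous quarter, and the second is the difference from the previous year.

...stuck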

                                                                  Score1      Score2  
region                                           Quradate           
North_Central-Birmingham-Tuscaloosa-Anniston 2010-01-15             47           50
                                             2010-04-15             45           60
                                             2010-07-15             45           40
                                             2010-10-15             42           43
                                             2011-01-15             46           44
                                             2011-04-15             45           45
                                             2011-07-15             45           45
                                             2011-10-15             43           46
                                             2012-01-15             51           55
                                             2012-04-15             53           56
                                             2012-07-15             51           57
                                             2012-10-15             52           58
                                             2013-01-15             50           50
                                             2013-04-15             55           55
                                             2013-07-15             55           56
                                             2013-10-15             51           66   
North_Huntsville-Decatur-Florence            2010-01-15             55           55

3 answers:

Answer 0 (score: 1)

See the solution and discussion here: Selecting a new dataframe via a multi-indexed frame in Pandas using index names

Basically, you just need the difference from the previous period:

df.groupby(level='region').apply(lambda x: x.diff().iloc[-1])

and the difference from a year earlier (4 quarters):

df.groupby(level='region').apply(lambda x: x.diff(4).iloc[-1])
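
To get the two rows per region the question asks for, the two results above can be stacked. This is only a sketch: the region names, dates and scores below are made-up sample values, and the 'diff' level name is my own label, not something from the question.

```python
import pandas as pd

# Hypothetical sample: 2 regions x 8 quarters, indexed like the question's frame
dates = pd.to_datetime(['2012-01-15', '2012-04-15', '2012-07-15', '2012-10-15',
                        '2013-01-15', '2013-04-15', '2013-07-15', '2013-10-15'])
index = pd.MultiIndex.from_product([['North', 'South'], dates],
                                   names=['region', 'Quradate'])
df = pd.DataFrame({'Score1': [51, 53, 51, 52, 50, 55, 55, 51,
                              40, 41, 42, 43, 44, 45, 46, 47],
                   'Score2': [55, 56, 57, 58, 50, 55, 56, 66,
                              60, 59, 58, 57, 56, 55, 54, 53]},
                  index=index).sort_index()

# diff(n) subtracts the row n positions earlier; iloc[-1] keeps the latest row
quarter_diff = df.groupby(level='region').apply(lambda x: x.diff().iloc[-1])
year_diff = df.groupby(level='region').apply(lambda x: x.diff(4).iloc[-1])

# Stack the two frames and label each row with the kind of difference
result = pd.concat([quarter_diff, year_diff], keys=['quarter_diff', 'year_diff'])
result.index.names = ['diff', 'region']
result = result.swaplevel().sort_index()  # index becomes (region, diff)
print(result)
```

Note that diff(4) relies on the rows being sorted by date and on there being exactly one row per quarter with no gaps; it counts positions, not calendar time.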

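Answer 1 (score: 0)

I think you're roughly on the right track. The way I see it, I would create a function that computes the two values you're looking for and returns a dataframe. Something like this: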

import pandas as pd

def find_diffs(region):
    # Assumes 'Quradate' is a datetime column, not an index level
    # (call reset_index() on the frame first if it is in the index).
    score_cols = ['Score1', 'Score2']
    region = region.sort_values('Quradate')

    most_recent_date = region.Quradate.max()
    last_quarter = most_recent_date - pd.DateOffset(months=3)  # shift back one quarter (3 months)
    last_year = most_recent_date - pd.DateOffset(years=1)  # shift back one year

    # Keep only the two rows being compared, diff them, and keep the latest row
    quarter_rows = region[region.Quradate.isin([last_quarter, most_recent_date])]
    quarter_score_diff = quarter_rows[score_cols].diff().iloc[[-1]].assign(id='quarter_diff')

    year_rows = region[region.Quradate.isin([last_year, most_recent_date])]
    year_score_diff = year_rows[score_cols].diff().iloc[[-1]].assign(id='year_diff')

    return pd.concat([quarter_score_diff, year_score_diff])

Then you can do:

DF.reset_index().groupby('region').apply(find_diffs)

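The result will be a DF grouped by region, with a column for each score difference plus an extra column identifying each row as a quarterly or yearly difference.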
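
One caveat with date-based lookups like this: the shifted date has to land exactly on an existing row. A small sketch of how fixed day counts compare with calendar-aware offsets (the two dates are taken from the question's sample; the variable names are mine):

```python
import datetime
import pandas as pd

latest = pd.Timestamp('2013-10-15')

# Fixed-length shift: 365/4 days is 91.25 days, so the result is not
# midnight on the 15th and an == comparison against that row fails
approx_quarter = latest - datetime.timedelta(days=365 / 4)

# Calendar-aware shifts: exactly three months / one year earlier
exact_quarter = latest - pd.DateOffset(months=3)
exact_year = latest - pd.DateOffset(years=1)

# A 365-day shift also drifts by a day when the span crosses a leap day
leap_case = pd.Timestamp('2013-01-15') - datetime.timedelta(days=365)

print(approx_quarter, exact_quarter, exact_year, leap_case)
```

With quarterly dates that always fall on the 15th, the calendar-aware offsets match the rows in the question's data, while the day-count versions only match by coincidence.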

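Answer 2 (score: 0)

Writing a function and using it with groupby is definitely an option. Another easy thing to do is to build lists of the data within each group and do the calculation by index, which is possible because the data is regularly spaced (keep in mind that this only works if the data is regularly spaced). This approach avoids having to deal with dates at all. First I would reindex so that region appears as a column in the dataframe, then I would do the following: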

import numpy as np
import pandas as pd

# First, create some sample data: 14 quarters for each of 4 regions
# ('QS' = quarter start; any regular quarterly frequency works here)
Dates = pd.date_range('2010-1-1', periods=14, freq='QS')
Regions = ['Western', 'Eastern', 'Southern', 'Northern']
SCORES = ['Score1', 'Score2', 'Score3', 'Score4', 'Score5']
data = {score: np.random.rand(56) for score in SCORES}
data['Regions'] = [elem for elem in Regions for x in range(14)]
df = pd.DataFrame(data, index=list(Dates) * 4)

# Create a dictionary to hold the results.
# Keys are regions; each maps to a dictionary keyed by score name, whose value
# is a two-element list: the first element is (most recent - last quarter),
# the second is (most recent - same quarter last year).
ValuesDict = {region: {score: [0, 0] for score in SCORES} for region in df.Regions.unique()}

# Now group the data
dfGrouped = df.groupby('Regions')

# Iterate through the groups, turning each score column into a list.
# groupby keeps the original row order within each group, so as long as the
# frame is sorted by date the last element is the newest observation and the
# observation one year earlier is 4 positions before it (index -5).
for region, group in dfGrouped:
    for score in SCORES:
        values = list(group[score])
        ValuesDict[region][score][0] = values[-1] - values[-2]
        ValuesDict[region][score][1] = values[-1] - values[-5]

ValuesDict

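It's a bit convoluted, but this is how I usually handle this type of problem. ValuesDict contains all the data you need, but I had trouble getting it into a dataframe.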
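
For the last step, one possible way to turn a nested dictionary of that shape into a dataframe is to flatten it into one record per (region, difference type). This is a sketch only: the values_dict contents below are invented stand-ins with the same structure, and the 'quarter_diff' / 'year_diff' labels are my own naming.

```python
import pandas as pd

# Hypothetical stand-in with the same shape as ValuesDict:
# {region: {score: [quarter difference, year difference]}}
values_dict = {
    'Western': {'Score1': [0.10, -0.20], 'Score2': [0.05, 0.30]},
    'Eastern': {'Score1': [-0.15, 0.25], 'Score2': [0.40, -0.10]},
}

labels = ['quarter_diff', 'year_diff']  # position 0 and 1 in each list

# Build one flat record per (region, difference type)
records = []
for region, scores in values_dict.items():
    for position, label in enumerate(labels):
        row = {'region': region, 'diff': label}
        row.update({score: vals[position] for score, vals in scores.items()})
        records.append(row)

# Two rows per region, one column per score
result = pd.DataFrame(records).set_index(['region', 'diff'])
print(result)
```

Going through a list of flat records keeps the construction explicit, at the cost of a small loop; it gives the two-rows-per-region layout described in the question.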