Pandas Groupby: Aggregations and Conditionals

Date: 2018-04-18 19:56:02

Tags: python pandas aggregate pandas-groupby

I am grouping a pandas DataFrame by item-date pairs and would like to add some custom conditional functions, written as lambdas, to a larger aggregation.

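Using the tip here, I can do the following, which works and counts the positive values and the non-positive values (labelled 'Neg') in a given column.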

item_day_count=item_day_group['PriceDiff_pct'].agg({'Pos':lambda val: (val > 0).sum(),'Neg':lambda val: (val <= 0).sum()}).reset_index()
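The dict-of-names form above was accepted by pandas at the time of the question; pandas 1.0 and later reject a renaming dict passed to agg on a single grouped column. A minimal sketch of the same counts using keyword-based named aggregation, assuming pandas 0.25 or newer. The sample data and the 'item' and 'date' column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only PriceDiff_pct comes from the question.
item_day = pd.DataFrame({
    "item": ["a", "a", "a", "b", "b"],
    "date": ["d1", "d1", "d1", "d1", "d1"],
    "PriceDiff_pct": [0.5, -0.2, 0.0, 1.0, 2.0],
})
item_day_group = item_day.groupby(["item", "date"])

# Named aggregation: each keyword becomes an output column name.
item_day_count = item_day_group["PriceDiff_pct"].agg(
    Pos=lambda val: (val > 0).sum(),   # strictly positive values
    Neg=lambda val: (val <= 0).sum(),  # zero or negative values
).reset_index()

print(item_day_count)
```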

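I can also run a different aggregation that mixes built-in aggregators with a custom percentile function, and it returns the correct statistics: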

item_day_count_v2=item_day_group['PriceDiff_pct'].agg(['count','min',percentile(25),'mean','median',percentile(75),'max']).reset_index()
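The question does not show how percentile is defined. One common way to write such a helper, given here only as an assumed sketch, is a factory that returns a function wrapping numpy.percentile and gives it a distinct __name__, since agg uses that name as the column label. The small DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

def percentile(n):
    """Return an aggregation function that computes the n-th percentile."""
    def percentile_(x):
        return np.percentile(x, n)
    # agg() labels the output column with the function's __name__,
    # so give each percentile its own label.
    percentile_.__name__ = "percentile_%s" % n
    return percentile_

# Hypothetical data to exercise the helper.
df = pd.DataFrame({"key": ["x"] * 5, "val": [1, 2, 3, 4, 5]})
stats = df.groupby("key")["val"].agg(
    ["min", percentile(25), percentile(75), "max"]
)
print(stats)
```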

But I cannot figure out how to combine the two into one larger call. When I try the following, I get the error: AttributeError: 'DataFrameGroupBy' object has no attribute 'name'

item_day_count_v3=item_day_group['PriceDiff_pct'].agg(['count',{'Pos_Return':lambda val: (val > 0).sum(),'Neg_Return':lambda val: (val <= 0).sum()},'min',percentile(25),'mean','median',percentile(75),'max']).reset_index() 

Does anyone know how to combine these? It feels like I am close, given that the two work separately. Thanks for your help!

3 answers:

Answer 0 (score: 0)

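I would advise against mixing a dict of custom functions with the built-in aggregators in one list. Instead, you can pass everything as a list of tuples, each pairing a name with a function.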

Each name becomes the column name in the result.
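The tuple-list approach can be sketched roughly as follows, assuming a pandas version whose agg on a single grouped column accepts (name, function) tuples in a list. The sample data, the 'item' and 'date' columns, the pct_25 and pct_75 labels, and this percentile helper are assumptions made for illustration; the remaining names mirror the question.

```python
import numpy as np
import pandas as pd

def percentile(n):
    # Assumed helper: returns a function computing the n-th percentile.
    def percentile_(x):
        return np.percentile(x, n)
    return percentile_

# Hypothetical sample data.
item_day = pd.DataFrame({
    "item": ["a", "a", "a", "b", "b"],
    "date": ["d1", "d1", "d1", "d1", "d1"],
    "PriceDiff_pct": [0.5, -0.2, 0.0, 1.0, 2.0],
})
item_day_group = item_day.groupby(["item", "date"])

# Each tuple is (output column name, aggregation function or its string name).
aggs = [
    ("count", "count"),
    ("Pos_Return", lambda val: (val > 0).sum()),
    ("Neg_Return", lambda val: (val <= 0).sum()),
    ("min", "min"),
    ("pct_25", percentile(25)),
    ("mean", "mean"),
    ("median", "median"),
    ("pct_75", percentile(75)),
    ("max", "max"),
]
item_day_count_v3 = item_day_group["PriceDiff_pct"].agg(aggs).reset_index()
print(item_day_count_v3)
```

Because every entry carries an explicit name, the two lambdas and the two percentile functions do not clash over column labels.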

Answer 1 (score: 0)

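From the pandas docs for the aggregate() method:

Accepted combinations are:

  • string function name
  • function
  • list of functions
  • dict of column names -> functions (or list of functions)

So I would say it does not support every combination.

You can try this instead: put everything into a dict first, then aggregate with that dict.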

# The functions to agg on every column.
agg_dict = dict((c, ['count','min',percentile(25),'mean','median',percentile(75),'max']) for c in item_day.columns.values)

# Append to the dict the column-specific functions.
agg_dict['Pos_Return'] = lambda val: (val > 0).sum()
agg_dict['Neg_Return'] = lambda val: (val <= 0).sum()

# Agg using the dict.
item_day_group['PriceDiff_pct'].agg(agg_dict)
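In current pandas, the keys of a dict passed to agg on a grouped DataFrame must be existing column names, so labels such as Pos_Return cannot serve as keys. One variant of the dict idea, given as a sketch under that assumption, keys the dict by the real column and moves the custom logic into named functions, whose names then become the output labels. The sample data and the pos_return and neg_return helper names are hypothetical.

```python
import pandas as pd

def pos_return(val):
    # Count strictly positive values in the group.
    return (val > 0).sum()

def neg_return(val):
    # Count zero or negative values in the group.
    return (val <= 0).sum()

# Hypothetical sample data.
item_day = pd.DataFrame({
    "item": ["a", "a", "a", "b", "b"],
    "date": ["d1", "d1", "d1", "d1", "d1"],
    "PriceDiff_pct": [0.5, -0.2, 0.0, 1.0, 2.0],
})

# Dict of column name -> list of functions (built-in names and callables).
agg_dict = {
    "PriceDiff_pct": ["count", pos_return, neg_return,
                      "min", "mean", "median", "max"],
}
result = item_day.groupby(["item", "date"]).agg(agg_dict)

# The columns form a two-level index: (source column, function name).
print(result)
```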

Answer 2 (score: 0)

As others have said, you cannot mix named functions with a dict inside the agg() method.

Here is a practical way to do what you want. Let's build some data.

df = pd.DataFrame({'A':['x', 'y']*3,
                   'B':[10,20,30,40,50,60]})

df
Out[38]: 
   A   B
0  x  10
1  y  20
2  x  30
3  y  40
4  x  50
5  y  60

Define a function that counts the values greater than or equal to 30.

def ge30(x):
    return (x>=30).sum()

Now use the custom function in groupby().agg():

df.groupby('A').agg(['sum', 'mean', ge30])
Out[40]: 
     B          
   sum mean ge30
A               
x   90   30    2
y  120   40    2