Python - A faster way to count DataFrame rows by condition?

Time: 2017-09-19 15:05:16

Tags: python python-2.7 pandas

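I want to count the number of rows of a pandas DataFrame that fall into each bin and collect the counts in a list.

I think there should be a faster way than mine. Could you give me some advice?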

script.py

import pandas

binwidth = 10
data = pandas.read_csv('sample.csv', sep=' ', names=['time', 'value'], header=None, comment='#')

mylist = []

for item in data.iterrows():
    index = int(item[1]['time'] // binwidth)  # integer bin index
    if len(mylist) <= index:
        mylist.append(1)
    else:
        mylist[index] += 1

print(mylist)  # outputs [8, 4, 4]

sample.csv

# time value
1 a
2 b
3 c
4 d
6 e
7 f
8 g
9 h
10 i
12 j
15 k
17 l
21 m
22 n
26 o
29 p

3 Answers:

Answer 0 (score: 2)

You can do this with pandas.cut:

import pandas as pd

binwidth = 10
data = pd.read_csv('sample.csv', sep=' ', names=['time', 'value'], header=None, comment='#')

# First multiple of binwidth strictly above the maximum, plus 1 so range() includes it
max_bin_edge = int(data['time'].max() // binwidth + 1) * binwidth + 1
bin_edges = list(range(0, max_bin_edge, binwidth))

bins = pd.cut(data['time'], bins=bin_edges, right=False)

bin_counts = bins.groupby(bins).count()

print(bin_counts)

This also gives you the bin edges:

time
[0, 10)     8
[10, 20)    4
[20, 30)    4
Name: time, dtype: int64
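
If a plain list like the one in the question is needed, rather than a Series indexed by interval, the counts can be extracted directly. The following is a minimal sketch, not the answerer's code: it assumes the 'time' column is built inline instead of being read from sample.csv, and it swaps the groupby call for value_counts(sort=False), which should give the same counts in bin order.

```python
import pandas as pd

binwidth = 10
# Inline copy of the 'time' column from sample.csv (assumption: no file on disk).
times = pd.Series([1, 2, 3, 4, 6, 7, 8, 9, 10, 12, 15, 17, 21, 22, 26, 29], name='time')

# Left-closed bins [0, 10), [10, 20), ... wide enough to cover the maximum.
upper = int(times.max() // binwidth + 1) * binwidth
bin_edges = list(range(0, upper + 1, binwidth))
bins = pd.cut(times, bins=bin_edges, right=False)

# sort=False keeps the bins in interval order instead of sorting by count.
counts = bins.value_counts(sort=False).tolist()
print(counts)  # [8, 4, 4]
```

Because the bins are categorical, a bin with no rows still appears in the result with a count of 0.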

Answer 1 (score: 0)

I think this does the job:

import pandas

# Set the time column as the index so groupby can bin on the index labels
df = pandas.read_csv('sample.csv', sep=' ', names=['time', 'value'],
    header=None, comment='#', index_col=['time'])

binwidth = 10
# The function passed to groupby is called on each index label (the time)
grouped_df = df.groupby(lambda x: int(x // binwidth)).count()
mylist = grouped_df['value'].tolist()

Answer 2 (score: 0)

Use

In [1086]: df.groupby(df.time//10).time.count().values.tolist()
Out[1086]: [8L, 4L, 4L]

Or,

In [1092]: df.groupby(df.time//10).size().tolist()
Out[1092]: [8L, 4L, 4L]

Or, the NumPy version

In [1096]: np.bincount(df.time//10).tolist()
Out[1096]: [8L, 4L, 4L]
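
One difference between these variants shows up when a bin is empty: groupby only reports bins that actually contain rows, while np.bincount returns a count for every integer bin index up to the largest one, zeros included. A small sketch with made-up data (the `gappy` frame is hypothetical, not from the question):

```python
import numpy as np
import pandas as pd

binwidth = 10
# Hypothetical data with no rows in the [10, 20) bin.
gappy = pd.DataFrame({'time': [1, 5, 25, 31]})

bin_index = gappy['time'] // binwidth  # 0, 0, 2, 3

grouped = gappy.groupby(bin_index).size().tolist()
dense = np.bincount(bin_index).tolist()

print(grouped)  # [2, 1, 1]: the empty bin is skipped
print(dense)    # [2, 0, 1, 1]: the empty bin appears as 0
```

Which behavior is preferable depends on whether list positions need to line up with bin numbers. Note that np.bincount only accepts non-negative integers.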

Details

In [1087]: df    
Out[1087]:       
    time value   
0      1     a   
1      2     b   
2      3     c   
3      4     d   
4      6     e   
5      7     f   
6      8     g   
7      9     h   
8     10     i   
9     12     j   
10    15     k   
11    17     l   
12    21     m   
13    22     n   
14    26     o   
15    29     p   
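
To check the speed difference on your own data, a rough timing harness along these lines may help. It is only a sketch: the synthetic data, the helper names loop_count and vectorized_count, and the repeat counts are all assumptions, and actual timings depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

binwidth = 10
# Synthetic integer times (assumption: stands in for a larger sample.csv).
rng = np.random.RandomState(0)
df = pd.DataFrame({'time': np.sort(rng.randint(0, 1000, size=5000))})


def loop_count(frame):
    """Row-by-row counting, in the spirit of the question's script."""
    counts = {}
    for _, row in frame.iterrows():
        key = int(row['time'] // binwidth)
        counts[key] = counts.get(key, 0) + 1
    # Fill in zeros for empty bins so the list lines up with np.bincount.
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


def vectorized_count(frame):
    """Vectorized counting with np.bincount."""
    return np.bincount(frame['time'] // binwidth).tolist()


# Both approaches should agree before their speed is compared.
assert loop_count(df) == vectorized_count(df)

print('iterrows loop:', timeit.timeit(lambda: loop_count(df), number=1))
print('np.bincount  :', timeit.timeit(lambda: vectorized_count(df), number=100) / 100)
```

Both printed numbers are seconds per call, so they can be compared directly.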