Pandas - calculate percent of total given ranges

Time: 2018-01-23 19:18:53

Tags: python pandas

I'd like to get the percentage of speed readings that fall into each of a set of ranges. For example, 5% of the speed data is between 0 and 5, 10% is between 5 and 10, etc. I'd also like to be able to resample the output to any frequency (entire period, daily, monthly, etc.).

I have a DataFrame that looks like this:

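import numpy as np
import pandas as pd

# Build the timestamps first so the number of rows is known before df exists
dates = pd.date_range('2017-01-01', '2018-01-01', freq='h')
df = pd.DataFrame({'id': '1234',
                   'datetime': dates,
                   'speed': np.random.randint(0, 5000, len(dates)) / 100.0})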

speedintervals = [0,3,5,9,15,21]
frequency = 'D' # for daily aggregation of data
# or frequency = 'P' for entire period

The resulting DataFrame looks like this:

    datetime             id     speed
0   2017-01-01 00:00:00 1234    17.08
1   2017-01-01 01:00:00 1234    16.30
2   2017-01-01 02:00:00 1234    12.74
3   2017-01-01 03:00:00 1234    39.89
4   2017-01-01 04:00:00 1234    34.33
5   2017-01-01 05:00:00 1234    22.76
6   2017-01-01 06:00:00 1234    13.72
...

I imagine I'd set datetime as the index and do a resample of some sort, but I'm not sure how to build out the data. Ultimately, I want the data to look like this:

For entire period:

id      start_date      end_date        0<=3    3<=9    9<=15   15<=21  >21
1234    1/1/17 0:00     1/1/18 23:00    0.49    0.13    0.18    0.17    0.00

For daily frequency:

id      periodEnd   0<=3    3<=9    9<=15   15<=21  >21
1234    1/1/18      0.49    0.13    0.18    0.17    0.00
1234    1/2/18      0.50    0.14    0.17    0.16    0.00
1234    1/3/18      0.25    0.10    0.25    0.25    0.15
...

Any thoughts?

1 answer:

Answer 0 (score: 1)

Here is one way to do it.

speedintervals = [0, 3, 5, 9, 15, 21, 100]
# include_lowest=True keeps speeds of exactly 0 in the first bin
df["interval"] = pd.cut(df["speed"], bins=speedintervals, include_lowest=True)
result = (df.groupby([pd.Grouper(key="datetime", freq="D"), "interval"])["interval"].count()
          .unstack(0).T.fillna(0)
          )
  • Add 100 to your list to capture the high speeds.
  • Then use the cut method to group the speeds into intervals.
  • Group by the datetime and then the interval, and then count.
  • This creates a MultiIndex, so you have to unstack it to get the format you want.

You could use a pivot table instead of a groupby, but groupby (with pd.Grouper) is the better fit for dates.
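For comparison, here is a rough sketch of the pivot-table route mentioned above. It uses pd.crosstab (a pivot-table helper) rather than the groupby from this answer. The data is synthetic and the name daily_share is made up for illustration. Days are obtained by flooring the timestamps, which is an assumption about how you want to bucket them.

```python
import numpy as np
import pandas as pd

# Small synthetic frame shaped like the one in the question (values are made up).
rng = np.random.default_rng(0)
dates = pd.date_range("2017-01-01", periods=72, freq="h")
df = pd.DataFrame({"id": "1234",
                   "datetime": dates,
                   "speed": rng.integers(0, 5000, len(dates)) / 100.0})

speedintervals = [0, 3, 5, 9, 15, 21, 100]
df["interval"] = pd.cut(df["speed"], bins=speedintervals, include_lowest=True)

# Rows: calendar day; columns: speed interval; values: share of that day's rows.
# normalize="index" divides every row by its own total.
daily_share = pd.crosstab(df["datetime"].dt.floor("D"),
                          df["interval"],
                          normalize="index")
print(daily_share.round(2))
```

This only handles calendar-day buckets as written. For other frequencies such as monthly, the pd.Grouper approach is easier to adapt because only the freq string changes.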

If you want the normalized result you can do

result.div(result.sum(axis=1),axis="rows")
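To make the count-then-normalize steps concrete, here is a minimal end-to-end sketch on a tiny hand-made sample. The six speed values and the names counts and shares are hypothetical. It also unstacks by level name instead of using unstack(0).T, which should give the same layout.

```python
import pandas as pd

# Tiny hand-made sample (hypothetical values) covering two days.
df = pd.DataFrame({
    "datetime": pd.to_datetime(["2017-01-01 00:00", "2017-01-01 01:00",
                                "2017-01-01 02:00", "2017-01-01 03:00",
                                "2017-01-02 00:00", "2017-01-02 01:00"]),
    "speed": [1.0, 2.5, 4.0, 30.0, 10.0, 12.0],
})

speedintervals = [0, 3, 5, 9, 15, 21, 100]
df["interval"] = pd.cut(df["speed"], bins=speedintervals)

# Count rows per (day, interval), then pivot the intervals into columns.
# observed=False keeps empty intervals as zero-count columns.
counts = (df.groupby([pd.Grouper(key="datetime", freq="D"), "interval"],
                     observed=False)["speed"]
            .count()
            .unstack("interval", fill_value=0))

# Divide each row by its total so every day sums to 1.
shares = counts.div(counts.sum(axis=1), axis="index")
print(shares)
```

On this sample, two of the four readings on the first day fall in (0, 3], so that cell comes out as 0.5. Changing freq="D" to another offset alias such as "W" should change the aggregation period without touching the rest.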

For the whole time period

pd.cut(df["speed"],bins=speedintervals).value_counts()
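Note that value_counts returns raw counts by default. Since the question asks for percentages, here is a small sketch that passes normalize=True to get fractions. Two details are assumptions on my part rather than part of the answer above: np.inf replaces 100 as the last edge so no upper cap has to be guessed, and the labels imitate the column headers in the question. The sample speeds are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical speeds; only the binning logic matters here.
speed = pd.Series([1.0, 2.5, 4.0, 30.0, 10.0, 12.0, 18.0, 150.0])

# np.inf as the last edge catches every speed above 21.
edges = [0, 3, 5, 9, 15, 21, np.inf]
labels = ["0<=3", "3<=5", "5<=9", "9<=15", "15<=21", ">21"]

binned = pd.cut(speed, bins=edges, labels=labels, include_lowest=True)

# normalize=True returns fractions instead of counts;
# sort=False avoids reordering the bins by frequency.
whole_period = binned.value_counts(normalize=True, sort=False)
print(whole_period)
```

Multiply the result by 100 if you want percentages rather than fractions.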