[ Calculating time elapsed on a groupby object ]
I have some time series data which I want to separate out by category, and further separate when a binary value == 1. I want to calculate the time elapsed per day when the binary value == 1.
Here is the original data sample:
category binary
utctime
2014-10-23 13:15:08 a 0
2014-10-24 16:09:13 b 0
2014-10-24 18:56:01 a 1
2014-10-24 21:42:42 a 1
2014-10-25 00:29:22 a 0
2014-10-25 03:16:02 c 1
2014-10-25 06:02:43 c 1
2014-10-25 08:49:23 c 0
2014-10-25 11:36:03 c 1
2014-10-25 14:22:43 c 1
2014-10-25 17:09:24 d 0
2014-10-25 19:56:05 b 0
2014-10-25 22:42:45 b 0
2014-10-26 01:29:26 e 0
2014-10-26 04:16:15 d 0
2014-10-26 07:02:56 e 1
2014-10-26 09:49:36 e 1
2014-10-26 12:36:16 e 0
2014-10-26 15:22:57 e 0
2014-10-26 18:09:46 d 0
2014-10-26 20:56:26 b 0
2014-10-26 23:43:07 e 0
I started off by filtering the binary column, then grouping by category, but I lose the date index. If I group by index.date (or pd.date_grouper), I don't know how to subgroup into the separate categories.
It feels like the data may be in a particularly unhelpful shape, but I don't know how to make it better - I tried a pivot table with the categories as columns, but since utctimes are unique to categories, that didn't work. Should I take utctime out of the index?
The desired output would be something akin to the following:
category a
date total time binary == 1
2014-10-23 10 minutes
2014-10-24 5 minutes
category b
date total time binary == 1
2014-10-23 1 minutes
2014-10-24 15 minutes
Answer 1
To group by category and index date, you could use
date = df2.index.date
grouped = df2.groupby(['category', date])
Note that groupby can accept a list containing both a string and an array. The
string refers to a column name, while the array acts as a proxy for a virtual
column. The date values are not a column of df2, and yet you can group by them. Pretty cool, huh?
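As a minimal sketch of that mixed-key grouping, the snippet below groups a small made-up frame (the names demo, dates and sizes are illustrative only, not part of the original setup) by a column name and by an array derived from the index:

```python
import pandas as pd

# Hypothetical three-row frame, used only to illustrate mixed grouping keys.
demo = pd.DataFrame(
    {'category': ['a', 'a', 'b'], 'binary': [1, 1, 1]},
    index=pd.DatetimeIndex(['2014-10-24 18:56:01',
                            '2014-10-24 21:42:42',
                            '2014-10-25 03:16:02']))

# demo.index.date is an array of datetime.date objects, not a column of demo.
dates = demo.index.date

# One key is a column name, the other is an array aligned with the rows.
sizes = demo.groupby(['category', dates]).size()
print(sizes)
```

The result is indexed by a two-level MultiIndex: the category first, then the date taken from the array.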
To find the total number of minutes in each group you could aggregate using a lambda function such as
lambda x: (x.index[-1]-x.index[0])/pd.Timedelta(1, 'm')
The expression (x.index[-1]-x.index[0]) calculates the difference between the first and last timestamp in each group and returns a pd.Timedelta. Note this assumes the index is in sorted order.
Dividing by pd.Timedelta(1, 'm') returns the total number of minutes as a float.
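To see the Timedelta arithmetic in isolation, here is a small sketch using the two timestamps of category a on 2014-10-24 from the sample data (the variable names are illustrative):

```python
import pandas as pd

first = pd.Timestamp('2014-10-24 18:56:01')
last = pd.Timestamp('2014-10-24 21:42:42')

# Subtracting two Timestamps gives a pd.Timedelta.
elapsed = last - first

# Dividing one Timedelta by another gives a plain float, here in minutes.
minutes = elapsed / pd.Timedelta(minutes=1)

print(elapsed, minutes)
```

An equivalent spelling, if you prefer to avoid the division, is elapsed.total_seconds() / 60.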
Note that Jeff's method, which uses g.last()-g.first(), is likely to be much faster for large DataFrames since it computes all the Timedeltas in one vectorized operation instead of calling a lambda function once for each group.
import numpy as np
import pandas as pd
df = pd.DataFrame(
[['2014-10-23 13:15:08', 'a', 999.9, 0],
['2014-10-24 16:09:13', 'b', 24.1, 0],
['2014-10-24 18:56:01', 'a', 23.3, 1],
['2014-10-24 21:42:42', 'a', 23.0, 1],
['2014-10-25 00:29:22', 'a', 22.7, 0],
['2014-10-25 03:16:02', 'c', 23.1, 1],
['2014-10-25 06:02:43', 'c', 22.8, 1],
['2014-10-25 08:49:23', 'c', 23.7, 1],
['2014-10-25 11:36:03', 'c', 24.8, 0],
['2014-10-25 14:22:43', 'c', 25.7, 0],
['2014-10-25 17:09:24', 'd', 24.9, 0],
['2014-10-25 19:56:05', 'b', 24.6, 0],
['2014-10-25 22:42:45', 'b', 24.2, 0],
['2014-10-26 01:29:26', 'e', 22.7, 0],
['2014-10-26 04:16:15', 'd', 23.6, 0],
['2014-10-26 07:02:56', 'e', 22.4, 1],
['2014-10-26 09:49:36', 'e', 22.7, 1],
['2014-10-26 12:36:16', 'e', 22.2, 0],
['2014-10-26 15:22:57', 'e', 23.1, 0],
['2014-10-26 18:09:46', 'd', 23.8, 0],
['2014-10-26 20:56:26', 'b', 23.8, 0],
['2014-10-26 23:43:07', 'e', 22.7, 0]],
columns=['utctime', 'category', 'temp', 'binary'])
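# Use the timestamps as a DatetimeIndex
df = df.set_index('utctime')
df.index = pd.DatetimeIndex(df.index)
# Keep only the rows where binary == 1 (copy so that df2 can be modified later)
df2 = df.loc[df['binary']==1].copy()
# Group by the category column and by the calendar date of each timestamp
date = df2.index.date
grouped = df2.groupby(['category', date])
# Minutes between the first and last timestamp in each group (index must be sorted)
result = grouped['binary'].agg(
    lambda x: (x.index[-1]-x.index[0])/pd.Timedelta(1, 'm'))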
print(result)
yields
category
a 2014-10-24 166.683333
c 2014-10-25 333.350000
e 2014-10-26 166.666667
Name: binary, dtype: float64
Answer 2
Using @unutbu's data and setup.
Add the index as an extra column, so that the timestamps get diffed as well:
In [31]: df2['ts'] = df2.index
In [32]: df2
Out[32]:
category temp binary ts
2014-10-24 18:56:01 a 23.3 1 2014-10-24 18:56:01
2014-10-24 21:42:42 a 23.0 1 2014-10-24 21:42:42
2014-10-25 03:16:02 c 23.1 1 2014-10-25 03:16:02
2014-10-25 06:02:43 c 22.8 1 2014-10-25 06:02:43
2014-10-25 08:49:23 c 23.7 1 2014-10-25 08:49:23
2014-10-26 07:02:56 e 22.4 1 2014-10-26 07:02:56
2014-10-26 09:49:36 e 22.7 1 2014-10-26 09:49:36
A slightly more general way of grouping
In [33]: g = df2.groupby(['category',pd.Grouper(freq='D',level=0)])
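Subtracting the first row of each group from the last is a single vectorized operation, so it should be much more performant than a per-group lambda, though YMMV.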
In [34]: g.last()-g.first()
Out[34]:
temp binary ts
category
a 2014-10-24 -0.3 0 02:46:41
c 2014-10-25 0.6 0 05:33:21
e 2014-10-26 0.3 0 02:46:40
In [35]: result = g.last()-g.first()
In [46]: result['ts'] = result['ts'] / pd.Timedelta('1m')
In [47]: result
Out[47]:
temp binary ts
category
a 2014-10-24 -0.3 0 166.683333
c 2014-10-25 0.6 0 333.350000
e 2014-10-26 0.3 0 166.666667
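If you only need the elapsed minutes and would rather not add a ts column, one possible variation (a sketch, not part of either answer) is to group the index itself after converting it to a Series. It rebuilds just the filtered rows shown above, and uses max() and min() instead of last() and first(), so it does not rely on the index being sorted:

```python
import pandas as pd

# The binary == 1 rows from the setup above, rebuilt so the sketch is self-contained.
df2 = pd.DataFrame(
    {'category': ['a', 'a', 'c', 'c', 'c', 'e', 'e']},
    index=pd.DatetimeIndex(['2014-10-24 18:56:01', '2014-10-24 21:42:42',
                            '2014-10-25 03:16:02', '2014-10-25 06:02:43',
                            '2014-10-25 08:49:23', '2014-10-26 07:02:56',
                            '2014-10-26 09:49:36'], name='utctime'))

# Turn the DatetimeIndex into a Series so its values can be aggregated.
ts = df2.index.to_series()

# Group the timestamps by category and by calendar date.
g = ts.groupby([df2['category'], df2.index.date])

# Vectorized span per group, converted from Timedelta to float minutes.
minutes = (g.max() - g.min()) / pd.Timedelta(minutes=1)
print(minutes)
```

Like both answers, this measures the span between the earliest and latest timestamp in each group.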