
[ Calculating time elapsed on a groupby object ]

I have some time series data that I want to separate out by category, and further filter to the spans where a binary value == 1. I want to calculate the time elapsed per day while the binary value == 1.

Here is the original data sample:

                       category  binary
utctime                                               
2014-10-23 13:15:08           a   0
2014-10-24 16:09:13           b   0
2014-10-24 18:56:01           a   1
2014-10-24 21:42:42           a   1
2014-10-25 00:29:22           a   0
2014-10-25 03:16:02           c   1
2014-10-25 06:02:43           c   1
2014-10-25 08:49:23           c   0
2014-10-25 11:36:03           c   1
2014-10-25 14:22:43           c   1
2014-10-25 17:09:24           d   0
2014-10-25 19:56:05           b   0
2014-10-25 22:42:45           b   0
2014-10-26 01:29:26           e   0
2014-10-26 04:16:15           d   0
2014-10-26 07:02:56           e   1
2014-10-26 09:49:36           e   1
2014-10-26 12:36:16           e   0
2014-10-26 15:22:57           e   0
2014-10-26 18:09:46           d   0
2014-10-26 20:56:26           b   0
2014-10-26 23:43:07           e   0

I started off by filtering the binary column, then grouping by category, but I lose the date index. If I group by index.date (or pd.Grouper), I don't know how to subgroup into the separate categories.

It feels like the data may be in a particularly unhelpful shape, but I don't know how to improve it. I tried a pivot table with the categories as columns, but since the utctimes are unique to each category, that didn't work. Should I take utctime out of the index?

The desired output would be something akin to the following:

category a   
date              total time binary == 1
2014-10-23          10 minutes
2014-10-24          5 minutes

category b   
date              total time binary == 1
2014-10-23          1 minute
2014-10-24          15 minutes

Answer 1


To group by category and index date, you could use

date = df2.index.date
grouped = df2.groupby(['category', date])

Note that groupby can accept a list containing both a string and an array. The string refers to a column name, while the array acts as a proxy for a virtual column. The dates are not a column of df2 and yet you can group by them. Pretty cool, huh?
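Here is a toy sketch of that mixed-key grouping (a made-up mini-frame, not the question's data): the first key is a column name, the second is a plain array built from the index.

```python
import pandas as pd

# A tiny frame with a DatetimeIndex, just to demonstrate the mixed keys.
toy = pd.DataFrame(
    {'category': ['a', 'a', 'b'], 'val': [1, 2, 3]},
    index=pd.to_datetime(['2014-10-23 09:00', '2014-10-23 17:00',
                          '2014-10-24 09:00']))
date = toy.index.date            # an array, not a column of toy
# groupby happily mixes the column name with the external array
counts = toy.groupby(['category', date])['val'].sum()
print(counts)
```

The result is a Series with a two-level index: category, then date.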

To find the total number of minutes in each group you could aggregate using a lambda function such as

lambda x: (x.index[-1]-x.index[0])/pd.Timedelta(1, 'm')

(x.index[-1]-x.index[0]) calculates the difference between the first and last timestamp in each group. Note this assumes the index is in sorted order. The difference (x.index[-1]-x.index[0]) returns a pd.Timedelta. Dividing by pd.Timedelta(1, 'm') returns the total number of minutes.
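A quick standalone sketch of that arithmetic: subtracting two Timestamps yields a Timedelta, and dividing one Timedelta by another yields a plain float.

```python
import pandas as pd

# Difference of two Timestamps is a Timedelta (here 2h 46m 41s)
span = pd.Timestamp('2014-10-24 21:42:42') - pd.Timestamp('2014-10-24 18:56:01')
# Dividing by a one-minute Timedelta converts the span into minutes
minutes = span / pd.Timedelta(1, 'm')
print(minutes)
```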

Note that Jeff's method, which uses g.last()-g.first(), is likely to be much faster for large DataFrames since it computes all the Timedeltas in one vectorized operation instead of calling the lambda function once per group.
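A hedged sketch of that vectorized route on a hand-picked subset of the data: copy the index into a column, take the per-group first and last timestamps, and subtract whole columns at once.

```python
import pandas as pd

# A few binary == 1 rows from the example data
df2 = pd.DataFrame(
    {'category': ['a', 'a', 'c', 'c']},
    index=pd.to_datetime(['2014-10-24 18:56:01', '2014-10-24 21:42:42',
                          '2014-10-25 03:16:02', '2014-10-25 08:49:23']))
df2['ts'] = df2.index            # timestamps as an ordinary column
g = df2.groupby('category')['ts']
# One vectorized subtraction instead of a lambda call per group
mins = (g.last() - g.first()) / pd.Timedelta(1, 'm')
print(mins)
```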


import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['2014-10-23 13:15:08', 'a', 999.9, 0],
     ['2014-10-24 16:09:13', 'b', 24.1, 0],
     ['2014-10-24 18:56:01', 'a', 23.3, 1],
     ['2014-10-24 21:42:42', 'a', 23.0, 1],
     ['2014-10-25 00:29:22', 'a', 22.7, 0],
     ['2014-10-25 03:16:02', 'c', 23.1, 1],
     ['2014-10-25 06:02:43', 'c', 22.8, 1],
     ['2014-10-25 08:49:23', 'c', 23.7, 1],
     ['2014-10-25 11:36:03', 'c', 24.8, 0],
     ['2014-10-25 14:22:43', 'c', 25.7, 0],
     ['2014-10-25 17:09:24', 'd', 24.9, 0],
     ['2014-10-25 19:56:05', 'b', 24.6, 0],
     ['2014-10-25 22:42:45', 'b', 24.2, 0],
     ['2014-10-26 01:29:26', 'e', 22.7, 0],
     ['2014-10-26 04:16:15', 'd', 23.6, 0],
     ['2014-10-26 07:02:56', 'e', 22.4, 1],
     ['2014-10-26 09:49:36', 'e', 22.7, 1],
     ['2014-10-26 12:36:16', 'e', 22.2, 0],
     ['2014-10-26 15:22:57', 'e', 23.1, 0],
     ['2014-10-26 18:09:46', 'd', 23.8, 0],
     ['2014-10-26 20:56:26', 'b', 23.8, 0],
     ['2014-10-26 23:43:07', 'e', 22.7, 0]], 
    columns=['utctime', 'category', 'temp', 'binary'])
df = df.set_index('utctime')
df.index = pd.DatetimeIndex(df.index)
df2 = df.loc[df['binary']==1]
date = df2.index.date
grouped = df2.groupby(['category', date])
result = grouped['binary'].agg(
    lambda x: (x.index[-1]-x.index[0])/pd.Timedelta(1, 'm'))
print(result)

yields

category            
a         2014-10-24    166.683333
c         2014-10-25    333.350000
e         2014-10-26    166.666667
Name: binary, dtype: float64

Answer 2


Using @unutbu data & setup

Add in an extra column that we want to diff as well

In [31]: df2['ts'] = df2.index

In [32]: df2
Out[32]: 
                    category  temp  binary                  ts
2014-10-24 18:56:01        a  23.3       1 2014-10-24 18:56:01
2014-10-24 21:42:42        a  23.0       1 2014-10-24 21:42:42
2014-10-25 03:16:02        c  23.1       1 2014-10-25 03:16:02
2014-10-25 06:02:43        c  22.8       1 2014-10-25 06:02:43
2014-10-25 08:49:23        c  23.7       1 2014-10-25 08:49:23
2014-10-26 07:02:56        e  22.4       1 2014-10-26 07:02:56
2014-10-26 09:49:36        e  22.7       1 2014-10-26 09:49:36

A slightly more general way of grouping

In [33]: g = df2.groupby(['category',pd.Grouper(freq='D',level=0)])

This will be much more performant, though YMMV.

In [34]: g.last()-g.first()
Out[34]: 
                     temp  binary       ts
category                                  
a        2014-10-24  -0.3       0 02:46:41
c        2014-10-25   0.6       0 05:33:21
e        2014-10-26   0.3       0 02:46:40

In [35]: result = g.last()-g.first()                   

In [46]: result['ts'] = result['ts'] / pd.Timedelta('1m')

In [47]: result
Out[47]: 
                     temp  binary          ts
category                                     
a        2014-10-24  -0.3       0  166.683333
c        2014-10-25   0.6       0  333.350000
e        2014-10-26   0.3       0  166.666667
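
For reference, pd.Grouper(freq='D', level=0) bins level 0 of the index (the DatetimeIndex) into daily buckets. A minimal standalone sketch of the whole pipeline, on a hand-picked subset of the binary == 1 rows:

```python
import pandas as pd

idx = pd.to_datetime(['2014-10-24 18:56:01', '2014-10-24 21:42:42',
                      '2014-10-25 03:16:02', '2014-10-25 08:49:23'])
df2 = pd.DataFrame({'category': ['a', 'a', 'c', 'c'],
                    'binary': [1, 1, 1, 1]}, index=idx)
df2['ts'] = df2.index            # timestamps as a diffable column
# Group by category, then by daily bins of the DatetimeIndex (level 0)
g = df2.groupby(['category', pd.Grouper(freq='D', level=0)])['ts']
result = (g.last() - g.first()) / pd.Timedelta('1m')
print(result)
```

The resulting Series is indexed by (category, day), with the elapsed minutes as values.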