Saturday, January 5, 2013

Count records based on a criteria for each day in postgres

Suppose we have a table in the database, where the last login times of the users are stored, and we have some data in it, like the following table:

userid last_login
651651 2013-01-05 21:39:45
894156 2013-01-04 11:31:00
132546 2013-01-04 15:01:49
654713 2013-01-04 01:05:16
654231 2013-01-02 16:40:57
687413 2012-12-31 23:40:57

We would like to know on each day, how many users logged in last, so wee need an SQL to list all days and select the count of users last logged in on that day.

The first task is to find an SQL with which consecutive days can be listed.
The query we are looking for is the following:

select i::date from 
 generate_series('2012-12-30', '2013-01-05', '1 day'::interval) i;
This produces the output:

day
2012-12-30
2012-12-31
2013-01-01
2013-01-02
2013-01-03
2013-01-04
2013-01-05

Great, now we need a criteria for the last_login for each day. The previous query generates timestamps with 00:00:00.000, so we need an interval with the current day as the beginning, and the next day as the end of the interval we are looking for.
Withe the help of the query

select (i::date + '1 day'::interval) from 
 generate_series('2012-12-30', '2013-01-05', '1 day'::interval) i;
we will be able to accomplish this task. Its output is quite similar with the previous one, but in this case the result set is shifted into the future with one day (and the precision changed, but this is not important for our point of view, because we could achieved the same precision with our first query selecting i:timestamp instead of i:date).

day
2012-12-31 00:00:00.000
2013-01-01 00:00:00.000
2013-01-02 00:00:00.000
2013-01-03 00:00:00.000
2013-01-04 00:00:00.000
2013-01-05 00:00:00.000
2013-01-06 00:00:00.000

Superb, we now only need to combine the queries to build our desired selection:

select to_char(i::date, 'YYYY.MM.DD') as last_login_day, 
count(u.userid) as users
from generate_series('2012-12-30', '2013-01-05', '1 day'::interval) i,
 users u
where (i::date + '1 day'::interval) > u.last_login 
 and i::date < u.last_login 
group by last_login_day order by last_login_day:

The result shows the correct values, but unfortunately, the days which none of the users chose to be their last login date are not show up in the result set:

last_login_day users
2012-12-31 1
2013-01-02 1
2013-01-04 3
2013-01-05 1

This is actually not a big problem, and can be solved by using a union. We create an other selection for displaying all days with 0 user counts, and take the union of the two results by grouping by the days and summing up the users (of course the additional order by is needed as well):

select tmp.last_login_day, sum(tmp.users) from (
 (
 select to_char(i::date, 'YYYY.MM.DD') as last_login_day, 
   count(u.userid) as users
 from generate_series('2012-12-30', '2013-01-05', '1 day'::interval)
   i, users u
 where (i::date + '1 day'::interval) > u.last_login 
   and i::date < u.last_login
 group by last_login_day order by last_login_day
 ) union (
 select to_char(i::date, 'YYYY.MM.DD') as last_login_day,
   0 as users
 from generate_series('2012-12-30', '2013-01-05', '1 day'::interval) i
 )
) as tmp group by 1 order by 1

The results are what we wanted to achieve in the first place:

last_login_day users
2012-12-30 0
2012-12-31 1
2013-01-01 0
2013-01-02 1
2013-01-03 0
2013-01-04 3
2013-01-05 1

No comments:

Post a Comment