5

While preparing for an interview, I have come across an SQL question and I hope to get some insight as to how to better answer it.

Given timestamps, userid, how to determine the number of users who are active everyday in a week?

There's very little to it, but that's the question in front of me.

Selfish
  • 6,023
  • 4
  • 44
  • 63
user3362840
  • 151
  • 1
  • 2
  • 11

6 Answers6

7

I'm going to demonstrate such an idea based on what makes most sense to me and the way I would reply if the question was presented same as here:

First, let's assume a data set as such, we will name the table logins:

+---------+---------------------+
| user_id |   login_timestamp   |
+---------+---------------------+
|       1 | 2015-09-29 14:05:05 |
|       2 | 2015-09-29 14:05:08 |
|       1 | 2015-09-29 14:05:12 |
|       4 | 2015-09-22 14:05:18 |
|   ...   |          ...        |
+---------+---------------------+

There may be other columns, but we don't mind those.

First of all we should determine the borders of that week, for that we can use ADDDATE(). Combined with the idea that today's date-today's week-day (MySQL's DAYOFWEEK()), is sunday's date.

For instance: If today is Wednesday the 10th, Wed - 3 = Sun, thus 10 - 3 = 7, and we can expect Sunday to be the 7th.

We can get WeekStart and WeekEnd timestamps this way:

SELECT
DATE_FORMAT(ADDDATE(CURDATE(), INTERVAL 1-DAYOFWEEK(CURDATE()) DAY), "%Y-%m-%d 00:00:00") WeekStart, 
DATE_FORMAT(ADDDATE(CURDATE(), INTERVAL 7-DAYOFWEEK(CURDATE()) DAY), "%Y-%m-%d 23:59:59") WeekEnd;

Note: in PostgreSQL there's a DATE_TRUNC() function which returns the beginning of a specified time unit, given a date, such as week start, month, hour, and so on. But that's not available in MySQL.

Next, let's utilize WeekStart and weekEnd in order to clice our data set, in this example I'll just show how to filter, using hard coded dates:

SELECT *
FROM `logins`
WHERE login_timestamp BETWEEN '2015-09-29 14:05:07' AND '2015-09-29 14:05:13'

This should return our data set sliced, with only relevant results:

+---------+---------------------+
| user_id |   login_timestamp   |
+---------+---------------------+
|       2 | 2015-09-29 14:05:08 |
|       1 | 2015-09-29 14:05:12 |
+---------+---------------------+

We can then reduce our result set to only the user_ids, and filter out duplicates. then count, this way:

SELECT COUNT(DISTINCT user_id)
FROM `logins`
WHERE login_timestamp BETWEEN '2015-09-29 14:05:07' AND '2015-09-29 14:05:13'

DISTINCT will filter out duplicates, and count will return just the amount.

Combined, this becomes:

SELECT COUNT(DISTINCT user_id)
FROM `logins`
WHERE login_timestamp 
    BETWEEN DATE_FORMAT(ADDDATE(CURDATE(), INTERVAL 1- DAYOFWEEK(CURDATE()) DAY), "%Y-%m-%d 00:00:00") 
        AND DATE_FORMAT(ADDDATE(CURDATE(), INTERVAL 7- DAYOFWEEK(CURDATE()) DAY), "%Y-%m-%d 23:59:59")

Replace CURDATE() with any timestamp in order to get that week's user login count.


But I need to break this down to days, I hear you cry. Of course! and this is how:

First, let's translate our over-informative timestamps to just the date data. We add DISTINCT because we don't mind the same user logging in twice the same day. we count users, not logins, right? (note we step back here):

SELECT DISTINCT user_id, DATE_FORMAT(login_timestamp, "%Y-%m-%d")
FROM `logins`

This yields:

+---------+-----------------+
| user_id | login_timestamp |
+---------+-----------------+
|       1 | 2015-09-29      |
|       2 | 2015-09-29      |
|       4 | 2015-09-22      |
|   ...   |        ...      |
+---------+-----------------+

This query, we will wrap with a second, in order to count appearances of every date:

SELECT `login_timestamp`, count(*) AS 'count'
FROM (SELECT DISTINCT user_id, DATE_FORMAT(login_timestamp, "%Y-%m-%d") AS `login_timestamp` FROM `logins`) `loginsMod`
GROUP BY `login_timestamp`

We use count and a grouping in order to get the list by date, which returns:

+-----------------+-------+
| login_timestamp | count |
+-----------------+-------+
| 2015-09-29      | 1     +
| 2015-09-22      | 2     +
+-----------------+-------+

And after all the hard work, both combined:

SELECT `login_timestamp`, COUNT(*)
FROM (
SELECT DISTINCT user_id, DATE_FORMAT(login_timestamp, "%Y-%m-%d") AS `login_timestamp`
FROM `logins`
WHERE login_timestamp BETWEEN DATE_FORMAT(ADDDATE(CURDATE(), INTERVAL 1- DAYOFWEEK(CURDATE()) DAY), "%Y-%m-%d 00:00:00") AND DATE_FORMAT(ADDDATE(CURDATE(), INTERVAL 7- DAYOFWEEK(CURDATE()) DAY), "%Y-%m-%d 23:59:59")) `loginsMod`
GROUP BY `login_timestamp`;

Will give you a daily breakdown of logins per-day in this week. Again, replace CURDATE() to get a different week.


As for the users themselves who logged in, let's combine the same stuff in a different order:

SELECT `user_id`
FROM (
    SELECT `user_id`, COUNT(*) AS `login_count`
    FROM (
        SELECT DISTINCT `user_id`, DATE_FORMAT(`login_timestamp`, "%Y-%m-%d")
        FROM `logins`) `logins`
    GROUP BY `user_id`) `logincounts`
WHERE `login_count` > 6

I have two inner queries, the first is logins:

SELECT DISTINCT `user_id`, DATE_FORMAT(`login_timestamp`, "%Y-%m-%d")
FROM `logins`

Will provide the list of users, and the days when they logged in on, without duplicates.

Then we have logincounts:

SELECT `user_id`, COUNT(*) AS `login_count`
FROM `logins` -- See previous subquery.
GROUP BY `user_id`) `logincounts`

Will return the same list, with a count of how many logins each user had.

And lastly: SELECT user_id FROM logincounts -- See previous subquery. WHERE login_count > 6

Filtering our those who didn't login 7 times, and dropping the date column.


This kinda got long, but I think it's rife with ideas and I think it may definitely help answering in an interesting way in a work interview. :)

Selfish
  • 6,023
  • 4
  • 44
  • 63
  • 23:59:59 is not the end of a day. Timestamps can be of much finer sub-second precision. Please stop using 23:59:59 as it will be the cause of bugs, somewhere, someday. Due to this I also advise that you avoid using between, refer to [Bad habits to kick : mis-handling date / range queries](https://sqlblog.org/2009/10/16/bad-habits-to-kick-mis-handling-date-range-queries) – Paul Maxwell Nov 28 '17 at 01:08
5
create table fbuser(id integer, date date); 
insert into fbuser(id,date)values(1,'2012-01-01'); 
insert into fbuser(id,date)values(1,'2012-01-02'); 
insert into fbuser(id,date)values(1,'2012-01-01'); 
insert into fbuser(id,date)values(1,'2012-01-01'); 
insert into fbuser(id,date)values(1,'2012-01-01'); 
insert into fbuser(id,date)values(1,'2012-01-01'); 
insert into fbuser(id,date)values(1,'2012-01-02'); 
insert into fbuser(id,date)values(1,'2012-01-03'); 
insert into fbuser(id,date)values(1,'2012-01-04'); 
insert into fbuser(id,date)values(1,'2012-01-05'); 
insert into fbuser(id,date)values(1,'2012-01-06'); 
insert into fbuser(id,date)values(1,'2012-01-07');
insert into fbuser(id,date)values(4,'2012-01-08');
insert into fbuser(id,date)values(4,'2012-01-08');
insert into fbuser(id,date)values(1,'2012-01-08');
insert into fbuser(id,date)values(1,'2012-01-09');
select * from fbuser;
id |    date    
----+------------
 1 | 2012-01-01
 1 | 2012-01-02
 1 | 2012-01-01
 1 | 2012-01-01
 1 | 2012-01-01
 1 | 2012-01-01
 1 | 2012-01-02
 1 | 2012-01-03
 1 | 2012-01-04
 1 | 2012-01-05
 1 | 2012-01-06
 1 | 2012-01-07
 2 | 2012-01-07
 3 | 2012-01-07
 4 | 2012-01-07
 4 | 2012-01-08
 4 | 2012-01-08
 1 | 2012-01-08
 1 | 2012-01-09

select id,count(DISTINCT date) from fbuser 
where date BETWEEN '2012-01-01' and '2012-01-07' 
group by id having count(DISTINCT date)=7

 id | count 
----+-------
  1 |     7
(1 row)

Query counts unique dates logged in by user for the given period and returns id with 7 occurrences. If you have time also in your date you can use date_format.

Dwipam Katariya
  • 144
  • 1
  • 7
  • I like this answer. One observation is that the OP's problem statement has timestamp instead of date, so for anyone else looking at this answer can just substitute date --> DATE(login_timestamp), and give the COUNT an alias so that you don't have to also write it out in the HAVING clause – Nic Scozzaro Aug 31 '18 at 05:04
0

I tried this in Teradata and here is the SQL. First, get the User unique to a date, then check, if the user is present for 7 days.

SELECT src.USER_ID
  ,COUNT(*) CNT
FROM (SELECT USER_ID
      ,CAST(LOGIN_TIMESTAMP AS DATE FORMAT 'YYYY-MM-DD') AS LOGIN_DT
  FROM src_table
  WHERE LOGIN_TIMESTAMP BETWEEN '2017-11-12 00:00:00' AND '2017-11-18 23:59:59'
  GROUP BY 1,2
  )src GROUP BY 1 HAVING CNT = 7;
INSERT INTO src_table VALUES (1,'2017-11-12 10:10:10');
INSERT INTO src_table VALUES (1,'2017-11-13 10:10:10');
INSERT INTO src_table VALUES (1,'2017-11-13 11:10:10');
INSERT INTO src_table VALUES (1,'2017-11-13 12:10:10');
INSERT INTO src_table VALUES (1,'2017-11-14 10:10:10');
INSERT INTO src_table VALUES (1,'2017-11-15 10:10:10');
INSERT INTO src_table VALUES (1,'2017-11-16 10:10:10');
INSERT INTO src_table VALUES (1,'2017-11-17 10:10:10');
INSERT INTO src_table VALUES (1,'2017-11-18 10:10:10');

INSERT INTO src_table VALUES (2,'2017-11-12 01:10:10');
INSERT INTO src_table VALUES (2,'2017-11-13 13:10:10');
INSERT INTO src_table VALUES (2,'2017-11-14 14:10:10');
INSERT INTO src_table VALUES (2,'2017-11-15 12:10:10');


INSERT INTO src_table VALUES (5,'2017-11-12 01:10:10');
INSERT INTO src_table VALUES (5,'2017-11-13 02:10:10');
INSERT INTO src_table VALUES (5,'2017-11-14 03:10:10');
INSERT INTO src_table VALUES (5,'2017-11-15 04:10:10');
INSERT INTO src_table VALUES (5,'2017-11-16 05:10:10');
INSERT INTO src_table VALUES (5,'2017-11-17 06:10:10');

INSERT INTO src_table VALUES (8,'2017-11-12 04:10:10');
INSERT INTO src_table VALUES (8,'2017-11-13 05:10:10');
INSERT INTO src_table VALUES (8,'2017-11-14 06:10:10');
INSERT INTO src_table VALUES (8,'2017-11-15 01:10:10');
INSERT INTO src_table VALUES (8,'2017-11-16 02:10:10');
INSERT INTO src_table VALUES (8,'2017-11-17 03:10:10');
INSERT INTO src_table VALUES (8,'2017-11-18 03:10:10');
Jegan V
  • 1
  • 2
0

With given data of: userid and timestamp; How does one calculate the number of "active users" on each day in a week?

The problem of course is that there might be no logins at all, or none on certain days in a week, so the basic solution to such a requirement is that you must have a series of dates to compare the logins against.

There are a wide variety of ways to generate the dates of a week and the method one chooses would depend on 2 main factors:

  1. How often do I need these (or similar) results?
  2. the platform I am using. (For example it is very easy to "generate a series" using Postgres but MySQL does not offer such a feature whereas recently MariaDB has introduced series tables to help solve such needs. So knowing your platform's capabilities will affect how you solve this.)

IF I need to do this regularly (which I assume will be true) then I would create a "calendar table" of one row per day for a reasonable extensive period (say 10 years) which is only approx 3652 rows, with its primary key as the date column. In this table we can also store the "week_number" using the week() function which makes week by week reporting simpler (and we can add other columns in this table as well).

So, assuming I have built the calendar table containing each date and a week number then we can take the week number from today's date, subtract 1, and gather the needed login data like this:

select
      c.caldate, count(distinct l.userid) as user_logins
from calendar_table as c
left join login_table l on l.timestamp >= c.caldate and l.timestamp < date_add(c.caldate,INTERVAL 1 DAY)
where c.week_number = WEEK(curdate())-1
group by c.caldate

How did I create the calendar table?

Well as said earlier there are a variety of methods, and for MySQL there are options available here: How to populate a table with a range of dates?

Paul Maxwell
  • 33,002
  • 3
  • 32
  • 51
0

This works for me

select a.user_id, count(a.user_id) as active_time_in_days
from
(
select user_id, login_time, lead(login_time) over (partition by user_id order by login_time asc )  as next_day
from dev.login_info
group by 1,2
order by user_id, login_time asc
)a where a.login_time + interval '1 day' = next_day
group by 1;
Preti
  • 1
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Feb 07 '22 at 22:49
-1

How about this? I tried it and it works.

select yearweek(ts) as yearwk, user_id,
 count(user_id) as counts 
 from log
 group by 1,2
 having count(user_id) =7;
Miknash
  • 7,888
  • 3
  • 34
  • 46
  • 1
    Correct query: `SELECT user_id, yearweek(ts) as yearwk FROM login GROUP by 1,2 HAVING count(distinct DATE(ts))=7` group by week and make sure that you login once every day. Your query right now is just looking if you login 7 times which could all of them be the same day. – zipp Apr 16 '17 at 01:09