Database Administrators Asked on October 28, 2021
Using postgresql 9.6.
The table has user sessions and I need distinct non overlapping sessions printed.
CREATE TABLE SESSIONS(
id serial NOT NULL PRIMARY KEY,
ctn INT NOT NULL,
day DATE NOT NULL,
f_time TIME(0) NOT NULL,
l_time TIME(0) NOT NULL
);
INSERT INTO SESSIONS(id, ctn, day, f_time, l_time)
VALUES
(1, 707, '2019-06-18', '10:48:25', '10:56:17'),
(2, 707, '2019-06-18', '10:48:33', '10:56:17'),
(3, 707, '2019-06-18', '10:53:17', '11:00:49'),
(4, 707, '2019-06-18', '10:54:31', '10:57:37'),
(5, 707, '2019-06-18', '11:03:59', '11:10:39'),
(6, 707, '2019-06-18', '11:04:41', '11:08:02'),
(7, 707, '2019-06-18', '11:11:04', '11:19:39');
id ctn day f_time l_time
1 707 2019-06-18 10:48:25 10:56:17
2 707 2019-06-18 10:48:33 10:56:17
3 707 2019-06-18 10:53:17 11:00:49
4 707 2019-06-18 10:54:31 10:57:37
5 707 2019-06-18 11:03:59 11:10:39
6 707 2019-06-18 11:04:41 11:08:02
7 707 2019-06-18 11:11:04 11:19:39
Now I need distinct non-overlapping user sessions, so it should give me:
1. start_time: 10:48:25 end_time: 11:00:49 duration: 12min,24 sec
2. start_time: 11:03:59 end_time: 11:10:39 duration: 6min,40 sec
3. start_time: 11:11:04 end_time: 11:19:39 duration: 8min,35 sec
I used the accepted answer for my own needs until I found a limit:
What if we have these rows ?
( 9, 707, '2019-06-18', '12:15:15', '13:13:13'),
(10, 707, '2019-06-18', '13:04:41', '13:10:02'),
(11, 707, '2019-06-18', '13:12:17', '13:22:22'),
Both queries create a new group starting on row 10 instead of grouping these 3 rows together.
Interval No. | | Interval start | | Interval stop | | Duration
--------------+---------------+----------------+-------------+---------------+-------------+----------
1 | Start time: | 10:48:25 | End time: | 11:00:49 | Duration: | 00:12:24
2 | Start time: | 11:03:59 | End time: | 11:10:39 | Duration: | 00:06:40
3 | Start time: | 11:11:04 | End time: | 11:19:39 | Duration: | 00:08:35
4 | Start time: | 12:15:15 | End time: | 13:13:13 | Duration: | 00:57:58
5 | Start time: | 13:12:17 | End time: | 13:22:22 | Duration: | 00:10:05
6 | Start time: | 14:05:17 | End time: | 14:14:14 | Duration: | 00:08:57
Intervals 4 and 5 are overlapping.
The answer is to tweak the core query
WHEN LAG(c.lt) OVER () > ft THEN 0
needs to be replace by
WHEN max(c.lt) over (order by c.ft, c.lt rows between unbounded preceding and 1 preceding) > ft THEN 0
then you'll get:
Interval No. | | Interval start | | Interval stop | | Duration
--------------+---------------+----------------+-------------+---------------+-------------+----------
1 | Start time: | 10:48:25 | End time: | 11:00:49 | Duration: | 00:12:24
2 | Start time: | 11:03:59 | End time: | 11:10:39 | Duration: | 00:06:40
3 | Start time: | 11:11:04 | End time: | 11:19:39 | Duration: | 00:08:35
4 | Start time: | 12:15:15 | End time: | 13:22:22 | Duration: | 01:07:07
5 | Start time: | 14:05:17 | End time: | 14:14:14 | Duration: | 00:08:57
(5 rows)
Answered by Gilles on October 28, 2021
To solve this problem I did the following:
For this part I slightly added to the table definition provided by the OP. I'm firmly of the belief that DDL should be used to the maximum extent possible to "guide" the whole database programming process and could be much more powerful - an example of this would be SQL in CHECK
constraints - so far only provided by Firebird (example here) and H2 (see reference here).
However, that's all very well and good, but we have to deal with PostgreSQL's 9.6 capabilities - the OP's version. My adjusted DDL for the "simple" explanation (see the entire fiddle here):
CREATE TABLE sessions
(
id serial NOT NULL PRIMARY KEY,
ctn INT NOT NULL,
f_day DATE NOT NULL,
f_time TIME(0) NOT NULL,
l_time TIME(0) NOT NULL,
CONSTRAINT ft_less_than_lt_ck CHECK (f_time < l_time),
CONSTRAINT ctn_f_day_f_time_uq UNIQUE (ctn, f_day, f_time),
CONSTRAINT ctn_f_day_l_time_uq UNIQUE (ctn, f_day, l_time)
-- could put in a DISTINCT somewhere if you don't have these constraints
-- maybe has TIME(2) - but see complex solution
);
Indexes:
CREATE INDEX ctn_ix ON sessions USING BTREE (ctn ASC);
CREATE INDEX f_day_ix ON sessions USING BTREE (f_day ASC);
CREATE INDEX f_time_ix ON sessions USING BTREE (f_time ASC);
Just a point to note: do not use SQL keywords as table or column names - day
is such a keyword! It can be confusing for debugging &c - it's simply not good practice. I've changed your original fieldname of day
to f_day
- note all lower and python case! Whatever you do, have a standard method of naming variables and stick to it - there are many coding standards documents out there.
The change to 'f_day' has no effect on the rest of the SQL as we don't take account of sessions which span midnight. Taking account of these can be done relatively easily by doing to the following (see fiddle).
SELECT (f_day + f_time)::TIMESTAMP FROM sessions;
Now the advent of GENERATED
columns, you don't even have to worry about this - just have a GENERATED
field as above!
If a constraint to the second is unworkable - logins at the same time, you could possibly use TIME(2) (or 3..6)
for ensuring uniqueness. If [you don't want | can't have] UNIQUE
constraints, you could put in DISTINCT
in your SQL for identical login and logout times - though this is unlikely.
The fact remains that some simple DDL like this enormously simplifies your subsequent SQL (see discussion at the end of the "complex" explanation below).
You might also want to put ctn
and/or day
in your DDL UNIQUE
constraints as shown? I've also added what I think may be good indexes! You might also want to investigate the OVERLAPS
operator?
As for the sample data, I also added a few records for testing my solution as follows:
INSERT INTO sessions (id, ctn, day, f_time, l_time)
VALUES
( 1, 707, '2019-06-18', '10:48:25', '10:56:17'),
( 2, 707, '2019-06-18', '10:53:17', '11:00:49'),
( 3, 707, '2019-06-18', '10:54:31', '10:59:43'), -- record 3 is completely covered
-- by record 2
( 4, 707, '2019-06-18', '11:03:59', '11:10:39'),
( 5, 707, '2019-06-18', '11:04:41', '11:08:02'), -- GROUP 2 record 6 completely
-- covers record 7
( 6, 707, '2019-06-18', '11:11:04', '11:19:39'), -- GROUP 3
( 7, 707, '2019-06-18', '12:15:15', '13:13:13'),
( 8, 707, '2019-06-18', '13:04:41', '13:20:02'),
( 9, 707, '2019-06-18', '13:17:17', '13:22:22'), -- GROUP 4
(10, 707, '2019-06-18', '14:05:17', '14:14:14'); -- GROUP 5
I'll go through my logic step by step - good for you perhaps, but also for me as it helps me to clarify my thinking and will ensure that the lessons which I have learnt from this exercise will stay with me - "I hear and I forget. I see and I remember. I do and I understand." - Confucius.
All of the following is included in the fiddle.
/**
So, the desired result is:
Interval 1 - start: 10:48:25 - end 11:00:49
Interval 2 - start: 11:03:59 - end 11:10:39
Interval 3 - start: 11:11:04 - end 11:19:39
Interval 4 - start: 12:15:15 - end 13:22:22
Interval 5 - start: 14:05:17 - end 14:14:14
**/
The first step is to use the LAG
function (documentation) as follows:
SELECT
s.id AS id, s.ctn AS ctn, s.f_time AS ft, s.l_time AS lt,
CASE
WHEN LAG(s.l_time) OVER () > f_time THEN 0
ELSE 1
END AS ovl
FROM sessions s
Result:
id ctn ft lt ovl
1 707 10:48:25 10:56:17 1
2 707 10:53:17 11:00:49 0
3 707 10:54:31 10:59:43 0
4 707 11:03:59 11:10:39 1
5 707 11:04:41 11:08:02 0
6 707 11:11:04 11:19:39 1
7 707 12:15:15 13:13:13 1
8 707 13:04:41 13:20:02 0
9 707 13:17:17 13:22:22 0
10 707 14:05:17 14:14:14 1
So, whenever there's a new interval, there's a 1 in the ovl
(overlap) column.
Next, we take the cumulative SUM
of those 1's as follows:
SELECT
t1.id, t1.ctn, t1.ft, t1.lt, t1.ovl,
SUM(ovl) OVER (ORDER BY t1.ft ASC ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS s
FROM
(
SELECT
s.id AS id, s.ctn AS ctn, s.f_time AS ft, s.l_time AS lt,
CASE
WHEN LAG(s.l_time) OVER () > f_time THEN 0
ELSE 1
END AS ovl
FROM sessions s
) AS t1
ORDER BY lt, id
Result:
id ctn ft lt ovl s
1 707 10:48:25 10:56:17 1 1
3 707 10:54:31 10:59:43 0 1
2 707 10:53:17 11:00:49 0 1
5 707 11:04:41 11:08:02 0 2
4 707 11:03:59 11:10:39 1 2
6 707 11:11:04 11:19:39 1 3
7 707 12:15:15 13:13:13 1 4
8 707 13:04:41 13:20:02 0 4
9 707 13:17:17 13:22:22 0 4
10 707 14:05:17 14:14:14 1 5
So, we have now "split out", and have a way of distinguishing between, our intervals - each interval has a different value of s
- 1..5.
So, now we want to get the lowest value of f_time
and the highest value of l_time
for these intervals. My first attempt using MAX()
and MIN()
went as follows:
SELECT
ROW_NUMBER() OVER (PARTITION BY s) AS rn,
MIN(ft) OVER (PARTITION BY s ORDER BY ft, lt) AS min_f,
MAX(lt) OVER (PARTITION BY s ORDER BY ft, lt) AS max_l,
s
FROM
(
SELECT
t1.id, t1.ctn, t1.ft, t1.lt, t1.ovl,
SUM(ovl) OVER (ORDER BY t1.ft ASC ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS s
FROM
(
SELECT
s.id AS id, s.ctn AS ctn, s.f_time AS ft, s.l_time AS lt,
CASE
WHEN LAG(s.l_time) OVER () > f_time THEN 0
ELSE 1
END AS ovl
FROM sessions s
) AS t1;
ORDER BY id, lt
)AS t2
ORDER BY s, rn ASC, min_f;
Result:
rn min_f max_l s
1 10:48:25 10:56:17 1
2 10:48:25 11:00:49 1
3 10:48:25 11:00:49 1
1 11:03:59 11:10:39 2
2 11:03:59 11:10:39 2
1 11:11:04 11:19:39 3
1 12:15:15 13:13:13 4
2 12:15:15 13:20:02 4
3 12:15:15 13:22:22 4
1 14:05:17 14:14:14 5
Note how we have to get rn
= 3 for the first interval, rn
= 3 for the fourth and different values of rn
for different intervals - if there were 7 sub-intervals making up one interval, then we would have to retrieve rn
= 7 - this stumped me for a while!
Then the power of Window functions came to the rescue - if you sort the MAX()
and MIN()
differently, the correct result drops into our lap:
SELECT
ROW_NUMBER() OVER (PARTITION BY s) AS rn,
MIN(ft) OVER (PARTITION BY s ORDER BY ft, lt DESC) AS min_f,
MAX(lt) OVER (PARTITION BY s ORDER BY ft DESC, lt) AS max_l,
s
FROM
(
SELECT
t1.id, t1.ctn, t1.ft, t1.lt, t1.ovl,
SUM(ovl) OVER (ORDER BY t1.ft ASC ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS s
FROM
(
SELECT
s.id AS id, s.ctn AS ctn, s.f_time AS ft, s.l_time AS lt,
CASE
WHEN LAG(s.l_time) OVER () > f_time THEN 0
ELSE 1
END AS ovl
FROM sessions s
) AS t1
ORDER BY id, lt
)AS t2
ORDER BY s, rn ASC, min_f;
Result:
rn min_f max_l s
1 10:48:25 11:00:49 1
2 10:48:25 11:00:49 1
3 10:48:25 10:59:43 1
1 11:03:59 11:10:39 2
2 11:03:59 11:08:02 2
1 11:11:04 11:19:39 3
1 12:15:15 13:22:22 4
2 12:15:15 13:22:22 4
3 12:15:15 13:22:22 4
1 14:05:17 14:14:14 5
Note that now, rn
= 1 is always our desired record - this is the result of:
MIN(ft) OVER (PARTITION BY s ORDER BY ft, lt DESC) AS min_f,
MAX(lt) OVER (PARTITION BY s ORDER BY ft DESC, lt) AS max_l,
Note that for MIN()
, the sort is by lt DESC
and for the MAX()
(partitioned by interval - i.e. s
) it is by ft DESC
. This matches up the smallest ft
with the largest lt
which is what we want.
This is essentially our desired result - just add a bit of tidying up and formatting as per the OP's requirements and we're good to go. This part also demonstrates the use of another very useful Window function - ROW_NUMBER()
.
SELECT
ROW_NUMBER() OVER () AS "Interval No.",
' Start time: ' AS " ",
t3.min_f AS "Interval start" ,
' End time: ' AS " ",
t3.max_l AS "Interval stop",
' Duration: ' AS " ",
(t3.max_l - t3.min_f) AS "Duration"
FROM
(
SELECT
ROW_NUMBER() OVER (PARTITION BY s) AS rn,
MIN(ft) OVER (PARTITION BY s ORDER BY ft, lt DESC) AS min_f,
MAX(lt) OVER (PARTITION BY s ORDER BY ft DESC, lt) AS max_l,
s
FROM
(
SELECT
t1.id, t1.ctn, t1.ft, t1.lt, t1.ovl,
SUM(ovl) OVER (ORDER BY t1.ft ASC ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS s
FROM
(
SELECT
s.id AS id, s.ctn AS ctn, s.f_time AS ft, s.l_time AS lt,
CASE
WHEN LAG(s.l_time) OVER () > f_time THEN 0
ELSE 1
END AS ovl
FROM sessions s
) AS t1
ORDER BY id, lt
)AS t2
ORDER BY s, rn ASC, min_f
) AS t3
WHERE t3.rn = 1;
Final result:
Interval No. Interval start Interval stop Duration
1 Start time: 10:48:25 End time: 11:00:49 Duration: 00:12:24
2 Start time: 11:03:59 End time: 11:10:39 Duration: 00:06:40
3 Start time: 11:11:04 End time: 11:19:39 Duration: 00:08:35
4 Start time: 12:15:15 End time: 13:22:22 Duration: 01:07:07
5 Start time: 14:05:17 End time: 14:14:14 Duration: 00:08:57
I can make no guarantees about the performance of this query if there is a large number of records, see the result of EXPLAIN (ANALYZE, BUFFERS)
at the end of the fiddle. However, I'm assuming that since it's in a report style format, it might be for a given value of ctn
and/or day
- i.e. not too many records?
I won't show every step - after eliminating the duplicate f_time
s and l_time
s, the steps are identical.
Here, the table definition and data are slightly different (fiddle available here):
CREATE TABLE sessions
(
id serial NOT NULL PRIMARY KEY,
ctn INT NOT NULL,
f_day DATE NOT NULL,
f_time TIME(0) NOT NULL,
l_time TIME(0) NOT NULL,
CONSTRAINT ft_lt_lt CHECK (f_time < l_time),
-- CONSTRAINT ft_uq UNIQUE (f_time),
-- CONSTRAINT lt_uq UNIQUE (l_time)
CONSTRAINT ft_lt_uq UNIQUE(f_time, l_time)
-- could put in a DISTINCT somewhere to counter this possibility or
-- maybe have TIME(2) to ensure no duplicates?
);
The only constraints that I've kept are CHECK (f_time < l_time)
(couldn't be any other way) and UNIQUE f_time, l_time
(maybe add day
and/or ctn
to that - the advice about TIME(2) or (3...6)
above also applies.
I'll leave it up to the reader to apply UNIQUE
to combinations of ctn
and f_day
as applicable!
INSERT INTO sessions (id, ctn, day, f_time, l_time)
VALUES
( 1, 707, '2019-06-18', '10:48:25', '10:56:17'), -- note - same l_times
( 2, 707, '2019-06-18', '10:48:33', '10:56:17'), -- need one with lowest f_time
( 3, 707, '2019-06-18', '10:53:17', '11:00:49'),
( 4, 707, '2019-06-18', '10:54:31', '10:59:43'), -- note - same f_times
-- need one with greatest l_time
( 5, 707, '2019-06-18', '10:54:31', '10:57:37'), -- GROUP 1
( 6, 707, '2019-06-18', '11:03:59', '11:10:39'),
( 7, 707, '2019-06-18', '11:04:41', '11:08:02'), -- GROUP 2, record 6 completely
-- covers record 7
( 8, 707, '2019-06-18', '11:11:04', '11:19:39'), -- GROUP 3
( 9, 707, '2019-06-18', '12:15:15', '13:13:13'),
(10, 707, '2019-06-18', '13:04:41', '13:20:02'),
(11, 707, '2019-06-18', '13:17:17', '13:22:22'), -- GROUP 4
(12, 707, '2019-06-18', '14:05:17', '14:14:14'); -- GROUP 5
I've added a couple of potentially "troublesome" records (2 & 4) with the same f_time
and l_time
within the same interval. So, in the event of an identical f_time
, we want the sub-interval with the greatest l_time
and vice-versa for the case of an identical l_time
(i.e. smallest f_time
).
So, what I did in this case was to eliminate duplicates by chaining CTE
's (aka the WITH
clause) as follows:
WITH cte1 AS
(
SELECT s.*, t.mt, t.lt
FROM sessions s
JOIN
(
SELECT
DISTINCT
ctn,
MIN(f_time) AS mt,
l_time AS lt
FROM sessions
GROUP BY ctn, l_time
ORDER BY l_time
) AS t
ON (s.ctn, s.f_time, s.l_time) = (t.ctn, t.mt, t.lt)
ORDER BY s.l_time
),
cte2 AS
(
SELECT
DISTINCT
ctn,
f_time AS ft,
MAX(lt) AS lt
FROM cte1
GROUP BY ctn, f_time
ORDER BY f_time
)
SELECT * FROM cte2
ORDER BY ft;
Result:
ctn ft lt
707 10:48:25 10:56:17
707 10:53:17 11:00:49
707 10:54:31 10:59:43
707 11:03:59 11:10:39
707 11:04:41 11:08:02
707 11:11:04 11:19:39
707 12:15:15 13:13:13
707 13:04:41 13:20:02
707 13:17:17 13:22:22
707 14:05:17 14:14:14
And then I treat cte2
as the starting point for the process in the "easy" explanation.
The final SQL is as follows:
WITH cte1 AS
(
SELECT s.*, t.mt, t.lt
FROM sessions s
JOIN
(
SELECT
DISTINCT
ctn,
MIN(f_time) AS mt,
l_time AS lt
FROM sessions
GROUP BY ctn, l_time
ORDER BY l_time
) AS t
ON (s.ctn, s.f_time, s.l_time) = (t.ctn, t.mt, t.lt)
ORDER BY s.l_time
),
cte2 AS
(
SELECT
DISTINCT
ctn,
f_time AS ft,
MAX(lt) AS lt
FROM cte1
GROUP BY ctn, f_time
ORDER BY f_time
)
SELECT
ROW_NUMBER() OVER () AS "Interval No.",
' Start time: ' AS " ",
t3.min_f AS "Interval start" ,
' End time: ' AS " ",
t3.max_l AS "Interval stop",
' Duration: ' AS " ",
(t3.max_l - t3.min_f) AS "Duration"
FROM
(
SELECT
ROW_NUMBER() OVER (PARTITION BY s) AS rn,
MIN(ft) OVER (PARTITION BY s ORDER BY ft, lt DESC) AS min_f,
MAX(lt) OVER (PARTITION BY s ORDER BY ft DESC, lt) AS max_l,
s
FROM
(
SELECT
t1.ctn, t1.ft, t1.lt, t1.ovl,
SUM(ovl) OVER (ORDER BY t1.ft ASC ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS s
FROM
(
SELECT
c.ctn AS ctn, c.ft AS ft, c.lt AS lt,
CASE
WHEN LAG(c.lt) OVER () > ft THEN 0
ELSE 1
END AS ovl
FROM cte2 c
) AS t1
ORDER BY t1.lt
) AS t2
ORDER BY s, rn ASC, min_f
) AS t3
WHERE t3.rn = 1
ORDER BY t3.rn;
Result:
Interval No. Interval start Interval stop Duration
1 Start time: 10:48:25 End time: 11:00:49 Duration: 00:12:24
2 Start time: 11:03:59 End time: 11:08:02 Duration: 00:04:03
3 Start time: 11:11:04 End time: 11:19:39 Duration: 00:08:35
4 Start time: 12:15:15 End time: 13:22:22 Duration: 01:07:07
5 Start time: 14:05:17 End time: 14:14:14 Duration: 00:08:57
As you can see, it's pretty hairy stuff - not having the UNIQUE
constraints in the DDL has doubled the length of the SQL and the time taken for the planning and executions stages, and made it pretty horrible into the bargain.
See the end of the fiddle for the plans for both queries! Lessons to be learnt there! As a rule of thumb, the longer the plan, the slower the query!
I'm not sure that the indexes can play any role here since we're selecting from the entire table and it's very small! If we were filtering a large table by ctn
and/or f_day
and/or f_time
, I'm pretty sure that we would start to see differences in the plans (and timings!) if there were no indexes!
Answered by Vérace on October 28, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP