Database Administrators
Asked by Deepan Kaviarasu on October 28, 2021
I am stuck with a query:
CREATE TABLE public.bulk_sample (
serial_number character varying(255),
validation_date timestamp, -- timestamp of entry and exit
station_id integer,
direction integer -- 1 = Entry | 2 = Exit
);
INSERT INTO public.bulk_sample VALUES
('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 08:31:58', 120, 1)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 08:50:22', 113, 2)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 10:16:56', 113, 1)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 10:47:06', 120, 2)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 16:02:12', 120, 1)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 16:47:45', 102, 2)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 19:26:38', 102, 1)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 20:17:24', 120, 2)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 07:58:20', 119, 1)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 08:43:35', 104, 2)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 16:38:10', 104, 1)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 17:15:01', 119, 2)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 17:42:29', 119, 1)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 17:48:05', 120, 2)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 15:17:59', 120, 1)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 15:25:25', 118, 2)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 16:16:12', 118, 1)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 16:32:51', 120, 2)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 19:31:20', 120, 1)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 19:39:33', 118, 2)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 20:57:50', 118, 1)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 21:16:25', 120, 2)
;
I have to create a query that gives a result like this:
source | dest | count
   120 |  113 |     1
   113 |  120 |     1
I tried the following query but was not able to get the desired result:
SELECT serial_number
, count(*)
, min(validation_date) AS start_time
, CASE WHEN count(*) > 1 THEN max(validation_date) END AS end_time
FROM (
SELECT serial_number, validation_date, count(step OR NULL) OVER (ORDER BY serial_number,
validation_date) AS grp
FROM (
SELECT *
, lag(validation_date) OVER (PARTITION BY serial_number ORDER BY validation_date)
< validation_date - interval '60 min' AS step
FROM bulk_sample
WHERE validation_date BETWEEN '2020-02-01 00:00:00' AND '2020-02-01 23:59:59'
) sub1
) sub2
GROUP BY serial_number, grp;
The time interval between every entry and exit is about 55 to 60 minutes.
I have also tried an inner join, but I am not able to group by the time interval within the join:
SELECT source.station_id AS source_station ,dest.station_id AS destination_station ,source.count FROM
(
SELECT serial_number, station_id, count(*) FROM bulk_sample
WHERE
direction = 1 AND
validation_date BETWEEN '2020-02-01 00:00:00' AND '2020-02-01 23:59:59'
GROUP BY serial_number,station_id
)source
INNER JOIN
(
SELECT serial_number, station_id, count(*) FROM bulk_sample
WHERE
direction = 2 AND
validation_date BETWEEN '2020-02-01 00:00:00' AND '2020-02-01 23:59:59'
GROUP BY serial_number,station_id
)dest
ON source.serial_number = dest.serial_number and source.station_id <> dest.station_id
The challenge is that sometimes the entry date is NULL and sometimes the exit date is NULL.
This should be the simplest and fastest solution, as long as transactions per serial_number never overlap:
WITH cte AS (
SELECT serial_number, validation_date, station_id, direction
, row_number() OVER (PARTITION BY serial_number ORDER BY validation_date) AS rn
FROM bulk_sample
WHERE validation_date >= '2020-02-01' -- ①
AND validation_date < '2020-02-02' -- entry & exit must be within time frame
)
SELECT s.station_id AS source, d.station_id AS dest, count(*)
FROM cte s
JOIN cte d USING (serial_number)
WHERE s.direction = 1
AND d.rn = s.rn + 1
GROUP BY 1, 2
ORDER BY 1, 2; -- optional sort order
db<>fiddle here
① I rewrote the WHERE condition to get all of Feb 1 2020 in optimal fashion. BETWEEN is almost always the wrong tool for time ranges. Also, '2020-02-01' is a perfectly valid timestamp constant; 00:00:00 is assumed when the time component is missing.
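To make the difference concrete, here is a minimal sketch of the two predicates against the sample table (the half-open form is the one used in the query above):

-- Half-open range: includes 2020-02-01 00:00:00, excludes 2020-02-02 00:00:00,
-- and still catches rows with fractional seconds such as '2020-02-01 23:59:59.5'.
SELECT count(*)
FROM   bulk_sample
WHERE  validation_date >= '2020-02-01'
AND    validation_date <  '2020-02-02';

-- Closed BETWEEN range: silently misses anything after '2020-02-01 23:59:59'
-- and forces you to spell out an awkward upper bound.
SELECT count(*)
FROM   bulk_sample
WHERE  validation_date BETWEEN '2020-02-01 00:00:00' AND '2020-02-01 23:59:59';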
For retrieving rows in a given time frame, a plain btree index on (validation_date) is the optimum. For queries over the complete table, an index on (serial_number, validation_date) would help more.
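For example (index names here are just placeholders, assuming PostgreSQL):

-- Supports queries restricted to a time frame, like the one above:
CREATE INDEX bulk_sample_validation_date_idx
    ON public.bulk_sample (validation_date);

-- Supports pairing rows per serial_number across the whole table:
CREATE INDEX bulk_sample_serial_validation_idx
    ON public.bulk_sample (serial_number, validation_date);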
validation_date IS NULL?

The query keeps working as long as only the last destination per serial_number in the given time frame has validation_date IS NULL, because NULL values happen to sort last in default ascending order. But it breaks with any other cases of validation_date IS NULL. You'll have to define more closely where those can pop up and how to deal with them exactly.
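One possible policy, assuming such rows should simply be ignored when pairing, is to filter them out explicitly in the CTE. The range predicates already exclude NULL, so the extra condition only documents the intent:

WITH cte AS (
   SELECT serial_number, validation_date, station_id, direction
        , row_number() OVER (PARTITION BY serial_number ORDER BY validation_date) AS rn
   FROM   bulk_sample
   WHERE  validation_date IS NOT NULL           -- drop incomplete rows explicitly
   AND    validation_date >= '2020-02-01'
   AND    validation_date <  '2020-02-02'
   )
SELECT s.station_id AS source, d.station_id AS dest, count(*)
FROM   cte s
JOIN   cte d USING (serial_number)
WHERE  s.direction = 1
AND    d.rn = s.rn + 1
GROUP  BY 1, 2
ORDER  BY 1, 2;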
uuid instead of varchar(255) for serial_number?

Your serial_number seems to be a hexadecimal number with exactly 64 digits. If so, varchar(255) is a poor choice.
Moreover, a single uuid (32 hex digits) should suffice. If all 64 hex digits are needed, still consider two uuid columns. Smaller, faster, safer. Consider:
SELECT *
, replace(uuid1::text || uuid2::text, '-', '') AS reverse_engineered
, replace(uuid1::text || uuid2::text, '-', '') = serial_number AS identical
, pg_column_size(serial_number) AS varchar_size
, pg_column_size(uuid1) + pg_column_size(uuid2) AS uuid_size
FROM (
SELECT serial_number
, left(serial_number, 32)::uuid AS uuid1
, right(serial_number, 32)::uuid AS uuid2
FROM bulk_sample
) sub;
db<>fiddle here
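A sketch of what the narrower layout could look like; the table and column names here are made up for illustration:

CREATE TABLE public.bulk_sample_uuid (
    serial_hi       uuid,      -- first 32 hex digits of the original serial_number
    serial_lo       uuid,      -- last 32 hex digits
    validation_date timestamp,
    station_id      integer,
    direction       integer    -- 1 = Entry | 2 = Exit
);

INSERT INTO public.bulk_sample_uuid
SELECT left(serial_number, 32)::uuid
     , right(serial_number, 32)::uuid
     , validation_date
     , station_id
     , direction
FROM   public.bulk_sample;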
Answered by Erwin Brandstetter on October 28, 2021
For this you will need two things:
After that, your query becomes:
SELECT
station_entry.station_id AS source
,station_exit.station_id AS dest
,COUNT(*) AS count
FROM
public.bulk_sample station_entry
INNER JOIN
public.bulk_sample station_exit
ON station_exit.serial_number = station_entry.serial_number
AND station_exit.validation_date =
(
SELECT
MIN(validation_date)
FROM
public.bulk_sample
WHERE
serial_number = station_entry.serial_number
AND validation_date > station_entry.validation_date
)
WHERE
station_entry.direction = 1
AND station_exit.direction = 2 --Ensure next transaction is valid
AND station_entry.validation_date >= '2020-02-01 00:00:00'
AND station_entry.validation_date <= '2020-02-01 23:59:59'
AND station_exit.validation_date <= '2020-02-01 23:59:59' --Ensure both events occurred within specified timeframe
GROUP BY
station_entry.station_id
,station_exit.station_id;
Should return:
source  dest  count
   102   120      1
   104   119      1
   113   120      1
   118   120      2
   119   104      1
   119   120      1
   120   102      1
   120   113      1
   120   118      2
Answered by bbaird on October 28, 2021