Asked on Database Administrators, December 8, 2021
I have a table that contains a set of measurements for a continuous stream of processes. Although each process is individual, they are categorized into groups. The events have a start and end timestamp and a process group identifier.
The table structure is as follows (InnoDB, MariaDB 10):
Table Name: measurements
CREATE TABLE `measurements` (
  `row_id` int(11) NOT NULL AUTO_INCREMENT,
  `process_name` varchar(100) COLLATE utf8_bin NOT NULL,
  `process_id` int(11) NOT NULL,
  `process_group_id` tinyint(4) NOT NULL,
  `measurement_1` float NOT NULL,
  `measurement_2` float NOT NULL,
  `measurement_3` float NOT NULL,
  `measurement_4` float NOT NULL,
  `start_timestamp` int(11) NOT NULL,
  `end_timestamp` int(11) NOT NULL,
  PRIMARY KEY (`row_id`),
  KEY `process_group_id` (`process_group_id`,`start_timestamp`,`end_timestamp`),
  KEY `process_id` (`process_id`)
) ENGINE=InnoDB
  AUTO_INCREMENT=7294932
  DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
I’m designing a query to obtain the sum of measurements 1,2,3 & 4 for all processes running within a group at a particular point in time so that the app can express each measurement for a specific process as a percentage of the total measurements in the group at that time. The start and end times of processes within a group are not synchronized and they are of variable length.
So, for a process running in Group 5 at timestamp 1431388800:
SELECT SUM(measurement_1),
       SUM(measurement_2),
       SUM(measurement_3),
       SUM(measurement_4)
FROM measurements
WHERE process_group_id = 5
  AND 1431388800 BETWEEN start_timestamp AND end_timestamp;
This query runs, but takes around 0.5s. The table has 8m records and grows by about 30,000 a day.
I have an index on process_group_id, start_timestamp, end_timestamp. However, the query does not appear to use anything but the process_group_id part of the index. I created an additional index on process_group_id alone to check this, and once created EXPLAIN showed it using this index.
After some searching, I saw a suggestion to modify the query and add an ORDER BY clause. Having done this the query is accelerated to around 0.06s and it seems to use the full index. However, I’m unsure as to why:
SELECT process_group_id,
       SUM(measurement_1),
       SUM(measurement_2),
       SUM(measurement_3),
       SUM(measurement_4)
FROM measurements
WHERE process_group_id = 5
  AND 1431388800 BETWEEN start_timestamp AND end_timestamp
ORDER BY process_group_id ASC;
With 30,000 new records a day that require their shares to be calculated, 0.06s is still not particularly fast. Is there a better way of structuring the table or designing the query to make this a few orders of magnitude quicker, or is a query that matches on one column and then applies a range condition over two others always going to be fairly slow?
(Not an answer, but too clumsy for a comment.)
I have an index on process_group_id, start_timestamp, end_timestamp. However, the query does not appear to use anything but the process_group_id part of the index.
It is actually using start_timestamp, although it does not say so. What was the key_len? That may give a clue. Also try EXPLAIN FORMAT=JSON SELECT ... .
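For example, applied to the query from the question:

EXPLAIN FORMAT=JSON
SELECT SUM(measurement_1), SUM(measurement_2),
       SUM(measurement_3), SUM(measurement_4)
FROM measurements
WHERE process_group_id = 5
  AND 1431388800 BETWEEN start_timestamp AND end_timestamp;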
AND 1431388800 BETWEEN start_timestamp AND end_timestamp
Turn that into the following to see if it helps:
AND start_timestamp <= 1431388800
AND end_timestamp >= 1431388800
I suspect it is identical, but I am not sure.
Caution: the difference between 0.5s and 0.06s could be a warmed-up cache. Run timings twice. Also, use SQL_NO_CACHE to avoid the Query cache.
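Putting the rewritten range test and SQL_NO_CACHE together for a timing run might look like this (a sketch; note that SQL_NO_CACHE only bypasses the query cache, it does not cool the InnoDB buffer pool):

SELECT SQL_NO_CACHE
       SUM(measurement_1), SUM(measurement_2),
       SUM(measurement_3), SUM(measurement_4)
FROM measurements
WHERE process_group_id = 5
  AND start_timestamp <= 1431388800
  AND end_timestamp   >= 1431388800;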
How wide are the timestamp ranges typically? How precise is 1431388800? The values sound like they have a resolution of 1 second. What if we switched to 1 minute or 1 hour?
After you provide some answers, I will possibly suggest turning this into a Data Warehouse application and discuss Summary tables.
Edit
Consider this approach to storing the data. (I still don't have enough details to determine what variant of the following would be optimal.)
Since you have a "processing" phase that leads to the table in question, I suggest rewriting it to store into a different table (either in place of the existing one, or in addition):
CREATE TABLE ByMinute (
  process_group_id ...,
  ts TIMESTAMP NOT NULL,   -- rounded to the minute
  sum_1 FLOAT,             -- see below
  sum_2 ...,
  PRIMARY KEY(process_group_id, ts)
);
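Filled in, a concrete variant might look like the following sketch. It is only an assumption about what the "..." stands for: the group-id type is carried over from measurements, and the remaining sum columns mirror measurements 2 through 4.

CREATE TABLE ByMinute (
  process_group_id tinyint(4) NOT NULL,   -- same type as in measurements
  ts TIMESTAMP NOT NULL,                  -- rounded to the minute
  sum_1 FLOAT NOT NULL DEFAULT 0,
  sum_2 FLOAT NOT NULL DEFAULT 0,
  sum_3 FLOAT NOT NULL DEFAULT 0,
  sum_4 FLOAT NOT NULL DEFAULT 0,
  PRIMARY KEY (process_group_id, ts)
) ENGINE=InnoDB;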
The table contains one row per minute. That's 0.5M per year, not terribly big. If converting from the existing structure, do sum_1 += measurement_1 for each ByMinute row whose minute falls BETWEEN start_timestamp AND end_timestamp.
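One way to do that conversion in SQL might be the following sketch. It assumes a helper table minutes(ts), pre-filled with one row per minute over the period covered by the data (the helper table and its name are not from the post), and that each measurements row is processed exactly once:

INSERT INTO ByMinute (process_group_id, ts, sum_1, sum_2, sum_3, sum_4)
SELECT m.process_group_id,
       mi.ts,
       SUM(m.measurement_1),
       SUM(m.measurement_2),
       SUM(m.measurement_3),
       SUM(m.measurement_4)
FROM minutes AS mi                -- hypothetical helper: one row per minute
JOIN measurements AS m
  ON UNIX_TIMESTAMP(mi.ts) BETWEEN m.start_timestamp AND m.end_timestamp
GROUP BY m.process_group_id, mi.ts
ON DUPLICATE KEY UPDATE           -- accumulate into rows that already exist
  sum_1 = sum_1 + VALUES(sum_1),
  sum_2 = sum_2 + VALUES(sum_2),
  sum_3 = sum_3 + VALUES(sum_3),
  sum_4 = sum_4 + VALUES(sum_4);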
That is possibly more processing than you are currently doing, but it should not be excessive. And it makes the SELECT extremely efficient:
SELECT sum_1, sum_2, sum_3, sum_4
FROM ByMinute
WHERE process_group_id = 5
  AND ts = FROM_UNIXTIME(1431388800 - 1431388800 % 60);  -- one way to round the epoch down to the minute
You currently have a daily dump. The processing should be obvious. If you switch to "streaming" and use the ping-pong method I mentioned, then very similar code can be used for each transient table. And you would probably have nearly up-to-the-second data all the time.
Answered by Rick James on December 8, 2021