Database Administrators Asked by Mac Portter on January 6, 2022
I need to create a system where the user creates dynamic filters based on our customer’s attributes. There are more or less 30 possible filters and 30 millions of customers, but the number of customer increase every day and the attribute value can change every day too, so we have insert and updates in these set of data every day. Another thing is that I can create a new filter or remove.
In this case we can use a relational database like Oracle and create a index for every colunm, but with inserts and updates every day, can I have a problem with performance? Should I use a search engine for this case like Elasticsearch? Or there is a recommended database or architecture for this use case?
I need to return a count of customers that match these filters at most in 5 seconds.
EDIT
Some attributes:
The user can use >, <, =, >=, <=
to filter or use IN
, like city IN ('New York', 'Seattle')
(If you have deep enough pockets for Oracle and can handle the audit/licensing requirements)
Create the big ugly wide table, but then use Oracle's in-memory column store to speed your analytic queries. The high level view is each column is split, dedicated to memory (at least partially), and data compression and some other storage tricks are used to speed lookups.
This works great for the type of ad-hoc workflow you describe and doesn't require much thought about table design other than choosing the correct data types (all else being equal, smaller = faster).
As always, test thoroughly as vendor's claims are often inflated. But the engineering is sound and others have found column stores to be very beneficial.
(With caveats)
SQL Server has a similar solution, the Clustered Columnstore Index, but there are impacts to update
performance that may make it unsuitable to your workflow. If you have a wide window during which you can perform your updates, it may work well. Worth a try.
(If you don't have deep pockets or clustered columnstore doesn't work)
Recreate, in part, the same idea as a columnstore but in a typical database table. Some refer to this as sixth normal form (a timestamp is not required for the definition), others vertical partitioning. I just tend to think of it as don't read more data than you have to.
You will need an RDMS that has both clustered indexes (index-organized tables) and join elimination. Right now that's SQL Server, Oracle, Sybase, and DB2.
The idea is you have your Customer
table with a very basic amount of data stored with it. For every attribute, you will create a table Customer_<attribute>
that contains the key from Customer
and the column(s)* associated with that attribute. If that attribute is NULL
you will not insert a row into that table.**
This seems like a lot of work, but if you have a list of columns and their data types, you can write a script to generate the table creation as well as the required insert/update procedures.
You can also write a script to generate a view with all of your columns (for simplicity's sake you can just LEFT JOIN
all the attributes back to the Customer
table). The view will look like a big ugly table, but with join eliminate it will behave much differently.
So if you search for customers on three attributes, the query engine will look at the request, realize it only needs to hit three tables in addition to Customer
and off you go.
Some tables will benefit from secondary indexes and you can add them as necessary and if they improve query performance (they don't always).
This solution also has the added benefit of making updates/inserts fairly quick and without requiring the all the data to undergo a lock.
The biggest downside to this solution is you'll get some "expert" who will walk in, proclaim "joins are bad!" and you'll have to patiently explain to them why they are not.
*If one of the attributes is comprised of a foreign key reference to a composite key, you would not split them apart. Additionally, if those columns will be used together to search for a customer you would include them.
**This pattern requires NULL
/NOT NULL
be enforced through your update/delete processes.
(Really, please do not do these)
Entity-attribute-value (EAV). Looks simple. In practice it's a nightmare, join logic is convoluted, data/relational integrity is impossible to maintain and table locks are disastrous.
Big, ugly, wide table with an index on every column. Wastes a lot of space, performance isn't great.
Big Data/Whatever is hot this week. No/limited indexes. No clear access path to the data. Data/relational integrity may vary from weak to non-existent.
Answered by bbaird on January 6, 2022
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP