Data Science Asked by G4bri3l on March 23, 2021
I’m currently trying to store many feature vectors in a database so that, upon request, I can compare an incoming feature vector against many others (if not all) stored in the db. I would need to compute the cosine distance and return only, for example, the 10 closest matches. Each vector will have roughly 1,000 dimensions.
Every request will carry a feature vector and will need to be compared against all feature vectors belonging to a subset within the db (most likely on the order of thousands of entries per subset in the worst case).
Which database offers the flexibility to run such a query efficiently?
I looked into Postgres, but I was wondering whether there are alternatives that fit this problem better. Not sure it matters much, but I’m most likely going to be using Python.
I found this article about doing it in SQL.
EDIT: I am open to alternative solutions for this problem that are not necessarily tied to SQL.
If it's only a few thousand entries, each with ~1,000 features, you may just be able to keep everything in RAM if you are running this on some kind of server. Then, when you get a new feature vector, just run a cosine-similarity routine against the stored ones. An easy way to do this is to use something standard like pandas and scikit-learn.
Alternatively you can keep everything in SQL, load it into something like pandas and use scikit-learn.
I'm actually not sure you'll get much of a speed up, if any, by writing the computation in SQL itself.
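For example, a minimal sketch of the in-RAM approach using scikit-learn's cosine_similarity (the array sizes and variable names below are just placeholders):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder in-memory store: one row per stored feature vector
# (a few thousand rows of ~1,000 features fits comfortably in RAM).
stored = np.random.rand(5000, 1000)

def top_k_matches(query, vectors, k=10):
    """Return indices of the k stored vectors closest to `query` by cosine distance."""
    sims = cosine_similarity(query.reshape(1, -1), vectors)[0]  # shape (n_vectors,)
    return np.argsort(-sims)[:k]                                # most similar first

query = np.random.rand(1000)
print(top_k_matches(query, stored))
```

If the vectors live in SQL, the same routine works after pulling them into memory, for example with pandas.read_sql.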
Correct answer by Wes on March 23, 2021
If you are afraid that the dataset is so big that a regular database might not handle it, you could consider an alternative approach such as SimHash.
From Wikipedia,
In computer science, SimHash is a technique for quickly estimating how similar two sets are. The algorithm is used by the Google Crawler to find near duplicate pages. It was created by Moses Charikar.
Here is the research paper from Google, and here are several implementations in Python.
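As a rough sketch of the idea (the random-hyperplane variant from Charikar's paper), each vector is reduced to a short bit signature, and the Hamming distance between signatures approximates the angle between the original vectors; the dimension and bit count below are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, BITS = 1000, 64          # feature dimension and signature length (placeholders)

# Random hyperplanes; generate these once and reuse them for every vector.
planes = rng.standard_normal((BITS, DIM))

def simhash(vec):
    """One bit per hyperplane: set if the vector lies on its positive side."""
    return (planes @ vec > 0).astype(np.uint8)

def hamming(sig_a, sig_b):
    return int(np.count_nonzero(sig_a != sig_b))

# A small Hamming distance between signatures implies a small angle, i.e. high
# cosine similarity, so candidates can be filtered cheaply before exact comparison.
v1, v2 = rng.standard_normal(DIM), rng.standard_normal(DIM)
print(hamming(simhash(v1), simhash(v2)))
```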
Answered by Tasos on March 23, 2021
Answered by I_Play_With_Data on March 23, 2021
If you need to scale beyond 1000 entries in the future, a brute-force approach that finds the exact neighbors will become increasingly prohibitive from a computational standpoint. To future-proof your solution, I would recommend looking into the well-researched field of approximate nearest neighbor (ANN) search. There is obviously a speed/accuracy tradeoff, but as of this writing there is really no other way to scale your search to millions or billions of entries.
The big tech companies rely almost exclusively on techniques like these. Think about...
This article is an excellent overview of the current state-of-the-art algorithms along with their pros and cons. Below, I've linked several popular open-source implementations. All three have Python bindings.
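To give a feel for what these libraries look like in practice, here is a minimal sketch using hnswlib as one example of an ANN index with Python bindings (the parameters below are illustrative defaults, not tuned values, and the data is random placeholder data):

```python
import hnswlib
import numpy as np

dim, n = 1000, 100_000
data = np.random.rand(n, dim).astype(np.float32)   # placeholder feature vectors

# Build an HNSW index over cosine distance.
index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))
index.set_ef(50)   # higher ef -> better recall, slower queries

# Approximate 10 nearest neighbors for an incoming vector.
query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=10)
print(labels[0], distances[0])   # distances are 1 - cosine similarity
```

The index can also be persisted to disk (save_index / load_index), so it does not have to be rebuilt on every request.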
Answered by Addison Klinke on March 23, 2021