
What is the ideal database that allows fast cosine distance?

Data Science Asked by G4bri3l on March 23, 2021

I’m currently trying to store many feature vectors in a database so that, upon request, I can compare an incoming feature vector against many others (if not all) stored in the db. I would need to compute the cosine distance and return only, for example, the 10 closest matches. Each vector will have a size of ~1,000.

Every request will carry a feature vector and will need to run a comparison against all feature vectors belonging to a subset within the db (which will most likely be on the order of thousands of entries per subset in the worst-case scenario).

Which database offers the flexibility to run such a query efficiently?

I looked into Postgres, but I was wondering if there are alternatives that better fit this problem. Not sure it matters much, but I’m most likely going to be using Python.

I found this article about doing it in SQL.

EDIT: I am open to alternative solutions for this problem that are not necessarily tied to SQL.

4 Answers

If it's only a few thousand entries, each with 1,000 features, you may just be able to keep everything in RAM if you are running this on some kind of server. Then, when you get a new feature vector, just run a cosine similarity routine against the stored ones. An easy way to do this is to use something standard like pandas and scikit-learn, as sketched below.
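A minimal sketch of this approach, assuming the subset fits in a NumPy array in memory (the array here is random placeholder data):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder in-memory store: one row per stored feature vector (~1,000 dims)
stored = np.random.rand(5000, 1000)

def top_k_matches(query, k=10):
    """Indices and similarities of the k stored vectors most similar to query."""
    sims = cosine_similarity(query.reshape(1, -1), stored).ravel()
    top = np.argsort(sims)[::-1][:k]
    return top, sims[top]

query = np.random.rand(1000)
indices, scores = top_k_matches(query)
```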

Alternatively, you can keep everything in SQL, load it into something like pandas, and use scikit-learn.

I'm actually not sure you'll get much of a speed-up, if any, by writing the computation in SQL itself.

Correct answer by Wes on March 23, 2021

If you are afraid that the dataset is so big that a regular database might not handle it, you could consider an alternative technique such as SimHash.

From Wikipedia,

In computer science, SimHash is a technique for quickly estimating how similar two sets are. The algorithm is used by the Google Crawler to find near duplicate pages. It was created by Moses Charikar.

Here is the research paper from Google, and here are several implementations in Python.
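For a sense of how this applies to cosine distance: Charikar's paper describes random-hyperplane hashing, where each hash bit records which side of a random hyperplane a vector falls on. A minimal sketch of that idea, with made-up sizes, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, BITS = 1000, 64  # vector dimensionality and signature length (placeholders)

# One random hyperplane per signature bit
planes = rng.standard_normal((BITS, DIM))

def simhash(vec):
    """Signature bit i is 1 when vec lies on the positive side of hyperplane i."""
    bits = planes @ vec > 0
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

# A small Hamming distance between signatures implies a small angle between the
# original vectors, so signatures can cheaply pre-filter candidates before the
# exact cosine computation runs on the survivors.
```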

Answered by Tasos on March 23, 2021

  1. Is there a reason why you need to do this in SQL? Most architecture patterns would advise against keeping formulas and logic in the database layer. Why not create another layer - outside the database - with a language that can do the computations you need?
  2. You can also do the calculations ahead of time and store them in a cached lookup table in the database. Do all the computations you need, import the results into your database, and then just run standard SQL SELECT statements to pull them at run-time (see the sketch after this list).
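A minimal sketch of the cached-lookup idea, assuming SQLite and scikit-learn (the table and column names here are made up, and the data is a random placeholder):

```python
import sqlite3
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

vectors = np.random.rand(2000, 1000)  # placeholder stored feature vectors
sims = cosine_similarity(vectors)     # all pairwise similarities, computed once

conn = sqlite3.connect("neighbors.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS top_matches "
    "(item_id INTEGER, neighbor_id INTEGER, similarity REAL)"
)
rows = []
for i, row in enumerate(sims):
    row[i] = -1.0  # never return an item as its own neighbor
    for j in np.argsort(row)[::-1][:10]:
        rows.append((i, int(j), float(row[j])))
conn.executemany("INSERT INTO top_matches VALUES (?, ?, ?)", rows)
conn.commit()

# Run-time lookup is then a plain SELECT:
# SELECT neighbor_id, similarity FROM top_matches
#   WHERE item_id = ? ORDER BY similarity DESC;
```

Note that this only helps when the query vector is itself already in the database; a brand-new incoming vector still needs a live computation.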

Answered by I_Play_With_Data on March 23, 2021

If you need to scale beyond 1000 entries in the future, a brute-force approach to find the exact neighbors will become increasingly prohibitive from a computational standpoint. To future-proof your solution, I would recommend looking into the well-researched field of approximate nearest neighbors (ANN) techniques. Obviously there is a speed/accuracy tradeoff, but as of this writing, there is really no other way to scale your search to millions or billions of entries.

The big tech companies rely almost exclusively on techniques like these. Think about...

  • Facebook querying similar faces to suggest people to tag in your photos
  • Spotify recommending semantically similar songs
  • Google searching images similar to one you uploaded

This article is an excellent overview of current state-of-the-art algorithms along with their pros and cons. Below, I've linked several popular open-source implementations; all three have Python bindings.
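As an illustration of the ANN workflow (I can't tell which implementations were originally linked, so this uses Spotify's Annoy purely as an example, with random placeholder data):

```python
import numpy as np
from annoy import AnnoyIndex  # pip install annoy

DIM = 1000
index = AnnoyIndex(DIM, "angular")  # "angular" is Annoy's cosine-style metric

# Index each stored feature vector under an integer id (placeholder data)
for i, vec in enumerate(np.random.rand(10_000, DIM)):
    index.add_item(i, vec)

index.build(10)            # 10 trees; more trees trade build time for accuracy
index.save("vectors.ann")  # memory-mapped file, cheap to share across processes

query = np.random.rand(DIM)
ids, dists = index.get_nns_by_vector(query, 10, include_distances=True)
```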

Answered by Addison Klinke on March 23, 2021
