How to find the most common strings between rows

Question

In SQL 2017, I have a 'food' table as follows:
ID    item    Ingredients
----  -----   -----------
1     item1   flour,sugar,cocoa,butter
2     item2   flour,sugar,butter,water
3     item3   sugar,cocoa,water
4     item4   sugar,cocoa,butter
5     item5   flour,water

Although the "Ingredients" are a CSV in a single field, I have no problem moving them to a related table:
ID    Ingredient
---   ----------
 1    flour
 1    sugar
 1    cocoa
 1    butter
 2    flour

etc
or just select them with a CROSS APPLY string_split
select ID, item, ingredient
from food
CROSS APPLY string_split(food.ingredients,',') as ing

Now here is what I am trying to attempt:
For any given row ex:(where id = 1) I want to return the top n rows (other than id = 1) that have the most common ingredients with id = 1.
So the top 2 with the most common ingredients with id = 1 should be:
ID    item    Ingredients
----  -----   -----------
2     item2   flour,sugar,butter,water    ( 3 in common)
3     item3   sugar,cocoa,water           ( 2 in common)
4     item4   sugar,cocoa,butter          ( 3 in common)
5     item5   flour,water                 ( 1 in common)

The result should return rows with id = 2 and id = 4 as both have 3 ingredients in common with row id = 1
This is my first question on StackExchange, and I hope it is clear.
I can write this as a C# or Python batch program to create a table of 'closest ingredients', but I'm hoping I can do this in SQL.
Thanks !

bbaird · Answer

Please store the ingredients in a table not a list.  So let's call that ItemIngredient with two columns ItemId and Ingredient and leave it at that.  The primary key will be (ItemId, Ingredient).  Because searches by Ingredient will probably happen a lot, let's put a non-unique index on Ingredient (which is necessary for below to work the most efficiently).
SELECT
  ItemId
 ,MatchItemId
 ,IngredientMatchCount
 ,MatchRank
FROM
  (
    SELECT
      Base.ItemId
     ,Match.ItemId AS MatchItemId
     ,COUNT(*) AS IngredientMatchCount
     ,RANK() OVER (PARTITION BY Base.ItemId ORDER BY COUNT(*) DESC, Match.ItemId) AS MatchRank
    FROM
      ItemIngredient Base
    INNER JOIN
      ItemIngredient Match
        ON Match.Ingredient = Base.Ingredient 
             AND Match.ItemId <> Base.ItemId
    -- Put your WHERE clause here if you want to only fine matches for certain Ids
    GROUP BY
      Base.ItemId
     ,Match.ItemId
  ) matched
WHERE
  MatchRank <= 1 --Or whatever
ORDER BY
  ItemId

You can experiment with RANK vs DENSE_RANK at some point to see what provides the best results for you.  If you need details about the Item you can join back in the outer query.

How to find the most common strings between rows

One Answer

Add your own answers!

Ask a Question