Euclidean distance of all pandas rows to single row

Question

I have a dataset that gives the values of some songs, i.e. something that looks like:
    acousticness danceability energy instrumentalness key  liveness  loudness 
0        0.223      0.780      0.72       0.111        1     0.422    0.231
1        0.4        0.644      0.88       0.555        0.5   0.66     0.555
2        0.5        0.223      0.145      0.76         0     0.144    0.567
.
.
.

I want to find the songs/rows that are numerically closest to another song, such as song 0, using the Euclidean distance. So I'd like to obtain something like:
    acousticness danceability energy instrumentalness key  liveness  loudness Euclidean to song 0
0        0.223      0.780      0.72       0.111        1     0.422    0.231       0
1        0.4        0.644      0.88       0.555        0.5   0.66     0.555      1.334
2        0.5        0.223      0.145      0.76         0     0.144    0.567     1.442
.
.
.

yatu · Answer

The usual procedure for what you're trying to do is to use one of sklearn's pairwise metrics, such as cosine_similarity, and build a similarity matrix with it:
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

cosine_similarity(df)
array([[1.        , 0.86597679, 0.38431913],
       [0.86597679, 1.        , 0.71838491],
       [0.38431913, 0.71838491, 1.        ]])

This gives you a square matrix whose row and column indices correspond to the dataframe's song index, so entry (i, j) is the similarity between song i and song j.
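
To make the indexing concrete, here is a minimal sketch that rebuilds the three sample rows from the question and reproduces the same matrix with plain NumPy. The names `features`, `unit_rows` and `sim` are illustrative and not part of the original answer; it assumes the frame holds only the seven numeric feature columns.

```python
import numpy as np
import pandas as pd

# The three sample rows from the question.
df = pd.DataFrame({
    "acousticness":     [0.223, 0.400, 0.500],
    "danceability":     [0.780, 0.644, 0.223],
    "energy":           [0.720, 0.880, 0.145],
    "instrumentalness": [0.111, 0.555, 0.760],
    "key":              [1.0, 0.5, 0.0],
    "liveness":         [0.422, 0.660, 0.144],
    "loudness":         [0.231, 0.555, 0.567],
})

features = df.to_numpy(dtype=float)

# Cosine similarity: scale every row to unit length, then take all dot products.
unit_rows = features / np.linalg.norm(features, axis=1, keepdims=True)
sim = unit_rows @ unit_rows.T

# sim[i, j] is the similarity between song i and song j, so row 0 holds
# the similarity of every song with song 0.
print(sim.round(4))
```

Reading off `sim[0]` gives the similarities of all songs with song 0, which is the single-item case discussed next.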

Similarity with a single item
If you're only interested in the similarities with a specific song, say song 0, you can pass that row as a second array, so that every item in the input matrix is compared with that one item only.
Since you mentioned the Euclidean distance, here's one using sklearn's euclidean_distances. Note that we have to subtract the result from 1 to turn the distances into a similarity score, where larger means closer. The score is 1 for an identical row and can go negative, because the distance is not bounded by 1. If we want the actual distance, we can just keep the resulting array. For the three rows shown (output rounded to three decimals):
1-euclidean_distances(df, df.to_numpy()[0,None])
array([[ 1.   ],
       [ 0.173],
       [-0.526]])

For the distance, just (output rounded to three decimals):
euclidean_distances(df, df.to_numpy()[0,None])
array([[0.   ],
       [0.827],
       [1.526]])
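
The same distances can also be obtained without sklearn, straight from the definition d(x, y) = sqrt(sum_k (x_k - y_k)^2). The following is a minimal sketch, assuming the frame holds only the numeric feature columns from the question; `features`, `reference` and `dist` are illustrative names.

```python
import numpy as np
import pandas as pd

cols = ["acousticness", "danceability", "energy", "instrumentalness",
        "key", "liveness", "loudness"]
df = pd.DataFrame(
    [[0.223, 0.780, 0.720, 0.111, 1.0, 0.422, 0.231],
     [0.400, 0.644, 0.880, 0.555, 0.5, 0.660, 0.555],
     [0.500, 0.223, 0.145, 0.760, 0.0, 0.144, 0.567]],
    columns=cols,
)

features = df.to_numpy(dtype=float)
reference = features[0]  # song 0

# Broadcasting subtracts the reference row from every row; the norm along
# axis 1 is then the Euclidean distance of each row to the reference.
dist = np.linalg.norm(features - reference, axis=1)

print(dist.round(3))
```

For these three rows the distances come out at roughly 0, 0.827 and 1.526, and `dist` is already one-dimensional, so no `squeeze()` is needed before assigning it to a column.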

To add it as a new column (run this on the original feature columns only; if df already holds a distance or similarity column from an earlier step, that column is included in the computation and skews the result):
df['Similarity with song 0'] = 1-euclidean_distances(df, df.to_numpy()[0,None]).squeeze()

print(df)

   acousticness  danceability  energy  instrumentalness  key  liveness  \
0         0.223         0.780   0.720             0.111  1.0     0.422
1         0.400         0.644   0.880             0.555  0.5     0.660
2         0.500         0.223   0.145             0.760  0.0     0.144

   loudness  Similarity with song 0
0     0.231                1.000000
1     0.555                0.172848
2     0.567               -0.526101
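
Since the goal is to find the closest songs, one way to finish is to store the distance itself and sort by it. The sketch below is an illustration under stated assumptions rather than part of the original answer: `feature_cols`, `ref_idx` and the column name are hypothetical, the frame is assumed to have the default integer index, and selecting `feature_cols` explicitly keeps a previously added distance or similarity column from leaking into the computation.

```python
import numpy as np
import pandas as pd

cols = ["acousticness", "danceability", "energy", "instrumentalness",
        "key", "liveness", "loudness"]
df = pd.DataFrame(
    [[0.223, 0.780, 0.720, 0.111, 1.0, 0.422, 0.231],
     [0.400, 0.644, 0.880, 0.555, 0.5, 0.660, 0.555],
     [0.500, 0.223, 0.145, 0.760, 0.0, 0.144, 0.567]],
    columns=cols,
)

feature_cols = cols  # hypothetical: the columns that describe a song
ref_idx = 0          # hypothetical: the song to compare against

# Use only the feature columns, so that re-running this cell does not
# feed the distance column back into the distance computation.
features = df[feature_cols].to_numpy(dtype=float)
df["Euclidean to song 0"] = np.linalg.norm(features - features[ref_idx], axis=1)

# Closest songs first, leaving out the reference song itself.
# (Assumes the default RangeIndex, where label and position coincide.)
closest = df.drop(index=ref_idx).sort_values("Euclidean to song 0")
print(closest[["Euclidean to song 0"]])
```

With the sample rows, song 1 is listed before song 2, because it lies closer to song 0.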
