Cosine similarity with arrays containing NaN

Question

I am trying to calculate cosine similarity in Python in order to find similar users based on the ratings they have given to movies. As can be expected, there are a lot of NaN values.
I am using a movie dataset from Kaggle.

When I use np.dot() on two ndarrays, the result is nan.
I have checked with np.nansum() that the arrays also contain values other than nan.

I do not want to replace all nan values with 0, because that would mean the users gave those movies a rating of 0, which would lead to a false similarity between users.

Please, give me some advice regarding how to proceed with this problem.

Thanks in advance.

王耀东 · Answer

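from scipy.spatial.distance import cosine  # assumed source of cosine()

def cosine_sim(df1, df2):
    # Keep only the positions where df1 has a rating
    df1na = df1.isna()
    df1clean = df1[~df1na]
    df2clean = df2[~df1na]

    # Of those, keep only the positions where df2 also has a rating
    df2na = df2clean.isna()
    df1clean = df1clean[~df2na]
    df2clean = df2clean[~df2na]

    # Compute cosine similarity (cosine() returns a distance)
    distance = cosine(df1clean, df2clean)
    sim = 1 - distance

    return sim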
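
The same idea, keeping only the movies that both users have rated, can also be written directly against NumPy arrays. The following is a minimal sketch rather than a definitive implementation: the function name `nan_cosine_similarity` and the sample rating vectors are made up for illustration, and returning NaN when the two users share no rated movies is an assumed convention.

```python
import numpy as np

def nan_cosine_similarity(a, b):
    """Cosine similarity over the positions where both vectors are non-NaN."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    # True only where both users have a rating
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return np.nan  # assumed convention: no overlap means "undefined"

    a, b = a[both], b[both]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return np.nan  # an all-zero vector has no direction
    return float(np.dot(a, b) / denom)

# Hypothetical ratings; NaN marks a movie the user has not rated
user_a = [5.0, np.nan, 3.0, 1.0]
user_b = [4.0, 2.0, np.nan, 1.0]

# Only the first and last movies are rated by both users
sim_ab = nan_cosine_similarity(user_a, user_b)
print(sim_ab)
```

Note that the fewer movies two users have in common, the less reliable the resulting number is, so it may be worth requiring a minimum overlap before trusting it.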

fuwiak · Answer

There are a couple of strategies for handling NaN values. Here is some sample code:

def na_handling(df, name_of_strategy):
    # Strategies: mean, mode, 0, next_row, previous_row
    # (specific_value is not implemented here)

    if name_of_strategy == "previous_row":
        # Forward fill: copy the last valid value from the rows above
        return df.ffill()
    elif name_of_strategy == "next_row":
        # Backward fill: copy the next valid value from the rows below
        return df.bfill()
    elif name_of_strategy == "0":
        return df.fillna(0)
    elif name_of_strategy == "mean":
        return df.fillna(df.mean())
    elif name_of_strategy == "mode":
        # mode() can return several values on ties; use the first one
        return df.fillna(df.mode().iloc[0])
    else:
        raise ValueError("Unknown strategy: " + str(name_of_strategy))

vec1 = na_handling(old_vec, "next_row")

Sean Owen · Answer

I think it's rarely meaningful to compute cosine similarity on sparse data like this. Part of the problem is the sparsity itself (the measure is only defined for dense vectors), but more importantly it's not obvious that cosine similarity is meaningful here. For example, a user who rates 10 movies all 5s has perfect similarity with a user who rates those same 10 movies all 1s. Magnitude doesn't matter in cosine similarity, but it matters in your domain.
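
The 5s-versus-1s example can be checked numerically. This is a small illustrative sketch; the helper name `cosine_similarity` and the two vectors are assumptions made for the demonstration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Plain cosine similarity for two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

all_fives = np.full(10, 5.0)  # a user who loved all 10 movies
all_ones = np.full(10, 1.0)   # a user who disliked all 10 movies

# The vectors point in the same direction, so the similarity is
# approximately 1.0 (up to floating-point rounding)
sim = cosine_similarity(all_fives, all_ones)
print(sim)
```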

It's much more likely to be meaningful on a dense embedding of users and items, such as what you get from ALS (alternating least squares).
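
To make the embedding idea concrete, here is a hand-rolled toy version of ALS in NumPy, followed by cosine similarity between the resulting user vectors. It is only a sketch under stated assumptions: `als_factorize`, the tiny rating matrix, and the hyperparameters (`k`, `reg`, `n_iters`) are all invented for illustration, and a real project would more likely use a library implementation.

```python
import numpy as np

def als_factorize(R, k=2, reg=0.1, n_iters=20, seed=0):
    """Toy ALS: factor R (users x items, NaN = unrated) into U @ V.T."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(R)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    ridge = reg * np.eye(k)

    for _ in range(n_iters):
        # Fix the item factors and solve a small ridge regression per user,
        # using only the movies that user actually rated
        for u in range(n_users):
            rated = observed[u]
            Vu = V[rated]
            U[u] = np.linalg.solve(Vu.T @ Vu + ridge, Vu.T @ R[u, rated])
        # Fix the user factors and solve per item in the same way
        for i in range(n_items):
            raters = observed[:, i]
            Ui = U[raters]
            V[i] = np.linalg.solve(Ui.T @ Ui + ridge, Ui.T @ R[raters, i])
    return U, V

# Hypothetical 4 users x 4 movies rating matrix
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 5.0, 1.0],
    [1.0, 1.0, np.nan, 5.0],
    [np.nan, 1.0, 2.0, 4.0],
])

U, V = als_factorize(R)

# Reconstruction error on the observed ratings only
observed = ~np.isnan(R)
rmse = float(np.sqrt(np.mean((R[observed] - (U @ V.T)[observed]) ** 2)))

# Cosine similarity between dense user embeddings (no NaN left to handle)
unit = U / np.linalg.norm(U, axis=1, keepdims=True)
user_sims = unit @ unit.T
print(np.round(user_sims, 3))
```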

To answer the question, you either need to impute the missing ratings (don't assume 0, but a mean value or similar), or ignore dimensions that aren't defined in both.
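
The imputation option could look like the following sketch, which replaces each user's missing ratings with that user's own mean before computing the cosine similarity. Using the per-user mean (rather than a per-movie or global mean) is just one assumed choice among several, and `impute_user_mean` and the sample vectors are hypothetical.

```python
import numpy as np

def impute_user_mean(ratings):
    """Replace NaN entries with the mean of the user's observed ratings."""
    ratings = np.asarray(ratings, dtype=float)
    # np.nanmean ignores NaN; it returns NaN if the user rated nothing
    return np.where(np.isnan(ratings), np.nanmean(ratings), ratings)

# Hypothetical ratings; NaN marks a movie the user has not rated
user_a = impute_user_mean([5.0, np.nan, 3.0, 1.0])
user_b = impute_user_mean([4.0, 2.0, np.nan, 3.0])

# Both vectors are now dense, so the usual formula applies
sim_imputed = float(
    np.dot(user_a, user_b) / (np.linalg.norm(user_a) * np.linalg.norm(user_b))
)
print(user_a, user_b, sim_imputed)
```

Unlike filling with 0, this does not pretend the user gave the lowest possible rating, although it still invents values the user never supplied.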
