Data Science Asked by user641597 on January 13, 2021
I am trying to calculate cosine similarity in Python in order to find similar users based on the ratings they have given to movies. As can be expected, there are a lot of NaN values.
I am using a movie dataset from Kaggle.
When I use np.dot() on two ndarrays, the result is 'nan'.
I have checked with np.nansum() that the arrays do contain values other than 'nan'.
I do not want to change all 'nan' values to 0, as that would mean the users had given a rating of 0 to those movies, which would lead to a false similarity between users.
Please give me some advice on how to proceed with this problem.
Thanks in advance.
from scipy.spatial.distance import cosine

def cosine_sim(df1, df2):
    # Keep only the movies that both users have rated
    both_rated = df1.notna() & df2.notna()
    df1clean = df1[both_rated]
    df2clean = df2[both_rated]
    # Compute cosine similarity (SciPy returns the cosine distance)
    distance = cosine(df1clean, df2clean)
    sim = 1 - distance
    return sim
Answered by 王耀东 on January 13, 2021
There are several strategies for handling NaN values. Here is a code sample:
def na_handling(df, name_of_strategy):
    # Strategies: mean, mode, 0, next_row, previous_row
    # (filling with a specific value is not implemented here)
    if name_of_strategy == "previous_row":
        # Forward fill: copy the last valid value from an earlier row
        df.ffill(inplace=True)
    elif name_of_strategy == "next_row":
        # Backward fill: copy the next valid value from a later row
        df.bfill(inplace=True)
    elif name_of_strategy == "0":
        df.fillna(0, inplace=True)
    elif name_of_strategy == "mean":
        df.fillna(df.mean(), inplace=True)
    elif name_of_strategy == "mode":
        # mode() can return several values; use the first one
        df.fillna(df.mode().iloc[0], inplace=True)
    else:
        raise ValueError("Wrong specified strategy: " + str(name_of_strategy))
    return df

vec1 = na_handling(old_vec, "next_row")
Answered by fuwiak on January 13, 2021
I think it's rarely meaningful to consider cosine similarity on sparse data like this, not just because of sparsity (because it's only defined for dense data), but because it's not obvious the cosine similarity is meaningful. For example a user that rates 10 movies all 5s has perfect similarity with a user that rates those 10 all as 1. Magnitude doesn't matter in cosine similarity, but it matters in your domain.
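To make the point about magnitude concrete, here is a minimal sketch using NumPy. The helper name `cosine_similarity` and the two rating vectors are made up for illustration; they are not taken from the question's dataset.

```python
import numpy as np

def cosine_similarity(a, b):
    """Plain cosine similarity between two dense vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical users: one rates all 10 movies as 5, the other rates them all as 1
all_fives = np.full(10, 5.0)
all_ones = np.full(10, 1.0)

similarity = cosine_similarity(all_fives, all_ones)
print(similarity)  # approximately 1.0, up to floating-point rounding
```

The two vectors point in the same direction, so the similarity is (up to rounding) perfect even though the users disagree completely about how good the movies are.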
It's much more likely that it's meaningful on some dense embedding of users and items, such as what you get from ALS.
To answer the question, you either need to impute the missing ratings (don't assume 0, but a mean value or similar), or ignore dimensions that aren't defined in both.
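Both options could be sketched roughly as follows, assuming each user is a 1-D NumPy array of ratings with NaN for unrated movies. The function names, the choice of the user's own mean as the imputed value, and the sample vectors are illustrative assumptions, not something prescribed above.

```python
import numpy as np

def cosine_sim_co_rated(u, v):
    """Cosine similarity over the movies that both users have rated."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    both_rated = ~np.isnan(u) & ~np.isnan(v)
    if not both_rated.any():
        return np.nan  # no overlap, so the similarity is undefined
    u_c, v_c = u[both_rated], v[both_rated]
    denom = np.linalg.norm(u_c) * np.linalg.norm(v_c)
    if denom == 0:
        return np.nan  # an all-zero vector has no direction
    return float(np.dot(u_c, v_c) / denom)

def cosine_sim_mean_imputed(u, v):
    """Cosine similarity after filling each user's gaps with that user's mean rating."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Assumes each user has rated at least one movie; otherwise nanmean is NaN
    u_f = np.where(np.isnan(u), np.nanmean(u), u)
    v_f = np.where(np.isnan(v), np.nanmean(v), v)
    return float(np.dot(u_f, v_f) / (np.linalg.norm(u_f) * np.linalg.norm(v_f)))

# Hypothetical ratings; NaN marks a movie the user has not rated
user_a = np.array([5.0, np.nan, 3.0, np.nan, 1.0])
user_b = np.array([4.0, 2.0, np.nan, np.nan, 1.0])

sim_co_rated = cosine_sim_co_rated(user_a, user_b)
sim_imputed = cosine_sim_mean_imputed(user_a, user_b)
print(sim_co_rated, sim_imputed)
```

With the co-rated variant, two users who share only one or two movies can still get a very high score, so in practice it may be worth requiring a minimum number of shared ratings before trusting the result.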
Answered by Sean Owen on January 13, 2021