Data Science Asked on June 19, 2021
Say there are 2 users, A and B, and each of them has shared 10 images on some social media site. I have collected the images in 2 separate folders, one per user.
I want to calculate the similarity between the 2 users based on all the images they shared. For example, there is a sense of the 2 users being more similar if they both share pictures of food; and the 2 users would be dissimilar if one of them shares pictures of food while the other posts selfies. And the 2 users should be even more similar if they both share pictures of food containing chicken dishes only. And so on.
How do I calculate this similarity? I thought of computing the similarity of each pair of images and then averaging the results. But this clearly doesn’t work. Take a single user, for instance: if you calculate this user’s similarity to itself this way, the result will not be 1, because the average also includes pairs of different images. Yet we expect the similarity of something with itself to be 1.
So, how do we measure this similarity of users based on their images shared?
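The self-similarity problem described above can be reproduced with a minimal sketch. It assumes each image is already represented by an embedding vector and that pairs of images are compared by cosine similarity; the function name `mean_pairwise_similarity` and the 2-D vectors are hypothetical and only serve the illustration.

```python
import numpy as np

def mean_pairwise_similarity(a, b):
    """Average cosine similarity over every (row of a, row of b) pair."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a @ b.T).mean())

# Hypothetical 2-D embeddings for the three images of a single user.
user = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Comparing the user with itself: only the 3 diagonal pairs score 1,
# the 6 off-diagonal pairs score less, so the mean drops below 1.
self_sim = mean_pairwise_similarity(user, user)
print(round(self_sim, 4))  # about 0.6476
```

The more varied a user's own images are, the lower this self-similarity gets, which is why the averaged score cannot be read on a scale where 1 means "identical users".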
EDIT: I am getting the following similarities in my actual use case:
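import os
import time

import pandas as pd

# image_similarity(path1, path2) is defined elsewhere and returns the similarity of two images.
img_path = 'Images/All_Users/'
user_list = ['User ' + str(i) for i in range(1, 11)]
df = pd.DataFrame(columns=user_list, index=user_list, dtype=float)

for user1 in user_list:
    # List the first user's JPEG files once, not once per inner iteration.
    user1_files = [x for x in os.listdir(img_path + user1 + '/') if "jpg" in x]
    for user2 in user_list:
        t = time.time()
        user2_files = [x for x in os.listdir(img_path + user2 + '/') if "jpg" in x]
        # Sum the similarity over every pair of images, then average.
        sum_img_sim = 0
        for image1 in user1_files:
            for image2 in user2_files:
                sim = image_similarity(img_path + user1 + '/' + image1, img_path + user2 + '/' + image2)
                sum_img_sim += sim
        # Store the mean pairwise similarity (column user1, row user2); .loc avoids chained indexing.
        df.loc[user2, user1] = sum_img_sim / (len(user1_files) * len(user2_files))
        print(user1 + ' and ' + user2 + ' done in ' + str(time.time() - t) + ' secs. Similarity: ' + str(df.loc[user2, user1]))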
The resulting similarities barely differ from one another, and I’m getting anomalies: User 2 has a similarity of 0.47 with itself, while User 1 and User 10 have a similarity of 0.49.
Use a similarity vector with one dimension per possible similarity, that is, per attribute that images can have in common (such as "food" or "selfie"), initially set to zero everywhere. Go over the shared images of one user and, for each image, set the vector to 1 at the positions of the attributes that image has.
Do the same for the other user.
Use the scalar product of the two vectors as the similarity metric between the two users.
You can add weights for attributes that appear more than once in a user's shared images.
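The steps above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the `ATTRIBUTES` vocabulary, the per-image tag sets, and the helper names are hypothetical, and in practice the tags would have to come from manual labels or an image classifier. The `cosine` helper is an addition that is not part of the answer; it normalizes the scalar product so that a user compared with itself scores 1, which is the property the question asks for.

```python
import numpy as np

# Hypothetical attribute vocabulary (one vector position per attribute).
ATTRIBUTES = ["food", "chicken", "selfie", "landscape"]

def user_vector(image_tags, weighted=False):
    """Build one vector per user from the tag sets of their shared images."""
    vec = np.zeros(len(ATTRIBUTES))
    for tags in image_tags:
        for tag in tags:
            idx = ATTRIBUTES.index(tag)
            # Binary mode sets the position to 1; weighted mode counts occurrences.
            vec[idx] = vec[idx] + 1 if weighted else 1.0
    return vec

def scalar_product(u, v):
    """Similarity as suggested in the answer: the plain dot product."""
    return float(np.dot(u, v))

def cosine(u, v):
    """Assumed extension: dot product normalized by the vector lengths."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

# Hypothetical users: A and B share food pictures, C shares selfies.
user_a = user_vector([{"food", "chicken"}, {"food"}])
user_b = user_vector([{"food", "chicken"}])
user_c = user_vector([{"selfie"}])

print(scalar_product(user_a, user_b))  # 2.0
print(scalar_product(user_a, user_c))  # 0.0
```

With `weighted=True`, user A's vector becomes `[2, 1, 0, 0]`, so attributes that recur across a user's images contribute more to the scalar product.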
Answered by Eugen on June 19, 2021