Data Science Asked by Dhaval Thakkar on December 2, 2020
Information Available
Suppose there are n users, each with items described by the following attributes and values:
User A:
Row    | Attribute a | Attribute b | Attribute c
Item 1 | 0.593       | 0.7852      | 0.484
Item 2 | 0.18        | 0.96        | 0.05
Item 3 | 0.423       | 0.886       | 0.156
User B:
Row    | Attribute a | Attribute b | Attribute c
Item 7 | 0.228       | 0.148       | 0.658
Item 8 | 0.785       | 0.33        | 0.887
Item 9 | 0.569       | 0.994       | 0.374
Items in this dataset can be described using the attributes a, b, and c. The items may or may not be the same for different users, but the attributes capture each user's taste.
Currently, I have data for about 1000 users in this format and I can create a classifier for one user that says whether the user will like the given item or not.
Goal
What I want to do is match users who have similar tastes using the information above. I don’t know much about recommendation systems, and I’d really appreciate it if someone could help me out.
One possible approach would be to create N classifiers (one per user), then pick M random items and run them through all N classifiers. The outcome would be something like:
User 1 | User 2 | ... | User N
Item 1: 1 | 0 | ... | 1 --> User 1 and N both like item 1
Item 2: 1 | 1 | ... | 1 --> All users like item 2
... ... | ... | ... | ...
Item M: 0 | 0 | ... | 0 --> No user likes item M
where the i-th row contains the results of running the i-th item through all N classifiers, and the j-th column contains the results of running all M items through the j-th classifier.
Then you could treat each user as an M-dimensional point (one column of this matrix) and use a simple nearest-neighbour method such as KNN, with Hamming distance as the distance metric.
With a larger M, you'd get more accurate results, since you're using more variables to compare each user. The only caveat here is that you'd need those N classifiers to be very accurate, in order to minimize error propagation.
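As a rough illustration of the comparison step described above, here is a minimal pure-Python sketch. It assumes the N per-user classifiers have already been run on the same M items; the `predictions` dictionary, the user names, and the helper functions `hamming_distance` and `nearest_users` are all made up for this example and are not from the original post.

```python
def hamming_distance(u, v):
    """Fraction of items on which two users' predicted labels differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)


def nearest_users(predictions, target, k=1):
    """Return the k users closest to `target` as (user, distance) pairs."""
    distances = [
        (other, hamming_distance(predictions[target], labels))
        for other, labels in predictions.items()
        if other != target
    ]
    # Smallest Hamming distance first, i.e. most similar predicted tastes.
    return sorted(distances, key=lambda pair: pair[1])[:k]


# Hypothetical classifier outputs on M = 6 shared items
# (1 = predicted like, 0 = predicted dislike). Values are invented.
predictions = {
    "user_1": [1, 1, 0, 1, 0, 0],
    "user_2": [0, 1, 0, 0, 1, 0],
    "user_3": [1, 1, 0, 1, 1, 0],
}

print(nearest_users(predictions, "user_1", k=2))
```

Here each user is stored as a row of predictions rather than a column, which is only a layout choice; the distance between two users is the same either way.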
Answered by Fábio Colaço on December 2, 2020
I might be misreading your data, but I assume that the Item # will repeat and that they're not unique to the User. Though, in your example there is no overlap.
If I'm right and the items are finite, then I'd add a column for each item × attribute combination (that's a lot of columns), which creates a pretty sparse matrix like this:
+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---+--------------------+--------------------+--------------------+
| | Item 1 Attribute a | Item 1 Attribute b | Item 1 Attribute c | Item 2 Attribute a | Item 2 Attribute b | Item 2 Attribute c | … | Item 9 Attribute a | Item 9 Attribute b | Item 9 Attribute c |
+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---+--------------------+--------------------+--------------------+
| User A | 0.593 | 0.7852 | 0.484 | 0.18 | 0.96 | 0.05 | … | | | |
| User B | | | | | | | … | 0.569 | 0.994 | 0.374 |
+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---+--------------------+--------------------+--------------------+
Then, given a new user and their attributes, you could use cosine similarity to find the rows in this data nearest to the new row. I think the key to your problem is that each row should be a user if you're doing user-to-user similarity.
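To make the cosine-similarity step concrete, here is a small sketch in plain Python. The rows for User A and User B are copied from the question; `new_user` is a hypothetical user invented for illustration. Filling blanks with 0.0 is an assumption of this sketch (cosine similarity needs a number in every column), and the helper names `to_row` and `cosine_similarity` are not from the original post.

```python
import math

# Item rows copied from the question; each tuple is (attribute a, b, c).
users = {
    "User A": {"Item 1": (0.593, 0.7852, 0.484),
               "Item 2": (0.18, 0.96, 0.05),
               "Item 3": (0.423, 0.886, 0.156)},
    "User B": {"Item 7": (0.228, 0.148, 0.658),
               "Item 8": (0.785, 0.33, 0.887),
               "Item 9": (0.569, 0.994, 0.374)},
}
# Hypothetical new user, invented for this example.
new_user = {"Item 1": (0.6, 0.8, 0.5), "Item 8": (0.1, 0.1, 0.1)}

# Fixed column order: every item seen anywhere, three attributes each.
all_items = sorted({item for rows in [*users.values(), new_user] for item in rows})


def to_row(items):
    """Flatten one user's items into a single row; blanks become 0.0 (assumption)."""
    row = []
    for item in all_items:
        row.extend(items.get(item, (0.0, 0.0, 0.0)))
    return row


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no information to compare on
    return dot / (norm_u * norm_v)


target = to_row(new_user)
ranked = sorted(
    ((name, cosine_similarity(target, to_row(items))) for name, items in users.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)  # most similar existing user first
```

Note that two users with no items in common always get a similarity of 0 under this layout, which is one reason the sparsity of the matrix matters.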
You could of course look into describing the users with metadata about them, and if you had product attributes you could look into something more complex like Matrix Factorization. I'm no expert, but I'm just trying to point you in the right direction.
A lot of what direction you take will depend on how sparse the resulting matrix is (how many blanks it has) when you make each row a user. Too many columns? You could try dimensionality reduction next. Some techniques work better than others on a sparse matrix.
Too many users? You could do clustering and assign each user to a cluster. Then, you perform the exercise on clusters rather than individual users in the next step.
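One way the clustering idea might look is sketched below with a bare-bones k-means in plain Python. The user vectors are invented, the choice of k-means itself is an assumption (the answer does not name a clustering algorithm), and starting the centroids at the first k points is a simplification for reproducibility; a library implementation with better initialisation would normally be preferable.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; centroids start at the first k points (simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical user rows (one vector per user); values are invented.
user_names = ["u1", "u2", "u3", "u4", "u5"]
user_rows = [
    [0.9, 0.1, 0.5],
    [0.1, 0.9, 0.5],
    [0.8, 0.2, 0.4],
    [0.2, 0.8, 0.6],
    [0.85, 0.15, 0.5],
]

labels, centroids = kmeans(user_rows, k=2)
print(dict(zip(user_names, labels)))
```

A new user could then be compared against the handful of centroids first, and only against the members of the closest cluster afterwards.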
There are lots of ways this could go; sorry for not having anything more specific to say. However, I think the key is setting up the data so that each row is a user.
Answered by Josh on December 2, 2020