Cross Validated Asked by user3676846 on February 11, 2021
I am working on the MovieLens 100K dataset. The idea is to use ML algorithms such as neural nets, SVMs, K-means, etc. to classify movies as being rated 1, 2, 3, 4, or 5. The problem I am facing is determining which features to use. Since I am taking the collaborative filtering approach, I am using just the u.data dataset, which is 100K tuples of the form
user | movie | rating
I want a way to extract meaningful features from this data so that I could use approaches like neural nets and SVM on those features.
So I did some research on what exactly the dataset provides:
The full u data set, 100000 ratings by 943 users on 1682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered. This is a tab separated list of user id | item id | rating | timestamp. The time stamps are unix seconds since 1/1/1970 UTC
and
u.item -- Information about the items (movies); this is a tab separated list of movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | The last 19 fields are the genres, a 1 indicates the movie is of that genre, a 0 indicates it is not; movies can be in several genres at once. The movie ids are the ones used in the u.data data set.
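As a rough sketch of how the two files described above could be joined into feature rows, the snippet below parses u.data-style and u.item-style text and looks up each rated movie's 19 genre flags. The function names (`parse_ratings`, `parse_items`) and the sample rows are made up for illustration. It also assumes u.item is delimited by `|`, as the field list suggests, even though the description calls it tab separated; pass a different `delimiter` if your copy differs.

```python
import io

# Genre column names, in the order listed in the u.item description above.
GENRES = [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]


def parse_ratings(handle):
    """Parse u.data-style lines: user id, item id, rating, timestamp (tab separated)."""
    ratings = []
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        user, item, rating, timestamp = line.rstrip("\n").split("\t")
        ratings.append((int(user), int(item), int(rating), int(timestamp)))
    return ratings


def parse_items(handle, delimiter="|"):
    """Map movie id -> list of 19 genre flags (0/1), taken from the last 19 fields."""
    genres_by_movie = {}
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split(delimiter)
        genres_by_movie[int(fields[0])] = [int(flag) for flag in fields[-len(GENRES):]]
    return genres_by_movie


# Made-up sample rows in the documented layout (not real MovieLens records).
sample_data = "1\t10\t4\t1000\n2\t10\t2\t1500\n1\t11\t5\t900\n"
sample_item = (
    "10|Example Movie A (1995)|01-Jan-1995||http://example.com/a|"
    "0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0|0\n"
    "11|Example Movie B (1996)|01-Jan-1996||http://example.com/b|"
    "0|0|0|0|0|1|0|0|1|0|0|0|0|0|1|0|0|0|0\n"
)

ratings = parse_ratings(io.StringIO(sample_data))
genres_by_movie = parse_items(io.StringIO(sample_item))

# One feature row per rating: the movie's genre flags, with the rating as the label.
features = [genres_by_movie[item] for _, item, _, _ in ratings]
labels = [rating for _, _, rating, _ in ratings]
```

With the real files you would open u.data and u.item instead of the `io.StringIO` samples. The resulting `features` and `labels` lists are the kind of input a feed-forward classifier or an SVM expects.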
So you have to choose which features you want to use, but this depends on the type of network you want (recurrent or feed-forward). You should give more information on how you want your network to operate.
Personally, I would use a recurrent network of some kind (LSTM layers mixed with fully connected layers). I would organize the dataset so that the network is trained on one user at a time; during training, the context (recurrent state) of the network should be cleared between users.
Features I would select:
Note that the user id is missing, as you want this network to work with any given user. I chose not to include the timestamp of the rating, as I think feeding the ratings one by one in chronological order would incorporate this information.
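The per-user, chronological feeding described above could be prepared roughly as follows. This is only a sketch: `build_user_sequences` is a hypothetical helper, the toy data is made up, and it assumes the per-step input is the movie's genre flags with the rating as the target. The user id serves only as a grouping key and the timestamp only as a sort key, so neither ends up in the features.

```python
from collections import defaultdict


def build_user_sequences(ratings, genres_by_movie):
    """Group ratings by user and order each user's ratings by timestamp.

    The user id is only used as a grouping key; it is not part of the features.
    Returns a list of sequences, each a list of (genre_flags, rating) steps.
    """
    by_user = defaultdict(list)
    for user, item, rating, timestamp in ratings:
        by_user[user].append((timestamp, item, rating))

    sequences = []
    for user in sorted(by_user):
        # Sort this user's ratings from oldest to newest.
        events = sorted(by_user[user], key=lambda event: event[0])
        sequences.append([(genres_by_movie[item], rating) for _, item, rating in events])
    return sequences


# Made-up toy data: (user id, movie id, rating, unix timestamp), in random order.
toy_ratings = [(1, 10, 4, 1000), (2, 10, 2, 1500), (1, 11, 5, 900), (2, 12, 3, 1200)]
# Shortened 3-flag genre vectors for readability (the real data has 19 flags).
toy_genres = {10: [1, 0, 0], 11: [0, 1, 0], 12: [0, 0, 1]}

sequences = build_user_sequences(toy_ratings, toy_genres)

for sequence in sequences:
    # A recurrent model's hidden state would be reset here, once per user,
    # before the steps below are fed in one at a time.
    for genre_flags, rating in sequence:
        print(genre_flags, "->", rating)
```

Each inner list is one user's history in time order, so resetting the recurrent state at the start of every outer-loop iteration keeps one user's context from leaking into the next.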
Answered by Thomas Wagenaar on February 11, 2021