Neural Network - Sparsity of collaborative based filtering and modelling the prediction problem

Question

I'm fairly new to machine learning and for that matter, neural networks, but for the past couple of days I decided to take a stab at a fairly classical and practical problem of neural networks/machine learning which is recommendation systems.

Apologies if this is an unnecessarily broad question, but I found it hard to read up on resources answering this particular question. My main question is, how do you even model the problem (or what directions/advice is there on how to model it)?

Let $M$ denote the set of all movies you can recommend, each of which have an associated $id$ to them. What exactly would the input to the model be (and the output) and how would you segregate the training data and the observed results? For example, if I have a row of training data (row of size $|M|$ and each entry is a number between $1$ and $10$ that denotes the user's rating, or $0$ if the user has not seen it yet), would I just remove, say half of the user's ratings that are between $1$ and $10$ (so are meaningful) and replace them with $0$ in the training row and then in my observed row (for testing) it would be the entire row with no missing ratings like illustrated below?

$$underbrace{[9  0  0  5  7  0  ...  8  0  10  0]}_{Information} rightarrow underbrace{[0  0  0  0  0  0  ...  8  0  10  0]}_{Training} rightarrow underbrace{[9  0  0  5  7  0  ...  8  0  10  0]}_{Observed} $$

A few naive ways that are obvious are:

Input size = $|M|$, output size = $|M|$

Each input neuron is given the associated user rating for the $i^{th}$ movie. Assume some number of intermediate layers and then we get to the output size, which is $|M|$ output neurons each of which attempt to predict the predicted rating of the user for that movie. This seems weird, because basically almost all of the input neurons will be given $0$ and almost all of the output neurons will try to output $0$ (since it's likely a user has only seen an insignificant portion of all $|M|$ movies). This doesn't seem like the neural network will learn anything useful. And what do I do for movies that the user hasn't seen? Do I just consider this a $0$ loss? So the loss function for neuron $i$ is just:
$$text{Loss}(y^{(i)}) = 0  text{if the user hasn't seen ith movie, otherwise: }  phi{(y^{(i)}, t^{(i)})}$$

Where $phi$ is some loss function of the neuron predicted output $y^{(i)}$ and the observed value (the actual user rating) $t^{(i)}$. I find it difficult to believe that this will yield any useful ratings though..since the loss has no effect on the nonsensical outputs of my neural network (it doesn't affect the $i^th$ neurons' values if the user hasn't seen it yet). This also poses the problem of how a prediction will be made then. Do I just take all the highest predicted values from the neural network and return the respective movies?

Because of the above problems, I tried to express the model differently, but I don't think there is anyway to overcome the sparsity of the data and also the fact that the output size must be $|M|$ as I can't have it try to predict discrete integer movie ids. Could someone please enlighten me on this topic or provide some possible insight to obvious mistakes I am making?

Hadi Gharibi · Answer

You don't have to use full input and output vectors to predict the ratings for a user.

Use the embedding idea and make pairs of (user_id,movie_id) for training. So your input could be like this: (112,1456)=7, it means the user with id number 112 watched the movie number 1456 and the rating was 7.So the size of embedding for the User is  |unique users|  and movies are |unique movies| you call it |M|.

With this trick, we changed the structure of the problem to what we already know. It's regression!

Now you can add all the bias(and whatever you want) to movies and users embeddings. Use the known loss functions for regression like 
RMSE and you all set. For sure you can add other linear layers, relu, and dropouts on top of this embeddings and make it just like fully connected NN and get even better results.

Neural Network - Sparsity of collaborative based filtering and modelling the prediction problem

Input size = $|M|$, output size = $|M|$

One Answer

Add your own answers!

Ask a Question