The Task

We are given a set of tuples representing user-movie ratings, \(\mathcal{R} = \{(i,j,r)\}\), where \(i\) is the user index, \(j\) is the movie index and \(r \in [0, \dots, 5]\) the rating. In the most common scenarios, only a fraction of all user-movie pairs are rated, hence we can talk about sparse data. In other words, most users have only rated a few movies. Now we want to obtain sensible predictions for movies a user has not rated. I leave drawing to picture as to why this is super relevant in many use cases to your imagination. Note that we only have this explicit feedback from users and don’t have any context knowledge about the users or movies. We can only leverage ratings from other users to build predictions. This is called collaborative filtering.

Why plain ‘looking at similar users’ won’t work

In order to predict \(\hat{r_{ij}}\), one might come up with the idea of looking for a user \(i’\) with a rating history that was close or similar to that of user \(i\) and see how those users rated movie \(j\). One often refers to this as ‘Neighborhood’ methods. This approach has its limitation due to the sparseness of the data. As most users have only rated ‘few’ movies, it might be that the users who actually have similar preferences to \(i\) have rated completely different movies than \(i\) and therefore will not be detected as similar. In short, the high dimensionality and sparseness of the data makes this approach fairly unpromising. We will overcome this issue by using a lower-dimensional representation, aka embedding, of the data.

Low-hanging fruit: SVD

Given a matrix \(M\), the Singular Value Decomposition gives us a factorization of it: \(M = U \Sigma D^T\). As we are rather looking for a factorization with only two instead of three terms it comes in very handy that \(\Sigma\) is both diagonal and positive. Hence we can write \(M = PQ^T\) with \(P := U \sqrt{\Sigma}, Q := D \sqrt{\Sigma}\). Informally, given that \(M\) holds \(users \times movies\), \(P\) will then hold \(users \times topics\) and \(Q\) \(movies \times topics\). These topics are implicitly chosen by SVD and also referred to as hidden/latent factors, concepts, features or dimensions of the embedding.

Why can we use SVD?

In order to apply SVD, we need \(M\) to come from somewhere. The skeleton of \(M\) will be the given user ratings. For each \((i,j,r) \in \mathcal{R}\) we set \(M[i,j] = r\). Now this makes for a ‘matrix’ full of holes, as most index tuples of the matrix will not have a rating. So not a real matrix. We fix this by imputing \(M\), i.e. we fill the holes by a heuristic and call this new, actually legitimate matrix \(M’\). Filling a hole \((i,j)\) by the average of row \(i\), i.e. the average rating given by user \(i\), typically works somewhat decently. As you can imagine, there are far more elaborate heuristics for this. Yet it remains an imputation, we don’t just get more signal for free.

Why do we use SVD?

We welcome the fact that SVD ranks the features by their importance for reconstruction of the initial data. In addition, given an embedding-dimensionality \(k\), we know that SVD will give us optimal properties on a certain reconstruction loss function.

The idea behind why we desire this intermediate low-dimensional representation in the first place is the assumption that the data is actually generated from a distribution on a low-dimensional space to which a lot of noise is added. This noise is then manifested in a high-dimensional space. This assumption is not to be taken for granted and generally this approach is sometimes considered not very principled. Moving on: said high-dimensional space is \(\{0,1,2,3,4,5, unrated\}^{\#movies}\). What we hope is that eliminating all but the most important hidden factors from SVD will give us this underlying representation and eliminate all the noise. Then from low dimensional representation, i.e. embedding, we map back to the space we originally came from and have more accurate rating estimations, with hopefully less noise.

How do we use SVD?

Instead of factorizing \(M’\) with \(PQ^T\), we now approximate it. We can do that by dropping all but \(k\) of the columns (from the right as they are ordered by ‘importance’) of \(P\) and \(Q\) and calling those matrices \(P_k\) and \(Q_k\), with fewer hidden factors. We obtain: \(M’ \approx P_kQ_k^T\). \(k\) is typically chosen by trial and error as well as thresholding on the importance values. For the sake of concreteness, think of the following orders of magnitude:

\(\#users\) \(\#movies\) \(k\)
\(10^6\) \(10^5\) \(10^1\)

Given such a \(k\) and the matrix \(M’\), SVD will minimize the following loss function for \(P_k\) and \(Q_k\): \[ \mathcal{L} = ||M’ - P_kQ_k^T||_N\] where \(N\) indicates either 2 or Frobenius norm.

Is this it? Said they’d give you anything you ever wanted.

Yes. We obtain a hopefully denoised and enhanced estimate of rating \(r_{ij}\) by the following steps:

  1. SVD embeds both users and movies in lots of hidden factors. We get the user embedding \(p_i \in R^{\#movies}\) and movie embedding \(q_j \in R^{\#users}\).
  2. We cut \(p_i\) to only use its first \(k\) elements/dimensions.
  3. We cut \(q_j\) to only use its first \(k\) elements/dimensions.
  4. We use the dot-product as a similarity measure. Given that both the user and movie are now represented in the same space of hidden factors, our rating is just \(p_i^Tq_j\).

Hence, the overall predictions will be contained in \(P_kQ_k^T\).

Conclusion

SVD beautifully leverages fairly simple Linear Algebra to produce somewhat decent predictions. Given its simplicity, this method works great. Nevertheless there are three substantial shortcomings that come to my mind:

Problem1 SVD successfully minimizes a loss function. Yet said loss function weights the imputed, i.e. made-up, values just as much as the actual ratings. This sounds terrible to me.
Problem2 We might want to have explicit biases for users and movies, involved in the estimation of each rating.
Problem3 No regularization, which could lead to overfitting on a few very dominant dimensions.

Those pain points give rise to defining a new loss function.

Addendum: Iterated SVD

The chosen method of imputation is very impactful, as most values will be imputed. Lovely Jonas Felber recommended to reduce the importance of initial imputation and improve estimate quality by the following process:

M # imputed data matrix
k # chosen dimensionality
for i in range(n_epochs):
  U, S, D = svd(M) # SVD composition, factors ordered according to importance
  P = U * sqrt(S) # User embeddings
  Q = D * sqrt(S) # Movie embeddings
  M = P[:, :k] * Q[:, :k].transpose
  fill_initial_user_ratings_back_into_matrix(M)
return M

Essentially, now we’re using the predictions from a previous iteration as a way of imputing the data in the current iteration, over and over again. My understanding is that the reinsertion of the given ratings reduces the importance of the initial, manual imputation. This approach increased the quality of estimates drastically, in my case.

Not so low hanging fruit

We first define a different loss function and then think of how to minimize it.

Towards a more appropriate loss function

Addressing the issues encountered with SVD, we will no longer be able to guarantee an optimal solution, largely due to the problem not being convex. Still, a local optimum incorporating fixes to those issues is promising to be enhancement over SVD. In a way, I think of sacrificing efficiency for effectiveness.

Problem1 Instead of iterating over all indices, including those of imputed values, we just look at the given ratings. \[ \mathcal{L} = \sum_{(i,j,r) \in R} (r - \hat{r_{ij}})^2\]
Problem2 We can simply add bias terms per user and movie, such that our estimate looks like this: \[\hat{r_{ij}} = p_i^T q_j + bu_i + bm_j \]
Problem3 As we are trying to learn both embeddings and biases of users and movies, all of those should be regularized. \[D = \lambda(||U||_F^2 + ||Q||_F^2 + ||bu||_2^2 + ||bm||_2^2)\]

Hence our overall loss function is \(\mathcal{L_D} = \mathcal{L} + D\)

Stochastic Gradient Descent

Despite the loss function, as well as e.g. \(\frac{\partial \mathcal{L_D}}{\partial p_i}\), iterating over all given ratings, we aim to avoid iterations over the whole dataset per update/iteration to relieve our machines. Our compromise is to only look at one randomly chosen rating \((i,j,r)\) at a time and investigate the loss function: \[\mathcal{L_{i,j}} = (r - \hat{r_{ij}})^2 + \lambda(||p_i||_2^2 + ||q_j||_2^2 + ||bu_i||_2^2 + ||bm_j||_2^2)\] Note that the sum has been substituted by a single sample. Therefore our overall process is to shuffle the training set, iterate over individual datapoints and update embeddings and biases impacted by that datapoint. We repeat this process until ‘convergence’ is reached. Note that this approach still requires initialization of \(P\) and \(Q\) to begin with. I think a fairly conventional approach is to initialize them with random numbers from \([0,1]\). For my purposes, it has worked well to initialize them via aforementioned SVD.

Updates

Note that the loss function is not convex in the tuple \((P, Q)\), hence it we will most likely not minimize it but get stuck in a local minimum. We hope that this local minimum is good enough. Upon investigating rating \((i,j,r)\), updates will be of the following form : \[q_i \leftarrow q_i - \eta \frac{\partial \mathcal{L_{i,j}}}{\partial p_i}\] \[\frac{\partial \mathcal{L_{i,j}}}{\partial p_i} = 2(r - r_{ij})(-q_j) + 2 \lambda p_i\] Where \(\eta\) corresponds to our learning rate, e.g. fixed at \(10^{-3}\) or adapted along the way. What remains to determine is what we update when. I encountered three different approaches towards ordering updates of different variables. Three approaches with increasing performance in my use case.

  1. Alternating optimization:

     for _ in range(n_epochs): # break when convergence is reached
       shuffle(dataset)
       shuffled_half_1, shuffled_half_2 = halve(dataset)
       # This could very well be a different proportion of the dataset, other
       # than 1/2.
       for (i, j, r) in shuffled_half_1:
         update(q_i)
         update(bu_i)
       for (i, j, r) in shuffled_half_2:
         update(p_j)
         update(bm_j)
    

    This approach revolves around the fact that \(\mathcal{L_D}\) is convex in \(P\) for a fixed \(Q\) and vice versa. With fixed counterpart, our embeddings have a closed form, least squares solution. This however assumes invertibility of the ratings matrix. This is not practical here because the matrix is sparse and its inversion would be costly. Hence we also use gradient descent for alternating optimization.

  2. Simultaneous updates:
     for _ in range(n_epochs): # break when convergence is reached
       shuffle(dataset)
       for (i, j, r) in dataset:
         update(p_i)
         update(q_j)
         update(bu_i)
         update(bm_j)
    

    Update everything at the same time, i.e. per encountered datapoint \((i,j,r)\), update \(q_i, p_j, bu_i, bm_j\).

  3. Coordinate Descent:
     for d in range(n_dimensions):
       for _ in range(n_epochs): # break when convergence is reached
         shuffle(dataset)
         for (i, j, r) in dataset:
           update(p_i[d])
           update(q_j[d])
           update(bu_i)
           update(bm_j)
    

    Per datapoint \((i,j,r)\), instead of updating \(p_i, q_j\), we update only one dimension of the embedding vectors at a time. In particular we only update the \(d\)-th dimension until we have reached ‘convergence’. We then continue by fixing dimension \(d\) and start the same procedure for dimension \(d+1\). To my knowledge, this approach was popularized by Simon Funk. I’m a little puzzled as to why this method beat the simultaneous updates approach.

Further reading