
memory error in matrix cosine_similarity

Data Science: Asked by jake Monk on December 21, 2020

I have a dataset of shape (20905040, 7) that I use to recommend 10 different products to the user. It could be larger than that, but in any case I get a memory error when processing the following:

cosine_sim = cosine_similarity(normalized_df,normalized_df)

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input> in <module>
      1 get_ipython().run_line_magic('time', '')
----> 2 cosine_sim = cosine_similarity(normalized_df,normalized_df)

~/venv/lib/python3.6/site-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output)
   1034
   1035     K = safe_sparse_dot(X_normalized, Y_normalized.T,
-> 1036                         dense_output=dense_output)
   1037
   1038     return K

~/venv/lib/python3.6/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
    140         return ret
    141     else:
--> 142         return np.dot(a, b)
    143
    144

MemoryError:

Questions

1. When I have too many rows, how do I apply cosine similarity?
2. Are they talking about RAM? Or what kind of memory error is it?
3. Is there a way to use a GPU for the cosine similarity computation?
4. Any good ideas?

One Answer

This is talking about RAM. There are a few factors that decide how many rows/columns you can use. Instead of rows/columns, it is maybe easier to think in the total number of elements: num_rows * num_cols for the input, and, more importantly here, num_rows * num_rows for the output, since cosine_similarity(X, X) returns a dense matrix with one entry per pair of rows. The memory you will require scales with those numbers.
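To make that concrete, here is a quick back-of-the-envelope check using the shape from the question, assuming the default float64 dtype:

n_rows = 20_905_040                              # rows in the question's dataset
bytes_per_element = 8                            # float64, Pandas' default
output_bytes = n_rows ** 2 * bytes_per_element   # dense n_rows x n_rows result
print(output_bytes / 1e15)                       # ~3.5 petabytes

The output alone would need around 3.5 petabytes of RAM, which no single machine has, so reducing the element size or avoiding the full matrix is essential.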

There are ways to solve the problem with less working memory; memory and speed are usually part of the trade-off. If you use a lot less memory while computing the result, you process fewer rows at a time, so the computation takes longer. One way of doing that is sketched below.
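As one illustration of that trade-off, here is a minimal sketch (not the answer's own code) that computes the similarity in row chunks and keeps only the top-k most similar rows for each row, instead of materializing the full N x N matrix. It assumes X is a NumPy array (e.g. normalized_df.to_numpy()), and chunk_size has to be tuned to your RAM, since each step still builds a chunk_size x N block:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(X, k=10, chunk_size=500):
    # For each row of X, find the indices and similarities of its k most
    # similar rows, one chunk_size x N block at a time.
    n = X.shape[0]
    top_idx = np.empty((n, k), dtype=np.int64)
    top_sim = np.empty((n, k), dtype=np.float32)
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        sims = cosine_similarity(X[start:end], X)          # (chunk, n) block
        idx = np.argpartition(sims, -k, axis=1)[:, -k:]    # k largest, unsorted
        part = np.take_along_axis(sims, idx, axis=1)
        order = np.argsort(-part, axis=1)                  # sort those k descending
        top_idx[start:end] = np.take_along_axis(idx, order, axis=1)
        top_sim[start:end] = np.take_along_axis(part, order, axis=1)
    return top_idx, top_sim

Each row's own index will show up in its top-k (its self-similarity is 1), so filter it out if that is not wanted.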

If you have floating-point numbers (those with decimals), then Pandas usually uses the data type float64 by default. You could try using float32 instead. It offers lower precision (roughly 7 significant decimal digits instead of 16) but uses only half the memory. You can do this by simply adding a conversion before you compute the cosine_similarity:

import numpy as np

# Downcast from float64 to float32: half the memory at lower precision
normalized_df = normalized_df.astype(np.float32)
cosine_sim = cosine_similarity(normalized_df, normalized_df)

Here is a thread about using Keras to compute cosine similarity, which can then be done on the GPU. I would point out that (single) GPUs generally have less working memory available than your computer itself.
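As a rough illustration of the idea (using PyTorch here rather than Keras, purely as an assumed alternative), cosine similarity reduces to a matrix product of row-normalized data, which a GPU handles well. This sketch assumes a CUDA device and that X_np holds your data as a float32 NumPy array:

import torch

X = torch.from_numpy(X_np).to("cuda")         # move the data to the GPU
X = torch.nn.functional.normalize(X, dim=1)   # make every row unit length
block = X[:100] @ X.T                         # cosine similarity, first 100 rows vs all
top_sim, top_idx = block.topk(10, dim=1)      # the 10 most similar rows for each

The block height (100 here) is limited by GPU memory, which is exactly the caveat above: the full similarity matrix will not fit, so you still work block by block.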

Here is a blog that talks about scaling the computation up with tools like Spark for distributed computing. That would allow you to deal with much larger matrices, provided you have several machines available.
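For a flavour of what that looks like (a sketch only, assuming a running Spark cluster and an existing RDD of equal-length feature vectors called vectors_rdd), Spark MLlib's RowMatrix has a columnSimilarities() method that computes cosine similarities between columns with the DIMSUM sampling algorithm, so you orient the matrix with the items to compare as columns:

from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("cosine-sim").getOrCreate()
mat = RowMatrix(vectors_rdd)                   # vectors_rdd: RDD of feature vectors
sims = mat.columnSimilarities(threshold=0.1)   # threshold > 0 trades exactness for speed
for entry in sims.entries.take(5):             # CoordinateMatrix of (i, j, similarity)
    print(entry.i, entry.j, entry.value)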

Answered by n1k31t4 on December 21, 2020
