
memory error in matrix cosine_similarity

Data Science: Asked by jake Monk on December 21, 2020

I have a dataset of shape (20905040, 7) that I use to recommend 10 different products to the user. It could be larger than that, but in any case I get a memory error when processing the following:

cosine_sim = cosine_similarity(normalized_df,normalized_df)

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input> in <module>
      1 get_ipython().run_line_magic('time', '')
----> 2 cosine_sim = cosine_similarity(normalized_df,normalized_df)

~/venv/lib/python3.6/site-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output)
   1034
   1035     K = safe_sparse_dot(X_normalized, Y_normalized.T,
-> 1036                         dense_output=dense_output)
   1037
   1038     return K

~/venv/lib/python3.6/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
    140         return ret
    141     else:
--> 142         return np.dot(a, b)
    143
    144

MemoryError:

Questions

1. When I have too many rows, how do I apply cosine similarity?
2. Are they talking about RAM? Or what kind of memory error is it?
3. Is there a way to use a GPU for the cosine similarity computation?
4. Any good ideas?

One Answer

This is talking about RAM. There are a few factors that decide how many rows/columns you can use. Instead of rows/columns, it is maybe easier to think in the total number of elements: num_rows * num_cols for the input, and, more importantly here, num_rows * num_rows for the output, since cosine_similarity(X, X) returns a dense matrix with one entry per pair of rows. The memory you will require scales with those numbers.
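To make that concrete, here is a quick back-of-the-envelope check using the shape from the question, assuming the default float64 dtype:

n_rows = 20_905_040                              # rows in the question's dataset
bytes_per_element = 8                            # float64, Pandas' default
output_bytes = n_rows ** 2 * bytes_per_element   # dense n_rows x n_rows result
print(output_bytes / 1e15)                       # ~3.5 petabytes

The output alone would need around 3.5 petabytes of RAM, which no single machine has, so reducing the element size or avoiding the full matrix is essential.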

There are ways to solve the problem with less working memory; memory and speed are usually part of the trade-off. If you use a lot less memory while computing the result, you process fewer rows at a time, so the computation takes longer. One way of doing that is sketched below.
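As one illustration of that trade-off, here is a minimal sketch (not the answer's own code) that computes the similarity in row chunks and keeps only the top-k most similar rows for each row, instead of materializing the full N x N matrix. It assumes X is a NumPy array (e.g. normalized_df.to_numpy()), and chunk_size has to be tuned to your RAM, since each step still builds a chunk_size x N block:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(X, k=10, chunk_size=500):
    # For each row of X, find the indices and similarities of its k most
    # similar rows, one chunk_size x N block at a time.
    n = X.shape[0]
    top_idx = np.empty((n, k), dtype=np.int64)
    top_sim = np.empty((n, k), dtype=np.float32)
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        sims = cosine_similarity(X[start:end], X)          # (chunk, n) block
        idx = np.argpartition(sims, -k, axis=1)[:, -k:]    # k largest, unsorted
        part = np.take_along_axis(sims, idx, axis=1)
        order = np.argsort(-part, axis=1)                  # sort those k descending
        top_idx[start:end] = np.take_along_axis(idx, order, axis=1)
        top_sim[start:end] = np.take_along_axis(part, order, axis=1)
    return top_idx, top_sim

Each row's own index will show up in its top-k (its self-similarity is 1), so filter it out if that is not wanted.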

If you have floating-point numbers (those with decimals), then Pandas usually uses the data type float64 by default. You could try using float32 instead. It offers lower precision (roughly 7 significant decimal digits instead of 16) but uses only half the memory. You can do this by simply adding a conversion before you compute the cosine_similarity:

import numpy as np

# Downcast from float64 to float32: half the memory at lower precision
normalized_df = normalized_df.astype(np.float32)
cosine_sim = cosine_similarity(normalized_df, normalized_df)

Here is a thread about using Keras to compute cosine similarity, which can then be done on the GPU. I would point out that (single) GPUs generally have less working memory available than your computer itself.
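As a rough illustration of the idea (using PyTorch here rather than Keras, purely as an assumed alternative), cosine similarity reduces to a matrix product of row-normalized data, which a GPU handles well. This sketch assumes a CUDA device and that X_np holds your data as a float32 NumPy array:

import torch

X = torch.from_numpy(X_np).to("cuda")         # move the data to the GPU
X = torch.nn.functional.normalize(X, dim=1)   # make every row unit length
block = X[:100] @ X.T                         # cosine similarity, first 100 rows vs all
top_sim, top_idx = block.topk(10, dim=1)      # the 10 most similar rows for each

The block height (100 here) is limited by GPU memory, which is exactly the caveat above: the full similarity matrix will not fit, so you still work block by block.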

Here is a blog that talks about scaling the computation up with tools like Spark for distributed computing. That would allow you to deal with much larger matrices, provided you have several machines available.
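For a flavour of what that looks like (a sketch only, assuming a running Spark cluster and an existing RDD of equal-length feature vectors called vectors_rdd), Spark MLlib's RowMatrix has a columnSimilarities() method that computes cosine similarities between columns with the DIMSUM sampling algorithm, so you orient the matrix with the items to compare as columns:

from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("cosine-sim").getOrCreate()
mat = RowMatrix(vectors_rdd)                   # vectors_rdd: RDD of feature vectors
sims = mat.columnSimilarities(threshold=0.1)   # threshold > 0 trades exactness for speed
for entry in sims.entries.take(5):             # CoordinateMatrix of (i, j, similarity)
    print(entry.i, entry.j, entry.value)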

Answered by n1k31t4 on December 21, 2020
