TransWikia.com

Memory error - Hierarchical Dirichlet Process, HDP gensim

Data Science Asked by work_in_progress on September 2, 2021

I am running a Hierarchical Dirichlet Process (HDP) model using gensim in Python, but because my corpus is very large it throws the following error:

model = gensim.models.HdpModel(corpus, id2word=corpus.id2word, chunksize=50000)



  File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 210, in __init__
    self.update(corpus)
  File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 245, in update
    self.update_chunk(chunk)
  File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 313, in update_chunk
    self.update_lambda(ss, word_list, opt_o)
  File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 415, in update_lambda
    rhot * self.m_D * sstats.m_var_beta_ss / sstats.m_chunksize
MemoryError

I have loaded my corpus using the following statement:

corpus = gensim.corpora.MalletCorpus('chunk5000K_records.mallet')

The data I load the corpus from has 5 million records. The same code works when I load only 50K records. So I added the chunksize option to HdpModel, but it still gives me the error.

Please let me know how I can solve this issue. I am running this on a high-performance computing cluster with a large amount of memory and disk capacity, so I think there should be a way to resolve it.

2 Answers

You can use LDA as an alternative to HDP. HDP won't give hierarchical output: HDP and LDA both produce a flat set of topics. The main practical difference is that HDP infers the number of topics from the data, whereas LDA needs the number of topics fixed in advance. Online LDA is quite memory efficient and also good at capturing topics.

Answered by Gaurav Koradiya on September 2, 2021

Upgrade to Python 3.x if at all possible. It is generally more memory efficient than Python 2.7, and recent gensim releases no longer support Python 2.

Additionally, gensim has a guide to improving code performance. It is called Distributed Computing, but it has a section on improving single-node performance. One suggestion is to make sure a fast BLAS (Basic Linear Algebra Subprograms) library for NumPy is correctly installed and used.
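One way to check which BLAS library NumPy is linked against is a short script like the one below. It is only a sketch: the 500 x 500 matrix size is an arbitrary choice for the example, and the timing is a rough sanity check rather than a proper benchmark.

```python
import time

import numpy as np

# Print the build configuration, including the BLAS/LAPACK libraries
# NumPy was built against (for example OpenBLAS or MKL).
np.show_config()

# Rough sanity check: time one dense matrix multiplication, which NumPy
# delegates to BLAS for floating-point arrays.
rng = np.random.default_rng(0)
a = rng.random((500, 500))
b = rng.random((500, 500))

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

print("500 x 500 matrix product took %.4f seconds" % elapsed)
```

If the configuration output lists no optimized BLAS, or the multiplication is unexpectedly slow compared with another machine, reinstalling NumPy against an optimized BLAS is worth trying.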

Answered by Brian Spiering on September 2, 2021

