
Apply Labeled LDA on large data

Data Science: Asked by Xiancheng Li on May 5, 2021

I’m using a dataset containing about 1.5M documents. Each document comes with some keywords describing its topics (so it is multi-labelled), and each document has one or more authors. I want to find out the topics each author is interested in by looking at the documents they write. I’m currently looking at an LDA variant, Labeled LDA, proposed by Ramage et al.: https://www.aclweb.org/anthology/D/D09/D09-1026.pdf . I use all the documents in my dataset to train a model, and then use the model to predict the labels of an author; here I represent an author by the aggregation of all documents belonging to them, i.e. I treat authors like documents.
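To make the aggregation step concrete, here is a minimal sketch of how I build one pseudo-document per author (the field names are illustrative, not my real schema):

    from collections import defaultdict

    def build_author_documents(documents):
        # documents: iterable of dicts like
        # {"authors": ["a1", "a2"], "tokens": ["topic", "model", ...]}
        # Returns {author: concatenated token list} to feed to the trained model.
        author_docs = defaultdict(list)
        for doc in documents:
            for author in doc["authors"]:
                author_docs[author].extend(doc["tokens"])
        return dict(author_docs)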

I’m using Python, and I found some implementations available on GitHub:
https://github.com/JoeZJH/Labeled-LDA-Python
https://github.com/shuyo/iir/blob/master/lda/llda.py

I tried the first one and tested it on a small dataset. It worked, but on large data the memory required grew a lot (for initialisation, this implementation allocates a zero matrix of size #labels times #terms/#vocabulary). I think I need to find a way to reduce the dimension of my corpus. I thought about directly reducing the size of my vocabulary, but that will also shorten my texts, and LDA does not work well on short texts. I’m afraid that if I do this, my texts will become too short and some information will be lost in the process.
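For a sense of scale, here is a rough estimate of that dense initialisation matrix, together with a frequency-based pruning of the rare tail of the vocabulary (most distinct terms occur only a few times, so dropping them shrinks the vocabulary far more than it shortens the texts). The label and vocabulary counts are assumptions, not my real numbers:

    from collections import Counter

    n_labels, n_vocab = 5_000, 500_000          # assumed counts
    print(n_labels * n_vocab * 8 / 1e9, "GB")   # ~20 GB for a dense float64 matrix

    def prune_vocabulary(tokenised_docs, min_count=5, max_size=100_000):
        # Keep at most max_size terms, each occurring at least min_count times.
        counts = Counter(tok for doc in tokenised_docs for tok in doc)
        keep = {t for t, c in counts.most_common(max_size) if c >= min_count}
        return [[tok for tok in doc if tok in keep] for doc in tokenised_docs]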

So here are my questions:

  1. Is my method theoretically correct? (I would appreciate it if someone could propose other models suitable for this task.)

  2. Is there any method to reduce the dimension while keeping as much information as possible? For instance, is TF-IDF an option for the reduction?

I would appreciate any advice on these two questions.

2 Answers

Your matrix will be sparse, so a dense representation will use memory inefficiently. Have a look at LSH (locality-sensitive hashing) for clustering:

https://en.wikipedia.org/wiki/Locality-sensitive_hashing
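As a minimal sketch of one common LSH family (random hyperplanes for cosine similarity), assuming the documents are already vectorised, e.g. with TF-IDF:

    import numpy as np

    def lsh_signatures(X, n_bits=64, seed=0):
        # Project each row of X (n_docs x n_features) onto n_bits random
        # hyperplanes and keep only the sign; similar documents share most
        # signature bits and can be bucketed cheaply.
        rng = np.random.default_rng(seed)
        planes = rng.standard_normal((X.shape[1], n_bits))
        return (X @ planes) > 0                  # n_docs x n_bits boolean matrix

    # Toy usage: rows 0 and 1 are near-duplicates and agree on most bits.
    X = np.array([[1.0, 0.9, 0.0], [0.9, 1.0, 0.0], [0.0, 0.1, 1.0]])
    sigs = lsh_signatures(X, n_bits=16)
    print((sigs[0] == sigs[1]).mean(), (sigs[0] == sigs[2]).mean())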

Answered by Alex Nikiforov on May 5, 2021

Is my method theoretically correct? (I will appreciate it if someone can propose other models suitable for this task)

Yes, but remember about stemming or lemmatization and removing stop words and punctuation.
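For example, a minimal preprocessing pass with NLTK (assuming the punkt, stopwords and wordnet resources have been downloaded) might look like:

    import string
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        # Lowercase, tokenise, drop stop words and punctuation, lemmatize.
        tokens = word_tokenize(text.lower())
        return [lemmatizer.lemmatize(t) for t in tokens
                if t not in stop_words and t not in string.punctuation]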

If there is any method to reduce the dimension while keeping as much information as possible. For instance, is TF-IDF an option for the reduction?

TF-IDF is one of the best options.
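For instance, scikit-learn's TfidfVectorizer can cap the vocabulary size directly while weighting terms by informativeness; the limits below are illustrative and should be tuned on the real corpus:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["labeled lda on large corpora", "topic models for author interests"]  # placeholder corpus

    # max_features caps the vocabulary; min_df drops very rare terms.
    vectorizer = TfidfVectorizer(min_df=1, max_features=50_000)
    X = vectorizer.fit_transform(docs)           # sparse n_docs x n_terms matrix
    print(X.shape, len(vectorizer.vocabulary_))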

Answered by fuwiak on May 5, 2021
