How to group chat messages by topic?

Question

I am a newbie in this field. I have been a developer for more than 20 years but have never done anything (except tutorials) with ML, DL, or NLP. I have, however, already read a number of articles and tutorials about this technology, and I am starting to figure out the steps and conditions required to make it work.

What I would like to achieve (the reason for my question) is this:

I have a file containing about 2 years of conversations between me and another person. I would like to extract sequences of messages that are related by the same topic. I mean messages that are sequential in time and that belong to the same conversation about a topic.

My goal is to extract the time we spent on every single topic.
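To make the goal concrete, here is a minimal sketch of the final step only, assuming the hard part is already solved and every message carries a topic label. The `(timestamp, topic)` tuple format, the `time_per_topic` name, and the choice to measure each run from its first to its last message are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from itertools import groupby


def time_per_topic(messages):
    """Sum the duration of every run of consecutive same-topic messages.

    `messages` is a list of (timestamp, topic) tuples sorted by time.
    A run's duration is taken as last timestamp minus first timestamp,
    so a run made of a single message contributes zero (an assumption).
    """
    totals = defaultdict(timedelta)
    for topic, run in groupby(messages, key=lambda m: m[1]):
        run = list(run)
        totals[topic] += run[-1][0] - run[0][0]
    return dict(totals)


# Hypothetical, hand-labeled sample data.
messages = [
    (datetime(2020, 1, 1, 10, 0), "holiday"),
    (datetime(2020, 1, 1, 10, 5), "holiday"),
    (datetime(2020, 1, 1, 10, 20), "work"),
    (datetime(2020, 1, 1, 10, 30), "work"),
    (datetime(2020, 1, 1, 18, 0), "holiday"),
    (datetime(2020, 1, 1, 18, 10), "holiday"),
]
totals = time_per_topic(messages)
print(totals["holiday"], totals["work"])  # 0:15:00 0:10:00
```

Note that the same topic can come back later in the file (as "holiday" does here), which is why the durations are accumulated per run rather than per topic as a whole.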

Is there any model already trained for this task? (This is of course the obvious question :-P )

Or is there any model that I can use as a starting point for training?

Or, if not, what would be a good approach (steps, techniques, software) to train one myself? (If possible.)

Thanks

Update

Thanks to Erwan's response I did some more research. I am not sure whether Erwan's answer can be considered a definitive answer, but there is no doubt that it pointed me to a possible research direction, "removing part of the fog in front of my eyes". Since in my case I don't have a labeled dataset to train a supervised model, I started to search for LDA solutions (as you implicitly suggested, and also based on several tutorials I found) and found some: gensim's LDA model and parallelized LDA model; lda-project; I also found a Java implementation, the MALLET LDA class, etc.

I also started to browse topics like Word2Vec and FastText (and family, I guess), even if I am not sure what the purpose of these tools is compared to LDA, possibly because I do not have a clear and complete view of the information that LDA models can "capture" compared to the others.

However, the first step was to find software that supports the Italian language "natively", and I found spaCy, a tool that provides lemmatization, tokenization, etc. and seems intended to prepare text for other topic detection software (it also seems to have its own class for this, guessing by the name, TextCategorizer, but I am not able to figure out how to use it because of the lack of complete documentation and examples).

So my guess is I could use spaCy to prepare the text to be fed to gensim's LDA (or any other implementation).
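A rough sketch of that spaCy-to-gensim pipeline might look like the following. It assumes spaCy's Italian model `it_core_news_sm` and gensim are installed; the function names, the stop-word and alphabetic filtering, and the values for `num_topics` and `passes` are illustrative choices, not recommendations.

```python
def lemmatize(texts, model_name="it_core_news_sm"):
    """Turn raw messages into lists of lowercase lemmas with spaCy."""
    import spacy  # third-party; imported lazily so the helpers below stay usable

    nlp = spacy.load(model_name)
    return [
        [tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop]
        for doc in nlp.pipe(texts)
    ]


def fit_lda(token_lists, num_topics=10):
    """Fit a gensim LDA model on already tokenized documents."""
    from gensim.corpora import Dictionary  # third-party
    from gensim.models import LdaModel

    dictionary = Dictionary(token_lists)
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=10, random_state=0)
    return lda, dictionary


def dominant_topic(topic_probs):
    """Pick the most probable topic id from (topic_id, probability) pairs,
    the format returned by LdaModel.get_document_topics."""
    return max(topic_probs, key=lambda pair: pair[1])[0]


def topics_for_messages(texts, num_topics=10):
    """Hypothetical glue function: one dominant topic id per message."""
    token_lists = lemmatize(texts)
    lda, dictionary = fit_lda(token_lists, num_topics)
    return [
        dominant_topic(lda.get_document_topics(dictionary.doc2bow(tokens)))
        for tokens in token_lists
    ]


print(dominant_topic([(0, 0.2), (3, 0.7), (5, 0.1)]))  # 3
```

Calling `topics_for_messages(list_of_chat_messages)` would give one topic id per message. Single chat messages are often very short, so the resulting topic assignments may be noisy.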

For data visualization I found pyLDAvis.

Erwan · Answer

This problem is related to the following standard problems:

topic classification/modeling, which ranges from simple supervised document classification to unsupervised assignment of topic distributions to every document (the advanced option, with Latent Dirichlet Allocation and variants)
sequence labeling, a supervised task which predicts classes for every instance in a sequence, taking into account the order of the instances (e.g. it can leverage the fact that the class of instance $n$ is influenced by the class of instance $n-1$). 
text segmentation, more precisely topic segmentation in this case.

I think that one can find a lot of good implementations for the first two problems, which are very common. However adapting these to your case and/or combining them will probably be more complex.
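For the third problem, topic segmentation, the basic idea can be illustrated without any library. The sketch below is a heavily simplified lexical-cohesion approach (loosely in the spirit of TextTiling, not an implementation of it): it compares the words in the messages just before and just after each position and places a boundary where the overlap drops. The function names, the `window` size, and the `threshold` value are assumptions for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(messages):
    """Count the lowercase word occurrences over a list of messages."""
    counts = Counter()
    for text in messages:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def find_boundaries(messages, window=2, threshold=0.1):
    """Return every index i where a new topic segment starts at messages[i].

    A boundary is placed when the `window` messages before position i
    share almost no vocabulary with the `window` messages from i onward.
    """
    boundaries = []
    for i in range(1, len(messages)):
        left = bag_of_words(messages[max(0, i - window):i])
        right = bag_of_words(messages[i:i + window])
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries


# Toy conversation with stop words already removed.
chat = [
    "book flight rome",
    "flight rome monday",
    "pizza dinner tonight",
    "dinner pizza margherita",
]
print(find_boundaries(chat))  # [2]
```

On real chat text, stop words and inflected forms would dominate the overlap, so lemmatization and stop-word removal would be needed first, and the threshold would have to be tuned on some manually segmented data.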

If you have a small set of topics and you have (or can create) a reasonable sample of messages annotated with these topics, I would suggest starting with sequence labeling with Conditional Random Fields. I think this could give good results given the sequential nature of chat conversations.
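As a minimal sketch of what this could look like, assuming the third-party `sklearn_crfsuite` package and conversations given as lists of message strings with one topic label per message: the feature design (words of the current and previous message) and the hyperparameter values are illustrative assumptions only.

```python
def message_features(messages, i):
    """Feature dict for message i, including words of the previous message
    so the model can use the local context."""
    words = messages[i].lower().split()
    features = {"bias": 1.0, "n_words": len(words)}
    for w in words:
        features["word=" + w] = 1.0
    if i > 0:
        for w in messages[i - 1].lower().split():
            features["prev_word=" + w] = 1.0
    else:
        features["first_message"] = 1.0
    return features


def conversation_features(messages):
    """One feature dict per message, in conversation order."""
    return [message_features(messages, i) for i in range(len(messages))]


def train_crf(conversations, label_sequences):
    """Fit a linear-chain CRF: each conversation is one sequence and
    label_sequences[k][i] is the topic of message i in conversation k."""
    import sklearn_crfsuite  # third-party; imported lazily

    crf = sklearn_crfsuite.CRF(
        algorithm="lbfgs",
        c1=0.1,  # L1 regularization (illustrative value)
        c2=0.1,  # L2 regularization (illustrative value)
        max_iterations=100,
        all_possible_transitions=True,
    )
    X = [conversation_features(c) for c in conversations]
    crf.fit(X, label_sequences)
    return crf


feats = conversation_features(["Book the flight", "Flight booked"])
print(sorted(feats[1]))
```

After training, `crf.predict([conversation_features(new_messages)])` returns one list of predicted topic labels per conversation. The CRF learns transition weights between consecutive labels, which is how it can exploit the fact that the topic of message $n$ depends on the topic of message $n-1$.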
