Data Science Asked by Marco Frisan on October 14, 2020
I am a newbie in this field. I have been a developer for more than 20 years but have never done anything (except tutorials) with ML, DL, or NLP. I have, however, already read a number of articles and tutorials about this technology, and I am starting to figure out the steps and conditions required to make it work.
What I would like to achieve (the reason for my question) is this:
I have a file containing about 2 years of conversations between me and another person. I would like to extract sequences of messages that are related by the same topic. I mean messages that are sequential in time and that belong to the same conversation about a topic.
My goal is to extract the time we spent on every single topic.
Is there any model already trained for this task? (This is of course the obvious question 😛 )
Or is there any model that I can use as a starting point for training?
Or, if not, what would be a good approach (steps, techniques, software) to train one myself (if possible)?
Thanks
Update
Thanks to Erwan's response I did some more research. I am not sure whether Erwan's answer can be considered a definitive answer, but there is no doubt that it pointed me to a possible research direction, “removing part of the fog in front of my eyes”. Since in my case I don't have a labeled dataset to train a supervised model, I started to search for LDA solutions (as you implicitly suggested, and also based on several tutorials I found) and found some: gensim's LDA model and parallelized LDA model; lda-project; and a Java implementation, the MALLET LDA class, etc.
I also started to browse topics like Word2Vec and FastText (and family, I guess), even though I am not sure what the purpose of this software is compared to LDA, possibly because I do not have a clear and complete view of the information that LDA models can “capture” compared to the others.
However, the first step was to find software that supports the Italian language “natively”, and I found spaCy, a tool that provides lemmatization, tokenization, etc. and seems intended to prepare text for other topic-detection software (although it seems to have its own class for this, guessing by the name: TextCategorizer. I am not able to figure out how to use it because of the lack of complete documentation and examples).
So my guess is that I could use spaCy to prepare the text to be fed to gensim's LDA (or any other implementation).
For data visualization I found pyLDAvis.
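Whichever tool ends up assigning a topic to each message, the stated goal (time spent per topic) then comes down to grouping consecutive same-topic messages and summing their durations. Below is a minimal sketch of that last step using only the Python standard library. The sample messages, the topic names, the `MAX_GAP` threshold and the function names are all made up for illustration; they do not come from any of the libraries mentioned above.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical input: one (timestamp, topic label) pair per message,
# sorted by time. The labels would come from whatever model is used.
messages = [
    (datetime(2020, 10, 1, 9, 0), "holiday"),
    (datetime(2020, 10, 1, 9, 5), "holiday"),
    (datetime(2020, 10, 1, 9, 12), "holiday"),
    (datetime(2020, 10, 1, 9, 15), "work"),
    (datetime(2020, 10, 1, 9, 25), "work"),
    (datetime(2020, 10, 1, 18, 0), "holiday"),
    (datetime(2020, 10, 1, 18, 10), "holiday"),
]

# Assumption: a silence longer than this ends a conversation,
# even if the next message is about the same topic.
MAX_GAP = timedelta(hours=1)


def segment(messages, max_gap=MAX_GAP):
    """Group consecutive same-topic messages into (topic, start, end) runs."""
    segments = []
    for timestamp, topic in messages:
        if segments:
            last_topic, start, end = segments[-1]
            if topic == last_topic and timestamp - end <= max_gap:
                # Same topic, no long silence: extend the current run.
                segments[-1] = (last_topic, start, timestamp)
                continue
        # Topic changed or the gap was too long: start a new run.
        segments.append((topic, timestamp, timestamp))
    return segments


def time_per_topic(segments):
    """Sum the duration (end - start) of every run, per topic."""
    totals = defaultdict(timedelta)
    for topic, start, end in segments:
        totals[topic] += end - start
    return dict(totals)


segments = segment(messages)
totals = time_per_topic(segments)
for topic, spent in totals.items():
    print(topic, spent)
```

Note that a run consisting of a single message counts as zero time here; whether to give such runs a minimum duration is a design choice that depends on the data.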
This problem is related to the following standard problems:
I think that one can find a lot of good implementations for the first two problems, which are very common. However, adapting these to your case and/or combining them will probably be more complex.
If you have a small set of topics and you have (or can have) a reasonable sample of messages annotated with these topics, I would suggest starting with sequence labeling with Conditional Random Fields. I think this could give good results given the sequential nature of chat conversations.
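To make the sequence labeling idea a bit more concrete, here is a rough sketch of how a chat could be turned into per-message feature dictionaries, one list per conversation, with one topic label per message. As far as I know, this list-of-dicts layout is what CRF libraries such as sklearn-crfsuite accept as training input, but that library is my assumption, not something named above. The feature names, the sample chat and the labels are purely illustrative. The sketch uses only the standard library and stops before the actual training step.

```python
import re
from datetime import datetime


def tokens(text):
    """Lowercased word tokens with punctuation stripped (toy tokenizer)."""
    return re.findall(r"\w+", text.lower())


def message_features(chat, i):
    """Feature dict for message i of a chat given as (timestamp, text) pairs."""
    timestamp, text = chat[i]
    words = tokens(text)
    features = {
        "bias": 1.0,
        "n_words": len(words),
        "is_question": text.strip().endswith("?"),
    }
    # One binary feature per word in this message.
    for word in words:
        features["word=" + word] = True
    if i == 0:
        features["first_message"] = True
    else:
        prev_timestamp, prev_text = chat[i - 1]
        # A long silence before a message hints at a topic change.
        gap = timestamp - prev_timestamp
        features["gap_minutes"] = gap.total_seconds() / 60
        # Words shared with the previous message hint at topic continuity.
        features["shared_words"] = len(set(tokens(prev_text)) & set(words))
    return features


# Hypothetical annotated sample: one conversation, one label per message.
chat = [
    (datetime(2020, 10, 1, 9, 0), "Where shall we go on holiday?"),
    (datetime(2020, 10, 1, 9, 5), "Maybe a holiday in Sicily"),
    (datetime(2020, 10, 1, 9, 15), "Did you finish the report"),
]
labels = ["holiday", "holiday", "work"]

# One inner list per conversation; X and y stay aligned message by message.
X = [[message_features(chat, i) for i in range(len(chat))]]
y = [labels]
```

The time gap and the word overlap with the previous message are the features that exploit the sequential nature of the chat; for Italian text, the toy tokenizer would presumably be replaced by proper tokenization and lemmatization.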
Answered by Erwan on October 14, 2020