Data Science Asked on May 26, 2021
I have collected a lot of data that I would like to analyse and classify. Unfortunately, it is not labelled yet, so I am going to label it manually.
The dataset consists of texts in Italian, and I have not found many models that I could use as a starting point for labelling them and classifying them as True or False.
Suppose I have 30,000 texts: what percentage of them would be enough to label in order to build a model that predicts the rest?
Do you have any suggestion for a model I could build or use once I have labelled them?
As someone trying to do text classification myself, let me try to help you get started. Feel free to skip this answer if your question is only about finding models that can capture the features of the Italian language, and you already know how to do text classification in general. If not, then:
It is recommended to formulate your problem properly first, i.e. decide what exactly you are trying to solve. "...classifying them between True and False" does not give much clarity; this should be crystal clear to you before anything else.
Be precise about what kind of data you are dealing with: does "30000 of texts" mean 30,000 lines of text? Documents? Books?
Next comes exploratory data analysis (EDA), which may also include preprocessing of the data. Try to plot your data with respect to your labels to gain insight into how to approach the problem and to spot any trends or patterns you can exploit (see the sketch below).
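For example, here is a minimal EDA sketch in Python, assuming your labelled texts live in a pandas DataFrame with hypothetical "text" and "label" columns:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical DataFrame; replace with your own labelled data
df = pd.DataFrame({
    "text": ["primo testo di esempio", "secondo testo", "altro esempio ancora", "breve"],
    "label": [True, False, True, False],
})

# Class balance: a strong imbalance changes how you should evaluate the model
print(df["label"].value_counts(normalize=True))

# Text length per class: systematic differences are a useful first signal
df["n_words"] = df["text"].str.split().str.len()
df.boxplot(column="n_words", by="label")
plt.show()
```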
There are many techniques for getting features out of text data: there is feature selection, there is feature extraction, and there are various methods for doing each (see the sketch below).
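As one common feature-extraction baseline, you could turn the texts into a TF-IDF matrix; here is a minimal sketch with scikit-learn, using placeholder texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus; substitute your real Italian texts
texts = ["primo documento di esempio", "secondo documento di esempio"]

# Turn raw texts into a sparse TF-IDF matrix; unigrams plus bigrams often help
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X = vectorizer.fit_transform(texts)
print(X.shape)  # (n_documents, n_features)
```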
Then you can decide which classification models fit the data, i.e. model selection, starting from a simple baseline model if I might suggest. You can try fitting various models, compare their performance, and choose the one that gives you the best results; cross-validation, for example, helps you decide this. Model evaluation then tells you whether the model is able to generalize well to unseen data (a sketch follows below).
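For instance, a minimal model-comparison sketch with scikit-learn cross-validation, again on placeholder data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Placeholder corpus; substitute your real texts and True/False labels
texts = ["testo vero"] * 50 + ["testo falso"] * 50
y = [True] * 50 + [False] * 50
X = TfidfVectorizer().fit_transform(texts)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
}

# 5-fold cross-validation gives a more stable estimate than a single split
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```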
This brings us to your question about the percentage of data used for training and testing. Read about cross-validation and the hold-out method to understand it. Generally a 70/30 ratio for training and testing the model is a good starting point, and the training data can in turn be split into a training and a validation set (see the sketch below).
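A minimal hold-out sketch with scikit-learn, again on placeholder data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Placeholder corpus; substitute your real texts and labels
texts = ["testo vero"] * 50 + ["testo falso"] * 50
y = [True] * 50 + [False] * 50
X = TfidfVectorizer().fit_transform(texts)

# 70/30 hold-out split; stratify keeps the True/False ratio equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Optionally carve a validation set out of the training portion
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)
```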
However, here comes the famous bias-variance trade-off with respect to training size: generally, more data leads to better generalization, which avoids over-fitting, while with less data there are more chances of over-fitting. Hence you should run the models and plot a learning curve to understand whether the model is over-fitting or under-fitting (for example, if you think you have far too little data for any model to fit and predict well). See https://scikit-learn.org/stable/modules/learning_curve.html and the sketch below.
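A minimal learning-curve sketch with scikit-learn, on placeholder data:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Placeholder corpus; substitute your real texts and labels
texts = ["testo vero"] * 50 + ["testo falso"] * 50
y = [True] * 50 + [False] * 50
X = TfidfVectorizer().fit_transform(texts)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5, scoring="f1", shuffle=True, random_state=42,
)

# A persistent gap between the two curves suggests over-fitting;
# two low, converged curves suggest under-fitting
plt.plot(train_sizes, train_scores.mean(axis=1), label="training score")
plt.plot(train_sizes, val_scores.mean(axis=1), label="validation score")
plt.xlabel("training set size")
plt.ylabel("F1")
plt.legend()
plt.show()
```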
With the selected model you can start predicting results on the rest of your data.
Once you have the best-fitting model, you can tune its hyperparameters to get even better performance (a sketch follows below).
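A minimal hyperparameter-tuning sketch with scikit-learn's GridSearchCV, again on placeholder data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Placeholder corpus; substitute your real training texts and labels
texts = ["testo vero"] * 50 + ["testo falso"] * 50
y = [True] * 50 + [False] * 50
X = TfidfVectorizer().fit_transform(texts)

# Search over the regularization strength C with 5-fold cross-validation
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="f1"
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```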
These are just rough pointers to get you going. The links below cover text classification with scikit-learn and NLTK, and might be a good starting point if you haven't done this before:
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
https://www.nltk.org/book/ch06.html
Answered by BlackCurrant on May 26, 2021