Generalise a CNN model for Question vs Spam: techniques to generalise the model apart from more data

Data Science Asked on January 16, 2021

For my problem, every image that is not a picture of a question on paper (either from a textbook or handwritten) is spam. In other words, every image in the world is spam for my purposes except pictures of textbook or handwritten questions. I used ResNet50 with 47,000 question images and 43,000 spam images. For the spam class, I used the COCO 2017 test set with 42K images. The model gave excellent training and validation metrics: precision, recall, accuracy, and F1 were all greater than 0.99. But on new images it performed poorly, with an F1 score of only 0.6.
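For reference, the four metrics quoted above can be computed from predictions as in the minimal sketch below. The helper name `binary_metrics` and the toy labels are hypothetical, chosen only for illustration; they are not my actual data or evaluation code. Running the same function on both the validation split and a batch of genuinely new images is one way to make the gap between 0.99 and 0.6 explicit.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return precision, recall, accuracy and F1 for a binary task.

    y_true and y_pred are equal-length sequences of class labels.
    `positive` is the label treated as the positive class
    (assumed here: 1 = question, 0 = spam).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against empty denominators so the function never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Toy labels (hypothetical, for illustration only): 1 = question, 0 = spam.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```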

What should I do to make the model generalise, apart from using data augmentation and more data? And if I have to use more data, how much would be sufficient? I resized my images to 224×224. Do I need to use larger images? Data collection is not a problem, as I have millions of images, and for spam I can use the Google Public dataset of more than 70 million images. But that would cost me a huge amount of computational power.

What are the other methods that I can try out?

cnn, deep-learning, image-classification, machine-learning, neural-network
