Data Science Asked by Vidhya Shankar on August 1, 2021
Before I present my problem, please note that I am a newbie in deep learning and am trying things for the first time. Most of my code/logic was adapted from various references on the internet.
Goal: Build an LSTM/CNN model to classify the IMDB reviews available in TensorFlow Datasets.
Approach 1: LSTM-based. Train data: 45,000 (10% validation split), test data: 5,000. Accuracy > 95%, validation accuracy > 85%. GloVe embeddings of size 100 were used.
Approach 2: CNN model.
a) Train data: 45,000, test data: 5,000
b) Train data: 50%, test data: 50%
Both configurations reach accuracy > 95% and validation accuracy > 85%.
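For reference, a minimal sketch of how the data can be loaded so that the test split stays completely out of training (the slicing and variable names are illustrative assumptions; the official TensorFlow Datasets IMDB splits are 25,000 train / 25,000 test reviews):

import tensorflow_datasets as tfds

# 90% of each official split goes to training (45,000 reviews); the remaining
# 10% (5,000 reviews) is held out as test data and never touched during training.
train_ds, test_ds = tfds.load(
    'imdb_reviews',
    split=['train[:90%]+test[:90%]', 'train[90%:]+test[90%:]'],
    as_supervised=True,
)

train_texts = [text.numpy().decode('utf-8') for text, label in train_ds]
train_labels = [int(label.numpy()) for text, label in train_ds]
test_texts = [text.numpy().decode('utf-8') for text, label in test_ds]
test_labels = [int(label.numpy()) for text, label in test_ds]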
Problem: Test-data accuracy doesn't go beyond 52% with either approach. Most of the code/references available out there use the test data during training; in my case, the test data wasn't part of training.
Methodologies tried to increase test accuracy:
My guess is that there isn't enough training data. I need help on how to increase the test-data accuracy.
"Most of the code/references available out there use test_data during training. test_data wasn't part of my training."

That is the right way to do it, but preprocessing steps such as encoding must still be done holistically across both sets.
In your case, you have called pre_process separately for the train and test sets, so the words are converted to numbers independently. This should not happen.
tokenizer.texts_to_sequences(test)
The tokenizer above should be the one that was fit on the train data.
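A minimal sketch of the fix, assuming the raw reviews are available as text lists such as train_texts and test_texts (vocabulary size and sequence length are illustrative assumptions):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_words = 20000  # illustrative vocabulary size
max_len = 200      # illustrative sequence length

# Fit the tokenizer on the training texts only ...
train_tokn = Tokenizer(num_words=max_words, oov_token='<OOV>')
train_tokn.fit_on_texts(train_texts)

# ... then reuse that same tokenizer for both splits, so every word maps to
# the same index in train and test.
X_train = pad_sequences(train_tokn.texts_to_sequences(train_texts), maxlen=max_len)
X_test = pad_sequences(train_tokn.texts_to_sequences(test_texts), maxlen=max_len)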
If I print the word at index 101 from your train and test tokenizers, this is the result:
print(train_tokn.index_word[101])  # think
print(test_tokn.index_word[101])   # characters
I think you should use train_tokn for the test data as well, and the test accuracy should improve. I believe a very simple LSTM can achieve 85% on this dataset.
Or, manually embed both the train and test data using the GloVe embeddings.
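For the GloVe route, a rough sketch of building one embedding matrix from the train-fitted tokenizer (the file path glove.6B.100d.txt and the sizes are assumptions; 100 matches the embedding dimension mentioned in the question):

import numpy as np

embedding_dim = 100
glove_path = 'glove.6B.100d.txt'  # assumed local path to the pre-trained vectors

# Read the GloVe vectors into a word -> vector dictionary.
embeddings_index = {}
with open(glove_path, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Build a single embedding matrix from the TRAIN-fitted tokenizer; test words
# that never appeared in training simply keep a zero row / the OOV index.
num_words = min(max_words, len(train_tokn.word_index) + 1)
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in train_tokn.word_index.items():
    if i < num_words and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]

The matrix can then be passed to the Embedding layer via weights=[embedding_matrix] (typically with trainable=False), so train and test sentences are embedded by exactly the same vectors.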
A simple example of the issue:
from keras.preprocessing.text import Tokenizer

train = ['I am sorry']
test = ['I am very sorry']
max_words = 10

# Tokenizer fitted on the train text
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(train)
print(tokenizer.index_word)  # {1: 'i', 2: 'am', 3: 'sorry'}

# A second tokenizer fitted on the test text produces a different mapping:
# 'sorry' is now index 4 instead of 3
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(test)
print(tokenizer.index_word)  # {1: 'i', 2: 'am', 3: 'very', 4: 'sorry'}
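For contrast, a sketch of the fixed version of the same toy example: fit once on the train text and reuse that tokenizer for the test text (the oov_token is an extra assumption here, so unseen words map to a reserved index instead of being dropped):

tokenizer = Tokenizer(num_words=max_words, oov_token='<OOV>')
tokenizer.fit_on_texts(train)
print(tokenizer.texts_to_sequences(train))  # [[2, 3, 4]]
print(tokenizer.texts_to_sequences(test))   # [[2, 3, 1, 4]]: 'very' falls back to the <OOV> index 1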
Correct answer by 10xAI on August 1, 2021