Data Science Asked by Carl Molnar on December 3, 2020
How can I train an NLP model with discrete labels that is based on multiple text input features?
I’m trying to predict the difficulty of a 4-option multiple-choice exam question (the probability of a test-taker selecting the correct response) based on the text of the question along with its possible responses. I’m hoping to account for how incorrect yet convincing responses, whose subject matter is closely related to the content of the question, can skew the question’s difficulty, as well as how the wording of the question can make it misleading.
My intuition is that the content of the question and of its responses are both significant in predicting its difficulty. However, when using a library like spaCy, NLTK, or Textacy, training seems to be done on only one text column at a time. I’m looking at potentially five text columns at once, or two if I concatenate the question responses together.
I haven’t been able to find much on the topic, but here is an attempt I found. I think that attempt is flawed because it just trains on single columns multiple times: training City against a salary value and concatenating that with a model trained on Job Description against the salary value won’t meaningfully improve the first model, since Job Description never depended on City during training.
The options I’ve found are:
Thoughts and advice? Is there a library that can make this easier? Thanks!
Concatenating the whole question and its answers in an RNN could be an option to try, but always use a reserved special token (or several) to mark where each answer starts. E.g. you could concatenate like:
Question text <1> answer 1 <2> answer 2 <3> answer 3 <4> answer 4
where <1>, <2>, ... are the special tokens, so that the model, given enough examples, may be able to learn their meaning.
The "enough examples" part is worth stressing, especially as this type of model may require large complexity (in terms of number of parameters) to work, and hence you will need a large enough dataset, possibly on the order of 10k examples or larger. You can also try some data augmentation by shuffling the order of the answers in the input and, of course, updating the correct-answer label accordingly.
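As a minimal sketch of this idea (the <i> separator strings and the helper names below are illustrative, not from any particular library):

import random

def build_input(question, answers):
    # Join the question and answers with reserved separator tokens <1>, <2>, ...
    parts = [question]
    for i, answer in enumerate(answers, start=1):
        parts.append(f"<{i}> {answer}")
    return " ".join(parts)

def augmented(question, answers, correct_idx, n_variants=3):
    # Shuffle answer order and remap the correct-answer index to match
    for _ in range(n_variants):
        order = random.sample(range(len(answers)), len(answers))
        shuffled = [answers[j] for j in order]
        yield build_input(question, shuffled), order.index(correct_idx)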
Answered by Escachator on December 3, 2020
NLP is quite high-dimensional. I'd go the data-driven way and use a pretrained embedder. Nowadays there are a few to choose from, like LASER from Facebook. There's an unofficial PyPI lib, but it works just fine. If you want to reach scores comparable to the literature, there's no point in doing NLP by hand. Embedders usually cover dozens of languages, so you can feed in training data in any language you want. Your models will also work for those languages out of the box, even if you trained them on other languages. If you need something more custom, you could pick BERT from Google, though you'll have to push it further yourself; it isn't really pretrained that much.
You can try to encode the question and each of the answers separately and build an ensemble on top. You could also try to cram it all into the encoder; it should do just fine.
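A minimal sketch of the encode-separately idea, assuming the unofficial laserembeddings package (its Laser().embed_sentences API is an assumption here) and scikit-learn for the downstream regressor:

import numpy as np
from laserembeddings import Laser  # unofficial LASER wrapper; API assumed
from sklearn.ensemble import GradientBoostingRegressor

laser = Laser()

def featurize(question, answers):
    # Embed the question and each answer separately, then concatenate
    # the row vectors into one flat feature vector per exam question
    vectors = laser.embed_sentences([question] + answers, lang="en")
    return np.concatenate(vectors)

# questions_answers: list of (question, [answer1, ..., answer4]) pairs
# difficulties: observed proportion of test-takers answering correctly
# X = np.stack([featurize(q, a) for q, a in questions_answers])
# model = GradientBoostingRegressor().fit(X, difficulties)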
Answered by Piotr Rarus - Reinstate Monica on December 3, 2020
For your training, it looks like you want to combine features that do not necessarily refer to the same context. In essence, you have a questions feature and a responses feature. Those two happen to be text data, but they could just as well be numerical, like age, or categorical, like gender. For this reason you may need a model able to combine different inputs, like a multi-layer network with separate input branches.
You will have one LSTM working on your questions features and another LSTM working on your responses features. At a second stage, you combine the outputs of the two LSTMs in a fully connected layer before passing the result to your final activation.
Specifically, the steps include preprocessing the questions and the responses separately (tokenizing and padding each to a fixed length); this is needed to ensure that the batch sizes are the same.

Questions_INPUTS ------> LSTM-1 -------->
                                         |---> MERGE ---> SIGMOID
Responses_INPUTS ------> LSTM-2 -------->
I have developed similar combined-model systems in PyTorch. In this respect, the following appendix may be of help.
Utils: sklearn.model_selection.train_test_split(*arrays, **options) and torch.utils.data
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader

# train/test data split (splitting all arrays together keeps rows aligned)
questions_train, questions_test, responses_train, responses_test, train_y, test_y = train_test_split(
    questions, responses, y, train_size=0.666, random_state=666)

# create tensor dataset
train_data = TensorDataset(
    torch.from_numpy(questions_train), torch.from_numpy(responses_train),
    torch.from_numpy(np.array(train_y))
)

# create dataloaders
train_loader = DataLoader(
    train_data, shuffle=True,
    batch_size=666, drop_last=True)
def forward(self, questions_batch, responses_batch):
    # Handle questions features
    X1 = self.embedding(questions_batch)  # I assume you deal with text
    # X1.shape = (batch_size, questions_len, embed_dim)
    output, (h, c) = self.lstm1(X1, self.hidden1)
    X1 = h[-1]  # I assume a unidirectional LSTM to keep the example simple
    # X1.shape = (batch_size, hidden_dim1)

    # Handle responses features
    X2 = self.embedding(responses_batch)  # I assume you deal with text
    # X2.shape = (batch_size, responses_len, embed_dim)
    output, (h, c) = self.lstm2(X2, self.hidden2)
    X2 = h[-1]  # I assume a unidirectional LSTM to keep the example simple
    # X2.shape = (batch_size, hidden_dim2)

    # Merge features
    X = torch.cat([X1, X2], dim=1)
    # X.shape = (batch_size, hidden_dim1 + hidden_dim2)
    X = self.fc(X)  # 1 or more fully connected linear layers
    # X.shape = (batch_size, output_dim_fc1)

    # Last layers to get the right output for your loss
    ...
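For completeness, here is a hedged sketch of a matching constructor; the class name, layer sizes, and the final sigmoid head are illustrative assumptions, not taken from the code above:

import torch.nn as nn

class DifficultyModel(nn.Module):
    # Illustrative module whose layers match the forward pass sketched above
    def __init__(self, vocab_size, embed_dim, hidden_dim1, hidden_dim2, output_dim_fc1=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm1 = nn.LSTM(embed_dim, hidden_dim1, batch_first=True)  # questions branch
        self.lstm2 = nn.LSTM(embed_dim, hidden_dim2, batch_first=True)  # responses branch
        self.fc = nn.Linear(hidden_dim1 + hidden_dim2, output_dim_fc1)
        self.activation = nn.Sigmoid()  # difficulty as a probability in [0, 1]
        # self.hidden1 / self.hidden2 (initial LSTM states) are assumed to be
        # set per batch elsewhere, e.g. in an init_hidden() helper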
See relevant answers and discussion here:
Answered by hH1sG0n3 on December 3, 2020