Data Science Asked by Carl Molnar on December 3, 2020
How can I train an NLP model with discrete labels that is based on multiple text input features?
I’m trying to predict the difficulty of a 4-option multiple-choice exam question (the probability of a test-taker selecting the correct response) based on the text of the question along with its possible responses. I’m hoping to account for how incorrect yet convincing responses, whose subject matter is closely related to the content of the question, can skew the question’s difficulty, as well as how the wording of the question can make it misleading.
My intuition is that the content of the question and of its responses are both significant in predicting its difficulty. However, when using a library like spaCy, NLTK, or Textacy, training seems to be done on only one text column at a time. I’m looking at potentially five text columns at once, or two if I concatenate the question responses together.
I haven’t been able to find much on the topic, but here is an attempt I found. I think that attempt is flawed because it just trains on single columns multiple times: training City against a salary value and concatenating that with a model trained on Job Description against the salary value won’t meaningfully improve the first model, since Job Description never depended on City during training.
The options I’ve found are:
Thoughts and advice? Is there a library that can make this easier? Thanks!
Concatenating the whole question and its answers in an RNN could be an option to try, but always use a reserved special token (or several) to mark where each answer starts. E.g. you could concatenate like:
Question text <1> answer 1 <2> answer 2 <3> answer 3 <4> answer 4
where <1>, <2>, ... are the special tokens, so that the model, given enough examples, may be able to learn their meaning.
The "enough examples" part is worth stressing, especially as this type of model may require large complexity (in terms of number of parameters) to work, and hence you will need a large enough dataset, possibly on the order of 10k examples or larger. You can also try some data augmentation by shuffling the order of the answers in the input and, of course, updating the correct-answer label accordingly.
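As a minimal sketch of this idea (the <i> separator strings and the helper names below are illustrative, not from any particular library):

import random

def build_input(question, answers):
    # Join the question and answers with reserved separator tokens <1>, <2>, ...
    parts = [question]
    for i, answer in enumerate(answers, start=1):
        parts.append(f"<{i}> {answer}")
    return " ".join(parts)

def augmented(question, answers, correct_idx, n_variants=3):
    # Shuffle answer order and remap the correct-answer index to match
    for _ in range(n_variants):
        order = random.sample(range(len(answers)), len(answers))
        shuffled = [answers[j] for j in order]
        yield build_input(question, shuffled), order.index(correct_idx)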
Answered by Escachator on December 3, 2020
NLP is quite high-dimensional. I'd go the data-driven way and use a pretrained embedder. Nowadays there are a few to choose from, like LASER from Facebook. There's an unofficial PyPI lib, but it works just fine. If you want to reach scores comparable to the literature, there's no point in doing NLP by hand. Embedders usually cover dozens of languages, so you can feed in training data in any language you want. Your models will also work for those languages out of the box, even if you trained them on other languages. If you need something more custom, you could pick BERT from Google, though you'll have to push it further yourself; it isn't really pretrained that much.
You can try to encode the question and each of the answers separately and build an ensemble on top. You could also try to cram it all into the encoder; it should do just fine.
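A minimal sketch of the encode-separately idea, assuming the unofficial laserembeddings package (its Laser().embed_sentences API is an assumption here) and scikit-learn for the downstream regressor:

import numpy as np
from laserembeddings import Laser  # unofficial LASER wrapper; API assumed
from sklearn.ensemble import GradientBoostingRegressor

laser = Laser()

def featurize(question, answers):
    # Embed the question and each answer separately, then concatenate
    # the row vectors into one flat feature vector per exam question
    vectors = laser.embed_sentences([question] + answers, lang="en")
    return np.concatenate(vectors)

# questions_answers: list of (question, [answer1, ..., answer4]) pairs
# difficulties: observed proportion of test-takers answering correctly
# X = np.stack([featurize(q, a) for q, a in questions_answers])
# model = GradientBoostingRegressor().fit(X, difficulties)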
Answered by Piotr Rarus - Reinstate Monica on December 3, 2020
For your training, it looks like you want to combine features that do not necessarily refer to the same context. In essence, you have a questions feature and a responses feature. Those two happen to be text data, but they could just as well be numerical, like age, or categorical, like gender. For this reason you may need a model able to combine different inputs, like a multi-layer network with separate input branches.
You will have one LSTM working on your questions features and another LSTM working on your responses features. At a second stage, you combine the outputs of the two LSTMs in a fully connected layer before passing the result to your final activation.
Specifically, the steps include preprocessing the questions and the responses separately (tokenizing and padding each to a fixed length); this is needed to ensure that the batch sizes are the same.

Questions_INPUTS ------> LSTM-1 -------->
                                         |---> MERGE ---> SIGMOID
Responses_INPUTS ------> LSTM-2 -------->
I have developed similar combined-model systems in PyTorch. In this respect, the following appendix may be of help.
Utils: sklearn.model_selection.train_test_split(*arrays, **options) and torch.utils.data
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader

# train/test data split (splitting all arrays together keeps rows aligned)
questions_train, questions_test, responses_train, responses_test, train_y, test_y = train_test_split(
    questions, responses, y, train_size=0.666, random_state=666)

# create tensor dataset
train_data = TensorDataset(
    torch.from_numpy(questions_train), torch.from_numpy(responses_train),
    torch.from_numpy(np.array(train_y))
)

# create dataloaders
train_loader = DataLoader(
    train_data, shuffle=True,
    batch_size=666, drop_last=True)
def forward(self, questions_batch, responses_batch):
    # Handle questions features
    X1 = self.embedding(questions_batch)  # I assume you deal with text
    # X1.shape = (batch_size, questions_len, embed_dim)
    output, (h, c) = self.lstm1(X1, self.hidden1)
    X1 = h[-1]  # I assume a unidirectional LSTM to keep the example simple
    # X1.shape = (batch_size, hidden_dim1)

    # Handle responses features
    X2 = self.embedding(responses_batch)  # I assume you deal with text
    # X2.shape = (batch_size, responses_len, embed_dim)
    output, (h, c) = self.lstm2(X2, self.hidden2)
    X2 = h[-1]  # I assume a unidirectional LSTM to keep the example simple
    # X2.shape = (batch_size, hidden_dim2)

    # Merge features
    X = torch.cat([X1, X2], dim=1)
    # X.shape = (batch_size, hidden_dim1 + hidden_dim2)
    X = self.fc(X)  # 1 or more fully connected linear layers
    # X.shape = (batch_size, output_dim_fc1)

    # Last layers to get the right output for your loss
    ...
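For completeness, here is a hedged sketch of a matching constructor; the class name, layer sizes, and the final sigmoid head are illustrative assumptions, not taken from the code above:

import torch.nn as nn

class DifficultyModel(nn.Module):
    # Illustrative module whose layers match the forward pass sketched above
    def __init__(self, vocab_size, embed_dim, hidden_dim1, hidden_dim2, output_dim_fc1=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm1 = nn.LSTM(embed_dim, hidden_dim1, batch_first=True)  # questions branch
        self.lstm2 = nn.LSTM(embed_dim, hidden_dim2, batch_first=True)  # responses branch
        self.fc = nn.Linear(hidden_dim1 + hidden_dim2, output_dim_fc1)
        self.activation = nn.Sigmoid()  # difficulty as a probability in [0, 1]
        # self.hidden1 / self.hidden2 (initial LSTM states) are assumed to be
        # set per batch elsewhere, e.g. in an init_hidden() helper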
See relevant answers and discussion here:
Answered by hH1sG0n3 on December 3, 2020