Why are teacher-forcing and non-teacher-forcing test losses negatively correlated?

Question

I'm trying to train a recurrent neural network (more specifically an LSTM) to learn to predict a certain time series. The time series has 10 variables with datapoints every 5 minutes and I have data over a period of 50 years.
I initially trained the network with teacher forcing, using the following step loss function (in PyTorch):
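import torch.nn as nn

mse = nn.MSELoss()  # assumed definition; the original snippet does not show it

def train_step_tf(input_seq, target_seq, model):
    # Teacher forcing: the model sees the true input at every time step.
    output, hidden = model(input_seq)
    loss = mse(output, target_seq)
    return loss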

For all time steps $i \in \{1, \ldots, \text{len(input_seq)}\}$, this feeds data $x_i$ into the model and computes the MSE loss between $\hat x_{i+1}$ and $x_{i+1}$.
Unfortunately, the prediction performance (feeding in a 100-step primer and then predicting the next 100 steps) was very poor. I assumed this was because, when training with teacher forcing, the network only ever sees true time series data as input and gets confused when it sees its own slightly atypical generated data.
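As a rough sketch of the evaluation described above (a primer of true data followed by free-running prediction), the loop could look like the following. This is an illustration in plain Python, not the code used for the experiments: `step` is a hypothetical one-step predictor standing in for the LSTM, and values are scalars here rather than 10-variable vectors.

```python
def free_running_forecast(step, primer, n_steps):
    """Warm up on true data, then feed each prediction back in as the next input.

    `step(x, state)` is a hypothetical one-step predictor that returns
    (prediction of the next value, new state); it stands in for the model.
    """
    if not primer or n_steps < 1:
        raise ValueError("need a non-empty primer and n_steps >= 1")
    state = None
    for x in primer:                    # primer phase: inputs are true data
        pred, state = step(x, state)
    forecast = [pred]                   # first prediction after the primer
    while len(forecast) < n_steps:      # free-running phase: inputs are predictions
        pred, state = step(pred, state)
        forecast.append(pred)
    return forecast


def mean_squared_error(forecast, targets):
    """Average squared difference between a forecast and the true continuation."""
    if len(forecast) != len(targets):
        raise ValueError("forecast and targets must have the same length")
    return sum((f - t) ** 2 for f, t in zip(forecast, targets)) / len(forecast)
```

With a 100-step primer and `n_steps=100`, `mean_squared_error` applied to the forecast and the true next 100 steps corresponds to the non-teacher-forcing loss for that window.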
I then trained it without teacher forcing using the following step loss:
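def train_step_ntf(input_seq, target_seq, model, primer_length):
    # Warm up the hidden state on true data; no loss is computed on the primer.
    primer_inp = input_seq[:, :primer_length, :]
    _, hidden = model(primer_inp)
    # Assumes hidden is a single tensor; an nn.LSTM (h, c) tuple needs each element detached.
    hidden = hidden.detach()

    train_inp = input_seq[:, primer_length:, :]
    train_targ = target_seq[:, primer_length:, :]

    # The first input is true data; afterwards the model is fed its own predictions.
    inp = train_inp[:, 0, :].unsqueeze(1)
    loss = 0

    for c in range(train_inp.shape[1]):
        output, hidden = model(inp, hidden)
        loss += mse(output, train_targ[:, c, :].unsqueeze(1))
        inp = output.detach()  # no gradient flows through the fed-back input

    return loss / train_inp.shape[1]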

Unfortunately the predictive performance is still very low.
I then trained various network configurations with and without teacher forcing (at train time), regularly logging both the tf and non-tf test-set losses.
As I understand it, training with teacher forcing should be faster, but not using teacher forcing could allow for better predictive performance. Overall, however, both losses should be equivalent in the sense that when the teacher-forcing loss goes to zero (meaning the network can predict the next step 100% accurately), the non-tf loss should go to zero as well (predicting the next 100 steps all at once), and vice versa.
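To make the two quantities in this argument concrete, here is one way to write them down. The notation is an assumption added for illustration: $f_\theta$ denotes the network, $p$ the primer length, $T$ the number of scored steps, and $\tilde x$ a prediction that is fed back in as input. The normalization may differ from the PyTorch MSE, which also averages over batch and variables.

$$
\mathcal{L}_{\text{tf}} = \frac{1}{T} \sum_{i=1}^{T} \lVert \hat x_{i+1} - x_{i+1} \rVert^2, \qquad \hat x_{i+1} = f_\theta(x_1, \ldots, x_i)
$$

$$
\mathcal{L}_{\text{ntf}} = \frac{1}{T} \sum_{i=p}^{p+T-1} \lVert \tilde x_{i+1} - x_{i+1} \rVert^2, \qquad \tilde x_{i+1} = f_\theta(x_1, \ldots, x_p, \tilde x_{p+1}, \ldots, \tilde x_i)
$$

If $\hat x_{i+1} = x_{i+1}$ for every $i$ (zero teacher-forcing loss on the whole sequence), then $\tilde x_{p+1} = x_{p+1}$, and by induction every later $\tilde x_{i+1}$ equals $x_{i+1}$, so $\mathcal{L}_{\text{ntf}} = 0$ as well. Note that this argument only covers the exact-zero case.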
Unfortunately, when I plot the final tf test loss against the final non-tf test loss for each run, the two appear to be negatively correlated:

[Figure: scatter plot of final tf test loss vs. final non-tf test loss, one point per run]

Color indicates whether I used teacher forcing at train time. Of course, when training with teacher forcing, the teacher-forcing test loss is generally better. But strangely, the better a run's teacher-forcing test loss was, the worse its non-tf test loss was. Can anyone explain why that is? Thanks
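One way to put a number on the trend in the plot is a Pearson correlation coefficient over the per-run final losses. The sketch below is plain Python; the function name and the idea of passing two lists with one entry per run are assumptions for illustration, not part of the original experiments.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 entries")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)
```

Calling `pearson(tf_losses, ntf_losses)` with one entry per run returns a value between -1 and 1. Computing it separately for the runs trained with and without teacher forcing would show whether the negative trend holds within each group or only across the two groups.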
