Artificial Intelligence Asked on December 5, 2020
I’m currently trying to train a BART model, which is a denoising sequence-to-sequence Transformer created by Facebook researchers. Here’s my Transformer code:
import math

import torch
from torch import nn

from Constants import *


class Transformer(nn.Module):
    def __init__(self, input_dim: int, output_dim: int, d_model: int = 200, num_head: int = 8, num_e_layer: int = 6,
                 num_d_layer: int = 6, ff_dim: int = 1024, drop_out: float = 0.1):
        '''
        Args:
            input_dim: Size of the vocab of the input
            output_dim: Size of the vocab for the output
            num_head: Number of heads in the multi-head attention layers
            num_e_layer: Number of sub-encoder layers
            num_d_layer: Number of sub-decoder layers
            ff_dim: Dimension of the feedforward network in the encoder/decoder layers
            d_model: The dimension to embed input and output features into
            drop_out: The dropout probability
        '''
        super(Transformer, self).__init__()
        self.d_model = d_model
        self.transformer = nn.Transformer(d_model, num_head, num_e_layer, num_d_layer, ff_dim, drop_out,
                                          activation='gelu')
        self.decoder_embedder = nn.Embedding(output_dim, d_model)
        self.encoder_embedder = nn.Embedding(input_dim, d_model)
        self.fc1 = nn.Linear(d_model, output_dim)
        self.softmax = nn.Softmax(dim=2)
        self.positional_encoder = PositionalEncoding(d_model, drop_out)
        self.to(DEVICE)

    def forward(self, src: torch.Tensor, trg: torch.Tensor, src_mask: torch.Tensor = None,
                trg_mask: torch.Tensor = None):
        embedded_src = self.positional_encoder(self.encoder_embedder(src) * math.sqrt(self.d_model))
        embedded_trg = self.positional_encoder(self.decoder_embedder(trg) * math.sqrt(self.d_model))
        output = self.transformer.forward(embedded_src, embedded_trg, src_mask, trg_mask)
        return self.softmax(self.fc1(output))
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Standard forward from the PyTorch tutorial implementation; the model
        # above calls the positional encoder, so this method is needed.
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)
And here’s my training code:
def train(x: list):
    optimizer.zero_grad()
    loss = 0.
    batch_sz = len(x)
    max_len = len(max(x, key=len)) + 1  # +1 for EOS xor SOS

    noise_x = noise(x)
    src_x = list(map(lambda s: [SOS] + [char for char in s] + [PAD] * ((max_len - len(s)) - 1), noise_x))
    trg_x = list(map(lambda s: [char for char in s] + [EOS] + [PAD] * ((max_len - len(s)) - 1), x))

    src = indexTensor(src_x, max_len, IN_CHARS).to(DEVICE)
    trg = targetsTensor(trg_x, max_len, OUT_CHARS).to(DEVICE)
    names = [''] * batch_sz

    for i in range(src.shape[0]):
        probs = transformer(src, trg[:i + 1])
        loss += criterion(probs, trg[i])

    loss.backward()
    optimizer.step()

    return names, loss.item()
As you can see in the training code, I am training it “sequentially”: I input the first character, compute the loss against the output, then input the first and second characters and do the same, and so on.
This doesn’t seem to be training properly, though, as the denoising is totally off, so I thought maybe there’s something wrong with my code, or maybe you simply can’t train Transformers this way.
I’m taking first-name data, noising it, and then training the Transformer to denoise it, but the output of the Transformer doesn’t look remotely like the denoised version, or even the noised version, of the name. I built a denoising autoencoder using LSTMs and it did much better, but I feel like BART should be outperforming LSTMs by a wide margin, since it’s supposedly a state-of-the-art NLP model.
I assume you're referring to this paper, which basically "reuses" the same architecture as the original "Attention Is All You Need" paper for machine translation; the difference is that the source and target are the noisy and original sentences, respectively.
I can't follow the for-loop logic inside your train function, but reading your comment: you don't have to traverse the target sequentially. Instead, you rely on masks, which exclude the padding and, most importantly, implement teacher forcing for the decoder stack. Check this post and how it creates the three types of masks needed (src, target, no_peek).
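To make this concrete, here is a minimal sketch of what a single teacher-forced training step could look like with the Transformer class above. The helper names (make_no_peek_mask, train_step) are mine, and I'm assuming that criterion is nn.CrossEntropyLoss (ideally with ignore_index set to your PAD index) and that transformer, optimizer and DEVICE are the same globals used in your training code:

import torch

def make_no_peek_mask(trg_len: int) -> torch.Tensor:
    # Float mask with -inf strictly above the diagonal: decoder position i can
    # only attend to positions <= i. This replaces the prefix loop and is
    # equivalent to nn.Transformer.generate_square_subsequent_mask.
    return torch.triu(torch.full((trg_len, trg_len), float('-inf')), diagonal=1).to(DEVICE)

def train_step(src: torch.Tensor, trg_in: torch.Tensor, trg_out: torch.Tensor) -> float:
    # src:     (src_len, batch)  indices of the noised names (encoder input)
    # trg_in:  (trg_len, batch)  SOS + clean name            (decoder input)
    # trg_out: (trg_len, batch)  clean name + EOS            (labels, shifted by one)
    optimizer.zero_grad()
    no_peek = make_no_peek_mask(trg_in.size(0))
    probs = transformer(src, trg_in, trg_mask=no_peek)  # one forward pass over the whole target
    # Note: nn.CrossEntropyLoss expects raw logits, so with that criterion you
    # would return self.fc1(output) from the model rather than softmaxing it.
    loss = criterion(probs.reshape(-1, probs.size(-1)), trg_out.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()

Masking the padding inside the attention itself (the src/tgt key_padding_mask arguments of nn.Transformer) would additionally require passing those masks through your model's forward.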
Having said that, you will need a for loop when serving your model (inference), because you will have to decode autoregressively: you decode one token, feed it back to the decoder to generate the next token, and repeat until you reach EOS. Check the "Testing the model" section in this article.
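For example, greedy decoding with the model above might look roughly like this (greedy_decode is a hypothetical helper; SOS_IDX and EOS_IDX stand for the integer indices of your SOS and EOS tokens):

import torch

@torch.no_grad()
def greedy_decode(src: torch.Tensor, max_len: int = 50) -> list:
    # src: (src_len, 1) a single noised name as a column of character indices
    transformer.eval()
    ys = torch.tensor([[SOS_IDX]], device=DEVICE)  # (1, 1), start with SOS
    for _ in range(max_len):
        probs = transformer(src, ys)               # (cur_len, 1, vocab)
        next_token = probs[-1, 0].argmax().item()  # most probable next character
        ys = torch.cat([ys, torch.tensor([[next_token]], device=DEVICE)], dim=0)
        if next_token == EOS_IDX:
            break
    return ys.squeeze(1).tolist()                  # decoded indices, including SOS/EOS

For consistency with the training-time masking, you can also pass the same no_peek mask for the generated prefix at each step.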
Answered by olix20 on December 5, 2020