Data Science Asked by user99347 on January 13, 2021
For example, suppose we translate an English sentence A into a French sentence B.
During training, when predicting the i-th word of B, all the words of B before position i are fed to the decoder, so the decoder input length changes with i. How is this handled so that it fits a fixed dimension in the final linear layer during TRAINING?
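To get the effect of feeding the words one by one, we do not run the transformer once per word. We pass the whole target sentence to the network in a single pass, together with a mask. The mask does the job: at position i it hides every later word, so each position sees exactly one more word than the position before it. The final linear layer is applied to each position separately, on a vector of fixed size, so its dimension does not depend on the sentence length.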
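As a rough illustration of the mask described above, here is a minimal sketch in plain Python. The helper names `causal_mask` and `masked_softmax` are made up for this example and do not come from any library, and the attention scores are set to 0.0 only to make the result easy to read. It shows that row i of the attention weights only covers positions up to i, even though all rows are computed from the same full-length input.

```python
import math

def causal_mask(n):
    """Return an n x n mask: True where position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over scores; masked-out entries get zero weight."""
    weights = []
    for row, allowed in zip(scores, mask):
        exps = [math.exp(s) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)  # never zero: the diagonal entry is always allowed
        weights.append([e / total for e in exps])
    return weights

# Toy example (assumption): 4 target tokens, all raw attention scores equal.
n = 4
scores = [[0.0] * n for _ in range(n)]
weights = masked_softmax(scores, causal_mask(n))

for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row has the same length (4), but the entries to the right of the diagonal are zero, so position i effectively receives only the words up to i. Real implementations usually add a large negative value to the masked scores before the softmax instead of zeroing the exponentials, which has the same effect.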
Answered by SrJ on January 13, 2021