
Fine-tuning an automatic speech recognition model with my own dataset

Data Science question, asked by DaveBelle on January 26, 2021

I’m using wav2letter to develop a speech-to-text system.

wav2letter has pre-built acoustic and language models, which is great; however, the audio I'm transcribing is quite different from anything the model has previously observed.

The audio predominantly consists of people talking on CB radios using code words and lots of abbreviations. As a result, the pre-built models just don't cut it: they miss certain phrases that are really key (and there are also new words the model needs to learn). So I'm trying to fine-tune one of their pre-built models with my own dataset. I've got about 4 hours of audio which I can use to fine-tune the model (this could be insufficient, I know).
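For context, this is roughly how I'm preparing the fine-tuning set. The sketch below assumes the list-file layout described in the wav2letter tutorials (sample id, audio path, duration in milliseconds, transcript); the directory paths and the `transcripts.csv` layout are just placeholders for my own data.

```python
import csv
import os
import soundfile as sf

# Hypothetical layout: transcripts.csv contains rows of (audio_filename, transcript).
AUDIO_DIR = "data/cb_radio/wav"
TRANSCRIPTS = "data/cb_radio/transcripts.csv"
LIST_FILE = "data/cb_radio/train.lst"

with open(TRANSCRIPTS, newline="") as f_in, open(LIST_FILE, "w") as f_out:
    for i, (filename, transcript) in enumerate(csv.reader(f_in)):
        path = os.path.join(AUDIO_DIR, filename)
        info = sf.info(path)
        duration_ms = info.frames / info.samplerate * 1000.0
        # One sample per line: id, audio path, duration (ms), transcript.
        f_out.write(f"cb{i:06d}\t{path}\t{duration_ms:.2f}\t{transcript.lower()}\n")
```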

When I run training and generate the resulting graph of the model’s performance, it seems to me like the model is underfitting.

[Image: training curve suggesting the model is underfitting]

My understanding is that underfitting can happen when the data has not been normalised (is that right?). In my case, I think that means I should check that the sample rate and number of channels in my dataset match what was used to build the original model.
Would this be the way to resolve the underfitting? If so, what else do I need to do to normalise my data? I have also tried audio normalisation on each file so that they are all at the same volume.
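In case it helps, this is the kind of per-file preprocessing I've been trying. It's a rough sketch, assuming the pre-built acoustic model expects 16 kHz, mono, 16-bit WAV (that target rate is my assumption, not something I've confirmed for the specific model): it downmixes to mono, resamples, and peak-normalises each clip.

```python
import numpy as np
import soundfile as sf
from scipy.signal import resample_poly

TARGET_SR = 16000  # assumed sample rate of the pre-built acoustic model

def preprocess(in_path: str, out_path: str, target_peak: float = 0.95) -> None:
    audio, sr = sf.read(in_path, dtype="float32")

    # Downmix to mono if the recording has multiple channels.
    if audio.ndim > 1:
        audio = audio.mean(axis=1)

    # Resample to the rate the acoustic model is assumed to expect.
    if sr != TARGET_SR:
        audio = resample_poly(audio, TARGET_SR, sr)

    # Simple peak normalisation so all clips sit at a similar level.
    peak = np.abs(audio).max()
    if peak > 0:
        audio = audio * (target_peak / peak)

    sf.write(out_path, audio, TARGET_SR, subtype="PCM_16")
```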

If this is not the 'right' direction, what methods/techniques would be recommended for my particular task?

Any thoughts would be greatly appreciated.
