How to prepare Audio-text data for speech recognition

Data Science Asked by johnyc on February 27, 2021

I have gathered raw audio from the conferences, meetings, lectures, and casual conversations I was part of. Machine transcription (from Azure, AWS, etc.) did not give good results, so I would transcribe the audio myself to have both data and labels (audio + text) for ML training.

My question is whether to split the audio at silences into small (3-10 sec.) files and transcribe each one, or to keep a large file with a timestamped transcript in subtitle (.srt) format.
What if I have a long audio file with its full text? I have heard long files are more error-prone and give less accurate training. Would adding timestamps, as in an .srt subtitle file, help?
Do I need small audio files?
I tried training and testing with Azure Custom Speech, but it threw errors saying it won't process large audio files (so small chunks seem to be required).
What data-labelling criteria do the other ML platforms (AWS, Watson, GCP) have? Sorry, I couldn't find any besides MS Azure's.
Ideally I would build my own speech recognition system from a clean slate (I'm open to suggestions on model selection), but I need to know what format and style the data should be created in.

The way I see it, audio splitting (say, cutting a 30-minute recording into 200 parts) can be automated, but then how do I split the transcript into the matching 200 lines? Checking and inserting line breaks manually does not scale, so that's not a good option for large datasets.
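
For example, the splitting itself is easy to script with pydub; a minimal sketch, where the file name and silence thresholds are placeholders I would tune per recording:

    import os

    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    # Placeholder input file; tune the thresholds to the actual recordings.
    audio = AudioSegment.from_wav("meeting01.wav")
    chunks = split_on_silence(
        audio,
        min_silence_len=700,             # a pause of >= 700 ms counts as a cut point
        silence_thresh=audio.dBFS - 16,  # 16 dB below average loudness = "silence"
        keep_silence=200,                # keep 200 ms of padding so words are not clipped
    )

    os.makedirs("clips", exist_ok=True)
    for i, chunk in enumerate(chunks):
        chunk.export(f"clips/meeting01_{i:04d}.wav", format="wav")
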
That makes it important to decide on the data format before starting (so the transcribers can be given proper instructions).
So, the question again: starting from a clean slate, is it better (a) to have large audio files with timestamped transcripts, or (b) to have small audio files each paired with a single line of text? And how?
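
For concreteness, I picture option (b) as a folder of short clips plus a manifest file; a hypothetical example in the CSV layout used by Mozilla DeepSpeech's importers (file names, sizes, and transcripts below are made up):

    wav_filename,wav_filesize,transcript
    clips/meeting01_0000.wav,215084,good morning everyone and welcome
    clips/meeting01_0001.wav,118444,let us start with the first agenda item
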
Please guide me. I did a bit of research, but finally dared to post a question.

One Answer

In general, audio training tools require data segmented into small chunks. Segmenting is not a big problem: you can use segmentation scripts, for example segment_long_utterances.sh as described in the Kaldi group discussion. There are many other segmentation and alignment tools besides that, such as dsalign, gentle, or aeneas.
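
For instance, aeneas can force-align a long recording against its plain-text transcript and produce a sync map with start/end times per line; a minimal sketch following its documented Python API (paths are placeholders):

    from aeneas.executetask import ExecuteTask
    from aeneas.task import Task

    # Align one audio file against a plain-text transcript (one fragment per line);
    # all paths below are placeholders.
    config_string = "task_language=eng|is_text_type=plain|os_task_file_format=json"
    task = Task(config_string=config_string)
    task.audio_file_path_absolute = "/path/to/meeting01.wav"
    task.text_file_path_absolute = "/path/to/meeting01.txt"
    task.sync_map_file_path_absolute = "/path/to/meeting01_syncmap.json"

    # Run the alignment and write the JSON sync map, which can then be used
    # to cut the audio into chunks that match the transcript lines.
    ExecuteTask(task).execute()
    task.output_sync_map_file()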

Overall, training a model from scratch is a pretty complicated process which requires time and data (>5000 hours of properly annotated speech). It can take months to build a new model.

You'd better first investigate the reasons your error rate is bad; it might not be solvable with training. If information is already lost, you cannot really repair it. A better microphone, uncompressed data, and many other tricks can improve recognition accuracy significantly. You can also just adapt the language model to improve accuracy in case the standard language model is not a good fit for your content. It is easier than acoustic model training, and you just need texts, as described here.
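
As an illustration, language-model adaptation can be as simple as building an n-gram model from your own texts; a sketch assuming KenLM's lmplz binary is installed and the corpus file name is a placeholder:

    import subprocess

    # Assumption: KenLM's lmplz binary is on the PATH. Build a 3-gram ARPA
    # language model from domain texts (plain text, one sentence per line,
    # normalized the same way as the transcripts).
    with open("domain_corpus.txt", "rb") as src, open("domain.arpa", "wb") as dst:
        subprocess.run(["lmplz", "-o", "3"], stdin=src, stdout=dst, check=True)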

Answered by Nikolay Shmyrev on February 27, 2021
