
Segment 5-7 min audio into sentence-wise clips for creating a speech recognition dataset

Data Science Asked by Papasani Mohansrinivas on December 10, 2020

I am trying to create a speech recognition dataset, especially for Indian accents.
I am taking help from colleagues to build this.
Every day I send them an article link and ask them to record it and upload the recording to Google Drive.
I have a problem with this approach: all audio recordings are 5 to 7 minutes long, but I am using the DeepSpeech model, which requires roughly 10-second, sentence-length audio clips.
Please suggest an approach to segment the audio files into sentence-level clips, or a better way to work with the 5-minute-long recordings.
Suggestions on a better way to create a speech-to-text dataset are more than welcome.

I apologize in advance if this site is inappropriate for this question.

One Answer

The typical approach is to simply cut the recordings into consecutive sections and run the model on each section. Sometimes a bit of overlap is used, say 10%; then you have to decide how to resolve potential conflicts in those overlapping regions. A good model is usually robust to silence; otherwise, you can trim silence at the start and end of each 10-second window.

librosa.util.frame is a practical way of doing this in Python.
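A minimal sketch of this idea, assuming librosa and soundfile are installed; the file name, sample rate, and top_db threshold are placeholders to adapt:

    import librosa
    import soundfile as sf

    # Hypothetical input: one of the 5-7 minute recordings, loaded as 16 kHz mono.
    y, sr = librosa.load("recording.wav", sr=16000, mono=True)

    window_s = 10.0                          # target clip length in seconds
    overlap = 0.10                           # ~10% overlap between consecutive windows
    frame_length = int(window_s * sr)
    hop_length = int(frame_length * (1 - overlap))

    # Slice the signal into overlapping 10-second frames.
    # frames has shape (frame_length, n_frames); frames[:, i] is the i-th window.
    # Note: any trailing audio shorter than a full window is dropped.
    frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)

    for i in range(frames.shape[-1]):
        clip = frames[:, i]
        # Trim leading/trailing silence inside each window (threshold is a guess).
        clip, _ = librosa.effects.trim(clip, top_db=30)
        sf.write(f"clip_{i:03d}.wav", clip, sr)

If you would rather have clips that end at natural pauses instead of fixed offsets, librosa.effects.split (which returns the non-silent intervals of a signal) is another option to consider.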

Answered by jonnor on December 10, 2020
