
Segment 5-7 min audio into sentence-wise clips for creating a speech recognition dataset

Data Science Asked by Papasani Mohansrinivas on December 10, 2020

I am trying to create a speech recognition dataset, especially for Indian accents.
I am taking help from colleagues to build this.
Every day I send them an article link and ask them to record it and upload the recording to Google Drive.
I have a problem with this approach: all audio recordings are 5 to 7 minutes long, but I am using the DeepSpeech model, which requires roughly 10-second, sentence-length audio clips.
Please suggest an approach to segment the audio files into sentence-level clips, or a better way to work with the 5-minute-long recordings.
Suggestions on a better way to create a speech-to-text dataset are more than welcome.

I apologize in advance if this site is inappropriate for this question.

One Answer

The typical approach is to simply cut the recordings into consecutive sections and run the model on each section. Sometimes a bit of overlap is used, say 10%; then you have to decide how to resolve potential conflicts in those overlapping regions. A good model is usually robust to silence; otherwise, you can trim silence at the start and end of each 10-second window.

librosa.util.frame is a practical way of doing this in Python.
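A minimal sketch of this idea, assuming librosa and soundfile are installed; the file name, sample rate, and top_db threshold are placeholders to adapt:

    import librosa
    import soundfile as sf

    # Hypothetical input: one of the 5-7 minute recordings, loaded as 16 kHz mono.
    y, sr = librosa.load("recording.wav", sr=16000, mono=True)

    window_s = 10.0                          # target clip length in seconds
    overlap = 0.10                           # ~10% overlap between consecutive windows
    frame_length = int(window_s * sr)
    hop_length = int(frame_length * (1 - overlap))

    # Slice the signal into overlapping 10-second frames.
    # frames has shape (frame_length, n_frames); frames[:, i] is the i-th window.
    # Note: any trailing audio shorter than a full window is dropped.
    frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)

    for i in range(frames.shape[-1]):
        clip = frames[:, i]
        # Trim leading/trailing silence inside each window (threshold is a guess).
        clip, _ = librosa.effects.trim(clip, top_db=30)
        sf.write(f"clip_{i:03d}.wav", clip, sr)

If you would rather have clips that end at natural pauses instead of fixed offsets, librosa.effects.split (which returns the non-silent intervals of a signal) is another option to consider.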

Answered by jonnor on December 10, 2020
