
Downsampling audio files for use in Machine Learning

Data Science: Asked by Finn Maunsell on September 19, 2020

I’m trying to use the work (Neural Networks) done in this repo:
https://github.com/jtkim-kaist/VAD

It says this:

Note: To apply this toolkit to other speech data, the speech data
should be sampled with 16kHz sampling frequency.

I’ve got speech data at 48 kHz. I’ve read in places that reducing the sampling rate is a complicated process: you can’t just remove every nth data point, you have to filter the signal first…

Is this necessary if I only intend to use the data in the Neural Network toolkit provided by the repo I linked?
If so, is there an industry standard method for changing sample rate?

I realise that it probably depends on what features are being used. However the feature that is used is this:

MRCG (multi-resolution cochleagram) concatenates the cochleagram features at multiple spectrotemporal resolutions

This is a ruddy complicated feature! Let’s pretend we’re just using a mel spectrogram (unless you’re willing to answer the question from the perspective of MRCGs).

Neural networks are likely to pick up on features of a mel spectrogram that we wouldn’t think of. This makes me think it is unwise to train the neural net on downsampled speech data unless we intend to predict on 48 kHz data downsampled to 16 kHz forever after…

What do you think? Can I use my 48 kHz data, downsampled with no filtering, and expect the model to still work for prediction on real 16 kHz data?

And then, for future readers’ sake, how about the other way? Say I had an 8 kHz file: could I increase the sample rate to 16 kHz without filtering?

One Answer

Resampling audio is a standard process and there are many implementations available. In Python you can use librosa, or you can write a script that calls ffmpeg or a similar tool.
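As a minimal sketch of the librosa route (the file names here are placeholders): when you pass a target sr to librosa.load, it resamples with a proper anti-aliasing filter, so you don't have to worry about dropping every nth sample yourself.

    # Resample a 48 kHz speech file to 16 kHz (file names are placeholders).
    import librosa
    import soundfile as sf

    # librosa.load applies a band-limiting resampling filter when sr is given,
    # so aliasing is handled for you.
    y_16k, sr = librosa.load("speech_48k.wav", sr=16000)

    # Write the resampled audio back to disk for use with the VAD toolkit.
    sf.write("speech_16k.wav", y_16k, sr)

    # Roughly equivalent with ffmpeg on the command line:
    #   ffmpeg -i speech_48k.wav -ar 16000 speech_16k.wav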

If you want to reuse the already-trained model, this is critical, as the neural network has learned features based on 16 kHz input. It might work OK if you re-train the model on your data at 48 kHz. Training and prediction will be slower, but performance can be similar. However, you would need to update the audio_sr parameter of the audio feature extraction; otherwise the window length in milliseconds would be wrong, probably no longer matching what is typical for speech, resulting in degraded performance.
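To illustrate why the sample-rate parameter matters: a fixed window length in samples corresponds to very different durations at 16 kHz and 48 kHz. The 400-sample / 25 ms figures below are just common speech-processing defaults, not necessarily what this toolkit uses.

    # The same window length in samples covers different durations
    # depending on the sample rate: 400 samples is ~25 ms at 16 kHz
    # (a typical speech analysis window), but only ~8.3 ms at 48 kHz.
    def window_duration_ms(n_samples, sample_rate_hz):
        return 1000.0 * n_samples / sample_rate_hz

    print(window_duration_ms(400, 16000))  # 25.0 ms -> what 16 kHz features expect
    print(window_duration_ms(400, 48000))  # ~8.3 ms -> what you get if audio_sr is not updated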

Answered by jonnor on September 19, 2020
