Spoken utterance classification on RAVDESS using MFCC

Question

I want to classify audio clips according to which of two sentences is spoken. I don't want to use speech-to-text, because on-prem speech-to-text models are not good and I don't want to go to the cloud. So I am using the RAVDESS dataset, which is primarily intended for emotion detection. It contains two spoken statements: 01 = "Kids are talking by the door" and 02 = "Dogs are sitting by the door".
My approach is to label the recordings of these two sentences 0 and 1, extract MFCC features, and train a binary classifier. However, I cannot get accuracy above 70%. What could be the reason? I suspect one of the following:

Is the approach itself wrong?
Can MFCC features not be used for spoken-text classification?
Am I missing some feature engineering on the MFCCs before feeding them to the classifier?
Is there a problem with my code or classifier? I have pasted the code below.
Any other suggestions?
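
For reference, the labels come from the RAVDESS filename convention, in which the seven hyphen-separated fields are modality, vocal channel, emotion, intensity, statement, repetition and actor. The sketch below shows one way to read the statement field by splitting on hyphens rather than by character position; the helper name `statement_label` is hypothetical and is not part of the code further down.

```python
import os

# Statement codes used in RAVDESS filenames (fifth hyphen-separated field).
STATEMENTS = {
    "01": "Kids are talking by the door",
    "02": "Dogs are sitting by the door",
}


def statement_label(filename):
    """Return 0 for statement 01 and 1 for statement 02 (hypothetical helper)."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    fields = stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"unexpected RAVDESS filename: {filename!r}")
    code = fields[4]
    if code not in STATEMENTS:
        raise ValueError(f"unknown statement code {code!r} in {filename!r}")
    return int(code) - 1


# Example filename: the statement field is "02", so the label is 1.
print(statement_label("03-01-06-01-02-01-12.wav"))
```

For well-formed filenames this agrees with the `file[13:14]` slice used in the code below, since character 13 is the last digit of the statement field.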

    # Generate data for statement_type
    import os
    import time

    import librosa
    import numpy as np

    # path = '/content/drive/My Drive/Ravdess/'
    path = '/content/RAVDESS-emotions-speech-audio-only/Audio_Speech_Actors_01-24'
    lst = []

    start_time = time.time()

    for subdir, dirs, files in os.walk(path):
        for file in files:
            try:
                # Load the audio, compute 40 MFCCs per frame, then average over
                # time so that each file becomes one 40-dimensional vector.
                X, sample_rate = librosa.load(os.path.join(subdir, file), res_type='kaiser_fast')
                mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
                # file[13:14] is the last digit of the statement code ("01" or "02").
                # One-hot label: statement 01 -> [1, 0], statement 02 -> [0, 1].
                label = np.array([1, 0] if int(file[13:14]) - 1 == 0 else [0, 1])
                lst.append((mfccs, label))
            except ValueError:
                # If the file is not valid, skip it
                print("error at : ", file)
                continue
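
The step that turns `lst` into the arrays passed to `model.fit` is not shown in this post. A minimal sketch of what it could look like follows, assuming a random 80/20 split and using synthetic stand-in data in place of the real MFCC vectors. The names `x_traincnn`, `x_testcnn`, `y_train` and `y_test` match the training call below; everything else is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for `lst`: 10 (mfcc_mean, one_hot_label) pairs.
# In the real pipeline these come from the feature-extraction loop.
lst = [
    (rng.normal(size=40), np.array([1, 0] if i % 2 == 0 else [0, 1]))
    for i in range(10)
]

# Unzip the pairs into a feature matrix and a label matrix.
features, labels = zip(*lst)
X = np.asarray(features, dtype=np.float32)  # shape (n_samples, 40)
y = np.asarray(labels, dtype=np.float32)    # shape (n_samples, 2)

# Assumed split: shuffle once, then take 80% for training.
order = rng.permutation(len(X))
n_train = int(0.8 * len(X))
train_idx, test_idx = order[:n_train], order[n_train:]

# Add a trailing axis so each sample has shape (40, 1), as Input(shape=(40, 1)) expects.
x_traincnn = X[train_idx][..., np.newaxis]
x_testcnn = X[test_idx][..., np.newaxis]
y_train, y_test = y[train_idx], y[test_idx]

print(x_traincnn.shape, y_train.shape, x_testcnn.shape, y_test.shape)
```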

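    # Model and training
    # Imports (assuming tf.keras)
    from tensorflow import keras
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Input, LSTM, Dropout, Flatten, Dense, Activation

    model = Sequential()
    model.add(Input(shape=(40, 1)))
    model.add(LSTM(512, activation="relu", return_sequences=True))
    model.add(Dropout(0.3))

    model.add(Flatten())
    model.add(Dense(2))
    model.add(Activation('sigmoid'))
    opt = keras.optimizers.Adam(learning_rate=0.0001)  # `lr` is deprecated in recent Keras versions
    model.summary()

    model.compile(loss=keras.losses.BinaryCrossentropy(), optimizer=opt, metrics=['accuracy'])

    # x_traincnn / x_testcnn have shape (n_samples, 40, 1); y_train / y_test are one-hot labels
    trainhistory = model.fit(x_traincnn, y_train, batch_size=16, epochs=300, validation_data=(x_testcnn, y_test))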
