Training a sound localization neural network

Question

I am trying to train a neural network to estimate the direction a sound is coming from, as an angle in degrees from 0 to 180.

I am using TensorFlow Keras in Python to train the model.

The input data are two binaural cues: the ILD (Interaural Level Difference) and the ITD (Interaural Time Difference). Each input vector, consisting of both of these features, has dimensions [1, 71276]. I have a total of 2639 measurements, of which 10% are used as validation data and another 10% as test data.
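A minimal sketch of such an 80/10/10 split, assuming the features sit in a NumPy array X and the angles in y (the helper name split_80_10_10 is hypothetical, not from the question). Shuffling once with a fixed seed keeps the test set fixed across experiments. Keras' own validation_split takes the last samples before shuffling, so if the measurements are stored in angle order it would validate only on the largest angles; an explicit split avoids that.

```python
import numpy as np

def split_80_10_10(X, y, seed=0):
    """Hypothetical helper: shuffle once, then carve off 10% validation and 10% test."""
    n = len(X)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_val = int(0.1 * n)
    n_test = int(0.1 * n)
    val_idx = idx[:n_val]
    test_idx = idx[n_val:n_val + n_test]
    train_idx = idx[n_val + n_test:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

# Stand-in data with 2639 rows as in the question, but only 8 columns instead of
# 71276 so the demo stays light; replace with the real feature matrix and angles.
X = np.zeros((2639, 8), dtype=np.float32)
y = np.linspace(0, 180, 2639, dtype=np.float32)

train, val, test = split_80_10_10(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 2113 263 263
```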

The output should be an angle in the range [0,180].

I have normalized the data to the range [-1, 1], and the best loss I've been able to achieve is MSE = 16.
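The question doesn't say how the [-1, 1] scaling was computed, so here is one common way to do it, as a sketch: per-feature min-max scaling whose minimum and maximum come from the training split only, then reused unchanged for validation and test data so no information leaks from those sets. The function names are illustrative.

```python
import numpy as np

def fit_minmax(X_train):
    """Per-feature min and max, computed on the training split only."""
    return X_train.min(axis=0), X_train.max(axis=0)

def scale_to_pm1(X, lo, hi, eps=1e-12):
    """Map each feature linearly so that [lo, hi] -> [-1, 1]; eps guards constant features."""
    return 2.0 * (X - lo) / np.maximum(hi - lo, eps) - 1.0

# Random stand-in arrays; in practice these are the real training and validation features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
X_val = rng.normal(size=(20, 5))

lo, hi = fit_minmax(X_train)
X_train_s = scale_to_pm1(X_train, lo, hi)
X_val_s = scale_to_pm1(X_val, lo, hi)  # may fall slightly outside [-1, 1]; that is expected
```

For reading the loss: if the targets are raw angles in degrees, an MSE of 16 corresponds to a root-mean-square error of sqrt(16) = 4 degrees.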

The model that achieved the lowest MSE is the following:

import tensorflow as tf

# Fully connected regressor: 71276 input features -> one angle
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(71276,), name='input'),
    tf.keras.layers.Dense(units=900, activation='relu', name='dense_1'),
    tf.keras.layers.Dense(units=360, activation='relu', name='dense_2'),
    tf.keras.layers.Dense(units=180, activation='relu', name='dense_3'),
    tf.keras.layers.Dense(units=1, activation='linear', name='output')
])

model.compile(loss='mse',
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              metrics=['mae'])

EPOCHS = 500
BATCH_SIZE = 32

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', min_delta=0.5, patience=100, verbose=1),
    tf.keras.callbacks.ModelCheckpoint('best_model.h5', monitor='val_loss', mode='min', save_best_only=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=50, verbose=1, mode='min', min_delta=2, cooldown=0, min_lr=0.000001)
]

history = model.fit(X_train, y_train, validation_data=(X_val, y_val), shuffle=True,
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS, verbose=1,
                    callbacks=callbacks)

Since this is the first neural network I've trained on my own data, I'm wondering whether I've missed anything obvious that could reduce the loss. If not, any suggestions are welcome!

I should note that I'm using Google Colaboratory. I've already tried adding another hidden layer, but I got an out-of-memory error. I've also tried increasing and reducing the number of neurons in each layer without getting better results, and I tried a CNN architecture as well, with little success: it didn't even converge after 300 epochs.
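To see where the memory goes, it helps to count the weights of the model above. The sketch only applies weights = inputs × units + units (one bias per unit) to the question's layer sizes. The 3× factor for Adam is a rough estimate (Adam keeps two extra slot values per weight) and ignores gradients, activations, and framework overhead.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layers = [71276, 900, 360, 180, 1]  # input size, then the units of each Dense layer
counts = [dense_params(a, b) for a, b in zip(layers[:-1], layers[1:])]
total = sum(counts)
print(counts)  # [64149300, 324360, 64980, 181]
print(total)   # 64538821

# float32 weights take 4 bytes each; Adam's two slot tensors roughly triple that.
weights_mb = total * 4 / 1e6
print(round(weights_mb), round(3 * weights_mb))  # 258 774
```

dense_1 alone holds about 99.4% of the weights, so the 71276-wide input, rather than the number of hidden layers, is what dominates the weight memory.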

Thanks in advance!

rigo · Answer

Change your activation functions: use sigmoid or tanh for your final layer, and rescale the target angles to match its output range ([0, 1] for sigmoid, [-1, 1] for tanh).
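A small sketch of that target rescaling, assuming the raw targets are angles in [0, 180] degrees. The network is trained on the scaled targets, and its predictions are mapped back to degrees afterwards. The function names are illustrative.

```python
import numpy as np

ANGLE_MAX = 180.0  # targets in the question lie in [0, 180] degrees

def angles_to_sigmoid_range(theta):
    """[0, 180] -> [0, 1], matching a sigmoid output unit."""
    return np.asarray(theta, dtype=np.float64) / ANGLE_MAX

def sigmoid_range_to_angles(u):
    """Inverse of angles_to_sigmoid_range."""
    return np.asarray(u, dtype=np.float64) * ANGLE_MAX

def angles_to_tanh_range(theta):
    """[0, 180] -> [-1, 1], matching a tanh output unit."""
    return np.asarray(theta, dtype=np.float64) / (ANGLE_MAX / 2) - 1.0

def tanh_range_to_angles(v):
    """Inverse of angles_to_tanh_range."""
    return (np.asarray(v, dtype=np.float64) + 1.0) * (ANGLE_MAX / 2)

theta = np.array([0.0, 45.0, 90.0, 180.0])
u = angles_to_sigmoid_range(theta)  # values 0, 0.25, 0.5, 1
v = angles_to_tanh_range(theta)     # values -1, -0.5, 0, 1
```

Note that the loss is then reported on the scaled targets: with the sigmoid scaling, an MSE of 16 squared degrees corresponds to 16 / 180² ≈ 0.00049, so compare losses only after converting back to degrees.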

I would try a CNN again, but with different strides, filter sizes, and numbers of filters. The advantage of a CNN is that each convolutional layer slides one small shared kernel over the whole input, so it has far fewer parameters than a dense layer and you can afford to stack more layers.
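To make the parameter argument concrete for the question's 71276-long input, the sketch below computes the output length and parameter count of a stack of strided 1-D convolutions. The (kernel, stride, filters) values are arbitrary illustrations, not tuned settings; the length formulas are the standard ones for 'valid' and 'same' padding.

```python
import math

def conv1d_out_len(length, kernel, stride, padding="valid"):
    """Output length of a 1-D convolution for 'valid' or 'same' padding."""
    if padding == "same":
        return math.ceil(length / stride)
    return (length - kernel) // stride + 1

def conv1d_params(kernel, in_channels, filters):
    """Kernel weights shared across positions, plus one bias per filter."""
    return kernel * in_channels * filters + filters

# Hypothetical stack of (kernel, stride, filters) per layer, applied to the
# 71276-long input treated as a sequence of shape (71276, 1).
stack = [(9, 4, 16), (9, 4, 32), (9, 4, 32), (9, 4, 64)]

length, channels, total = 71276, 1, 0
shapes = []
for kernel, stride, filters in stack:
    length = conv1d_out_len(length, kernel, stride)
    total += conv1d_params(kernel, channels, filters)
    channels = filters
    shapes.append((length, channels))

print(shapes)  # [(17817, 16), (4453, 32), (1112, 32), (276, 64)]
print(total)   # 32544
```

These four layers have 32,544 weights in total, compared with 71276 × 900 + 900 = 64,149,300 for a single Dense(900) layer on the raw input. In tf.keras they would correspond to Conv1D(filters, kernel, strides=stride) layers on input reshaped to (71276, 1); pooling or a GlobalAveragePooling1D layer before the dense head would shrink the remaining 276 × 64 = 17,664 values further.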

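Here is an example of a convolutional network used for audio. It expects a 2-D, spectrogram-like input of shape 513 × 25 with one channel, so it would need adapting to your 1-D vectors: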

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Conv2D, LeakyReLU, MaxPooling2D,
                                     Dropout, Flatten, Dense)

model = Sequential()
# 2-D input, e.g. a spectrogram: 513 frequency bins x 25 time frames, 1 channel
model.add(Input(shape=(513, 25, 1)))
model.add(Conv2D(16, (3, 3), padding='same'))
model.add(LeakyReLU())
model.add(Conv2D(16, (3, 3), padding='same'))
model.add(LeakyReLU())
model.add(MaxPooling2D(pool_size=(3, 3)))
model.add(Dropout(0.25))
model.add(Conv2D(16, (3, 3), padding='same'))
model.add(LeakyReLU())
model.add(Conv2D(16, (3, 3), padding='same'))
model.add(LeakyReLU())
model.add(MaxPooling2D(pool_size=(3, 3)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(64))
model.add(LeakyReLU())
model.add(Dropout(0.5))
# Sigmoid output lies in [0, 1], so train on target angles divided by 180
model.add(Dense(1, activation='sigmoid'))

If you have time, try training an LSTM layer.
