Data Science: Asked by kfx on March 16, 2021
I have a Keras Xception-based model for gesture recognition. The accuracy of the model is around 60-70% for classifying 7 different gestures. The training dataset consists of 320×240 and 640×480 pixel images. Currently, I'm leaving the `input_shape` parameter of the model at the default value for the Xception model in Keras, which is `(299, 299, 3)`. I assume that under the hood the network rescales all inputs to 299×299 pixels, which probably isn't a good approach.
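As a quick check (not part of the original post), the default input size of the pretrained Keras Xception model can be confirmed directly:

```python
import tensorflow as tf

# With the default include_top=True, the ImageNet-pretrained Xception
# is built for a fixed (299, 299, 3) input.
model = tf.keras.applications.Xception(weights="imagenet")
print(model.input_shape)  # (None, 299, 299, 3)
```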
My questions are:

1. Is the Xception network optimized for the 299×299 input size, or can other input sizes be used just as well?
2. Why do the input height and width have to be equal?
3. Given images of two different resolutions, should I rescale them all to the smaller size or to the larger one?
For your first question: yes, it is optimized for that size, since the original Xception paper used 299×299 inputs. But you can use other sizes. Resizing your images to 299×299 would be the best option.
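As a sketch of that resizing step, assuming the training data is available as a tf.data.Dataset of (image, label) pairs (the pipeline below is hypothetical, not taken from the question):

```python
import tensorflow as tf
from tensorflow.keras.applications.xception import preprocess_input

TARGET_SIZE = (299, 299)  # input size used by the original Xception paper

def prepare(image, label):
    # Resize every image (320x240 or 640x480) to the 299x299 input the
    # pretrained Xception weights expect, then apply Xception preprocessing
    # (scales pixel values to the [-1, 1] range).
    image = tf.image.resize(image, TARGET_SIZE)
    image = preprocess_input(image)
    return image, label

# Hypothetical pipeline; replace train_ds with your own dataset.
# train_ds = train_ds.map(prepare).batch(32)
```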
For your second question, the reason height = width is that the convolutional filters used in the network are square (3×3 filters). The reason for using square filters in computer vision is that we assume image features are, most of the time, symmetric in the two spatial dimensions (an exception being text, where the information lies more along the vertical dimension than the horizontal; 1×2 filters are used there).
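For illustration only (these layers are not taken from Xception itself), the filter shape in Keras is set with the kernel_size argument of Conv2D, so square and rectangular filters look like this:

```python
import tensorflow as tf

# Square 3x3 kernel, the usual choice for natural images.
square_conv = tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), padding="same")

# Rectangular 1x2 kernel, as sometimes used when features are elongated
# in one direction (e.g. text).
rect_conv = tf.keras.layers.Conv2D(filters=32, kernel_size=(1, 2), padding="same")

# Both accept the same 4D input: (batch, height, width, channels).
x = tf.random.normal((1, 240, 320, 3))
print(square_conv(x).shape)  # (1, 240, 320, 32)
print(rect_conv(x).shape)    # (1, 240, 320, 32)
```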
For your third question, go for the smaller size. If you upscale the smaller images to the bigger size, you add no useful information, since the extra pixels are interpolated from the smaller image itself. You also end up with a model that has more parameters.
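A minimal sketch of that option, assuming include_top=False so that a custom input shape and a 7-class head can be attached (the head itself is an assumption, not the asker's actual model):

```python
import tensorflow as tf

SMALL_SIZE = (240, 320)  # (height, width) of the smaller images in the dataset

# Downscale the 640x480 images to the smaller resolution instead of
# upscaling the small ones; interpolation cannot add real information.
def downscale(image, label):
    return tf.image.resize(image, SMALL_SIZE), label

# Xception base with a custom input shape; include_top=False drops the
# original ImageNet classification head, so any size >= 71x71 is accepted.
base = tf.keras.applications.Xception(
    weights="imagenet",
    include_top=False,
    input_shape=(SMALL_SIZE[0], SMALL_SIZE[1], 3),
    pooling="avg",
)

# 7-class gesture head on top of the pretrained base.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(7, activation="softmax"),
])
```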
Answered by Abhishek Verma on March 16, 2021