
Reducing the size of a dataset

Data Science Asked by user5507 on November 16, 2020

I am trying to classify gestures using the classification algorithms in Python's scikit-learn library. I have collected depth images for this purpose: 200 samples per gesture, where each gesture consists of 25 frames and each frame is 240×420. I applied PCA frame-wise to reduce the dimensionality of each gesture (200 samples each) and make the data easier to process on my machine. Even so, the data is still too large to handle once the number of gestures to classify exceeds 4. I am looking for methods to make this run on my machine.

3 Answers

There are a number of ways to tackle this, I am going to focus on feature selection/extraction, because you mentioned PCA.

Sklearn itself already offers a few feature selection/extraction algorithms, see here, like SelectKBest. For your data this could mean selecting specific frames, samples, or even individual pixels (unlikely to help).
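As a rough sketch of how SelectKBest could look on flattened gesture data (synthetic arrays here, with frames downsampled from 240×420 to 24×42 purely for illustration):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
# Synthetic stand-in: 40 gesture samples, each flattened from 25 frames
# of 24x42 (downsampled here for illustration; the real frames are 240x420).
X = rng.normal(size=(40, 25 * 24 * 42))
y = rng.integers(0, 4, size=40)  # 4 gesture classes

# Keep only the 500 pixels whose ANOVA F-score best separates the classes.
selector = SelectKBest(f_classif, k=500)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (40, 500)
```

On random data the selection is of course meaningless; the point is only the mechanics of reducing tens of thousands of pixel features to a fixed budget before classification.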

Furthermore, it offers not only PCA but several other decomposition methods, see here; I will mention PCA, NMF, and ICA. While you apparently already tried PCA, it is important to note that these algorithms also have to be tuned correctly.
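One concrete tuning knob, sketched on toy data: sklearn's PCA accepts a float `n_components`, in which case it keeps just enough components to explain that fraction of the variance, rather than you guessing a fixed count.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Toy data: 200 flattened frames of a (downsampled) 24x42 grid.
X = rng.normal(size=(200, 24 * 42))

# Ask PCA to keep enough components to explain 95% of the variance,
# instead of hard-coding a component count.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```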

Now, on the other hand, as Graph4Me already mentioned, you can use a CNN. For this you can use an AutoEncoder architecture, which tries to learn a minimal representation of the input from which it can correctly restore the input as output. The Encoder-Decoder structure is trained end to end, and afterwards only the Encoder is used to obtain dimensionally reduced training samples. A tutorial (for PyTorch) is here; although it is for text, the same principle applies to images and videos.

As a final note, you can obviously try some simple preprocessing, like cropping the video, reducing the frame rate, converting to grayscale if not already done, or even binarizing to black and white. I also hope you are already processing your data sequentially wherever possible, if loading everything at once is the issue.
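The frame-rate and resolution reduction can be as simple as strided slicing in NumPy (a minimal sketch on random data; block averaging or a proper resize would preserve more detail):

```python
import numpy as np

rng = np.random.default_rng(2)
# One gesture: 25 depth frames of 240x420.
gesture = rng.random((25, 240, 420)).astype('float32')

# Halve the frame rate and downsample each frame 4x along both axes
# by simple strided slicing.
reduced = gesture[::2, ::4, ::4]
print(reduced.shape)  # (13, 60, 105)
```

This alone shrinks each gesture by a factor of about 30 before any learning happens.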

Correct answer by N. Kiefer on November 16, 2020

Your question is missing some details about your approach, so I will try to answer with the information given and point out what is missing.

Dataset

You have depth images recording different gestures. Each image has a resolution of 240x420, and you have 200 images per gesture. I assume each image has one channel (depth). A gesture consists of 25 images.

You have $>4$ classes, but you are limited to 4 classes due to computational issues.

Now do you want to classify based on a single image, or on 25 images which form the gesture? I assume you want the latter.

Next question: what takes too long with more than 4 classes? Training or inference? I assume PCA, as I discuss later.

If classification is performed on 25 images, the input size 240x420x25 is smaller than a single Full-HD RGB image (1920x1080x3, assuming the quantization is similar to an RGB camera), so this is still doable.

Thus, based on the input size, at least a neural network can be used if run on a powerful GPU.

On what machine do you want the ML system to run?

ML Pipeline

You do not explain exactly which data matrix you apply PCA to, or which classification algorithm you use.

I assume the following: As you indicate that the computational complexity increases with the number of classes, you apply PCA directly to the input images and the higher the number of classes, the bigger the data matrix and the more eigenvectors you compute, both resulting in high computational costs.

You should think about if PCA is really the right tool to use.

First of all, what is the size of the data matrix? If I understand your approach correctly, with $k$ classes you have $k \times 25 \times 200 \times 240 \times 420$ pixels. So with $k=5$ you have about $2.5 \times 10^9$ entries in your data matrix. This is indeed very big!
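Spelled out numerically (a quick back-of-the-envelope check):

```python
# Entries in the data matrix for k classes, 200 samples per class,
# 25 frames per sample, at 240x420 pixels per frame.
k = 5
entries = k * 200 * 25 * 240 * 420
print(entries)            # 2,520,000,000 entries
print(entries * 8 / 1e9)  # ~20.2 GB if stored as float64
```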

Further, it is true that PCA is some kind of data reduction. However, even if PCA was fast enough, there are some limitations you should be aware of.

PCA will find a linear embedding space in which your data lies. Using the first $k$ eigenvectors gives you the most important, principal components of your data.
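A minimal NumPy sketch of this mechanism, assuming small synthetic data: the right singular vectors of the centered data matrix are exactly those eigenvectors, ordered by decreasing variance.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 50))   # 200 samples, 50 features
Xc = X - X.mean(axis=0)          # center the data

# SVD of the centered data: the rows of Vt are the eigenvectors of the
# covariance matrix, sorted by decreasing explained variance.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 10
X_proj = Xc @ Vt[:k].T           # project onto the first k components
print(X_proj.shape)  # (200, 10)
```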

There are two big issues with that:

1.) PCA is applied directly to depth measurements, which are very noisy. Also, the measurement space is extremely large, so it is difficult for ML to interpret every possible image. Instead you should reduce the "complexity" of the input.

This is done by applying PCA/ML to image descriptors, which act like equivalence classes: they group together different inputs that are considered equal, e.g. you only look at the edges found in an image and do not care about the actual intensities, so images with the same edges are grouped together. There are many image descriptors (e.g. HOG, SIFT, SURF), which might need to be adjusted depending on the purpose. Another example: you might not care where exactly within the image the gesture is shown, i.e. you want translation invariance. Hence, you should use image descriptors (vectors) which also have translation invariance.
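To make the translation-invariance point concrete, here is a toy descriptor in pure NumPy (not HOG itself, just an illustration of the idea): a normalized histogram of gradient magnitudes, which by construction does not change when the gesture shifts within the frame.

```python
import numpy as np

def edge_histogram(img, bins=16):
    """Toy descriptor: normalized histogram of gradient magnitudes.
    Unlike raw pixels, it is unchanged when the content shifts in the frame."""
    gy, gx = np.gradient(img.astype('float32'))
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(mag, bins=bins, range=(0, 1))
    return hist / hist.sum()  # normalize so image size does not matter

img = np.zeros((240, 420), dtype='float32')
img[100:140, 200:260] = 1.0                                # a bright "hand" blob
shifted = np.roll(np.roll(img, 30, axis=0), -50, axis=1)   # same blob, moved

d1, d2 = edge_histogram(img), edge_histogram(shifted)
print(np.allclose(d1, d2))  # True: the descriptor ignores position
```

Real descriptors like HOG keep far more structure than this, but the principle is the same: the descriptor, not the raw pixels, is what you feed into PCA or the classifier.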

2.) Since PCA assumes a linear embedding space, this might be wrong, depending on your data.

You can use an Autoencoder (which is a CNN), which generalizes a PCA. It can be used as a dimensionality reduction tool, but it allows for non-linear embedding spaces.
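Sklearn has no autoencoder class, but as a minimal stand-in one can train an `MLPRegressor` to reproduce its own input and use the narrow hidden layer as the non-linear encoding (purely a sketch on random data; a real implementation would use a CNN framework):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)
X = rng.random((100, 64))  # toy flattened frames

# Train the network to reproduce its input; the 8-unit hidden layer
# then serves as a non-linear 8-dimensional encoding.
ae = MLPRegressor(hidden_layer_sizes=(8,), activation='relu',
                  max_iter=200, random_state=0)
ae.fit(X, X)

# Encode: forward pass through the first (encoder) layer only.
codes = np.maximum(0, X @ ae.coefs_[0] + ae.intercepts_[0])
print(codes.shape)  # (100, 8)
```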

Alternatives

In summary, I doubt PCA is the correct tool to use. I see several alternatives:

1.) You can start with a CNN (into which you can feed the 25 frames directly) and simply train it on the different classes. Even a vanilla CNN might already give satisfactory results, and it implicitly learns useful image descriptors, so you do not have to care about that. You can also use a recurrent network, which allows you to feed in one image at a time, and an Autoencoder for dimensionality reduction.

2.) There are many works focusing on action recognition in videos, based on a CNN, which is exactly what you want (https://towardsdatascience.com/deep-learning-architectures-for-action-recognition-83e5061ddf90).

3.) If the setup is very constrained you should use that prior knowledge, e.g. for hand gestures, you can first try to obtain the pose of the hand. This gives you a descriptor of the hand, which is then used to classify the gesture (e.g. using a CNN). Such an approach is much better than all the alternatives mentioned so far: if done properly, it is less prone to overfitting and does not depend on very large datasets.

4.) Use a proper, hand-crafted image descriptor and apply some ML classification tool on top of it. The image descriptor will drastically reduce the dimension and remove noise and redundancies.

5.) Alternatives 1.) and 2.) are somewhat black-box solutions which will more or less work but are prone to overfitting. Alternatives 3.) and 4.) are robust but difficult to design. If this project is for a company and you need a robust and reliable result, you could contact me, as I have worked on such problems for big companies already.

Answered by Graph4Me Consultant on November 16, 2020

I hope you are aware that NumPy's default dtype is float64, even when that precision is not required.

In this case you can easily change it to 'float16' with essentially no loss of information for depth data. For 10 gestures, this alone can reduce the size by about 30 GB.

import numpy as np
import sys

# Two identical depth frames; only image_1 is downcast to float16.
image_1 = np.ones((240, 420))
image_2 = np.ones((240, 420))
image_1 = image_1.astype('float16')

# Savings per frame, extrapolated to 10 gestures x 200 samples x 25 frames.
diff_bytes = sys.getsizeof(image_2) - sys.getsizeof(image_1)
total_diff = diff_bytes * 200 * 25 * 10  # assuming 10 gestures
total_diff_GB = total_diff / (10**9)
print('Memory saved in GB - ', total_diff_GB, ' GB')

Memory saved in GB - 30.24 GB


A more generic approach for such requirements is shown in the snippet below.
It checks each column's value range and downcasts to the smallest dtype that can hold it.

import numpy as np
import pandas as pd

dataset = pd.read_csv("/content/train.csv.zip")
init_mem = dataset.memory_usage().sum() / 1024**2
print('Initial memory size ' + str(init_mem) + ' MB')

for col in dataset.columns:
    col_type = dataset[col].dtype
    col_min_val = dataset[col].min()
    col_max_val = dataset[col].max()

    if str(col_type)[:3] == 'int':
        # Pick the smallest integer dtype whose range covers the column
        # (checking the minimum too, so negative values are not truncated).
        for dtype in [np.int8, np.int16, np.int32, np.int64]:
            if np.iinfo(dtype).min <= col_min_val and col_max_val <= np.iinfo(dtype).max:
                dataset[col] = dataset[col].astype(dtype)
                break
    elif str(col_type)[:3] == 'flo':
        # Same idea for floats (note that float16 also reduces precision,
        # not just range).
        for dtype in [np.float16, np.float32, np.float64]:
            if np.finfo(dtype).min <= col_min_val and col_max_val <= np.finfo(dtype).max:
                dataset[col] = dataset[col].astype(dtype)
                break
    else:
        dataset[col] = dataset[col].astype('category')

fin_mem = dataset.memory_usage().sum() / 1024**2
print('Final memory size ' + str(fin_mem) + ' MB')

Answered by 10xAI on November 16, 2020
