What is the best way to store images in Python for machine learning?

Data Science Asked on April 20, 2021

I am currently working on a classification problem in which I have to classify whether an image contains cancerous tissue cells or not. Each image is 50x50 pixels with 3 channels for the RGB values.

So far I have a pandas dataframe that contains the target value, patient id, image id and the path to the corresponding image.

I can access a single image with

    from skimage import io
    io.imread(df['path'].iloc[i])

So it is possible for me to loop through all the images to access them. The question now is, where do I store the images so that I can apply principal component analysis on them?
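
For example, I could flatten every image into a row of 7500 pixel values and stack the rows, roughly like this:

    import numpy as np
    from skimage import io

    # Read every image, flatten it to a 7500-long row, and stack everything
    # into one big (n_images, 7500) array.
    rows = []
    for path in df['path']:
        img = io.imread(path)          # shape (50, 50, 3), dtype uint8
        rows.append(img.reshape(-1))   # flatten to 7500 values
    X = np.stack(rows)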

If I were to simply store the result in a dataframe, it would contain 7500 columns, one for each pixel value. My dataset contains 280,000 images, so the dataframe would need to be 280,000×7500. I feel that there is a better way to approach this problem.

Your input on this matter would be highly appreciated.

2 Answers

This might be a bit more complicated.

I normally reuse computer vision and deep learning software for this, even when I am not doing deep learning.

In particular I use PyTorch, for its bridge with NumPy and pandas. Here is a tutorial.

This allows me to use a GPU if I want, and to reuse a lot of code, since there are tons of code snippets out there for deep learning with images.
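
Roughly, the idea would look something like the sketch below (just a sketch, assuming your image paths are in a path column and your labels in a target column):

    import torch
    from torch.utils.data import Dataset, DataLoader
    from skimage import io

    class TissueImageDataset(Dataset):
        """Wraps the dataframe so each row yields a (flattened image, label) pair."""
        def __init__(self, df):
            self.paths = df['path'].tolist()
            self.labels = df['target'].tolist()

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, idx):
            img = io.imread(self.paths[idx])              # (50, 50, 3), uint8
            img = torch.from_numpy(img).float() / 255.0   # scale to [0, 1]
            return img.reshape(-1), self.labels[idx]      # flatten to 7500 values

    # Iterate in batches instead of materializing all 280,000 images at once.
    loader = DataLoader(TissueImageDataset(df), batch_size=256, shuffle=False)
    for images, labels in loader:
        pass  # each images tensor has shape (256, 7500)

From there you can feed each batch into PCA, a model, or anything else that expects a (samples, features) array.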

Correct answer by Carlos Mougan on April 20, 2021

Yes, pandas won't work well for this. You can look at sparse data formats: https://docs.scipy.org/doc/scipy/reference/sparse.html

Or maybe check how it is done in TensorFlow.
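
For the TensorFlow route, a minimal tf.data sketch (assuming a recent TensorFlow 2.x and that df['path'] points at PNG/JPEG files) would stream the images from disk instead of holding them all in memory:

    import tensorflow as tf

    def load_image(path):
        raw = tf.io.read_file(path)
        img = tf.io.decode_image(raw, channels=3, expand_animations=False)
        img = tf.cast(img, tf.float32) / 255.0
        return tf.reshape(img, [-1])   # flatten to 7500 values

    ds = (tf.data.Dataset.from_tensor_slices(df['path'].values)
          .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
          .batch(256)
          .prefetch(tf.data.AUTOTUNE))

    for batch in ds:
        pass  # each batch has shape (256, 7500)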

Answered by Dirk Nachbar on April 20, 2021
