
When a dataset is huge, how do you train with all the images in it?

Data Science Asked by VansFannel on August 31, 2020

I’m using Python 3.7.7.

I’m trying to load a lot of NIfTI images with SimpleITK and NumPy from the BraTS 2019 dataset.

This is the code I use to load the images into a numpy array.

import SimpleITK as sitk


def read_nifti_images(images_full_path):
    """ 
    Read nifti files from a gziped file.
  
    Read nifti files from a gziped file using SimpleITK library.
  
    Parameters: 
    images_full_path (string): Full path to gziped file including file name.
  
    Returns: 
    SimpleITK.SimpleITK.Image, numpy array: images read as image, images read as numpy array 
  
    """
    # Reads images using SimpleITK.
    images = sitk.ReadImage(images_full_path)
    # Get a numpy array from a SimpleITK Image.
    images_array = sitk.GetArrayFromImage(images)
    
    # More info about SimpleITK images: http://simpleitk.github.io/SimpleITK-Notebooks/01_Image_Basics.html
    
    return images, images_array

This code works fine with a smaller dataset, but here I’m trying to load 518 nii.gz files with 155 images in each file.

To run the code I’m using the latest version of PyCharm on Windows 7.

How do you train with all the images if they can’t all fit in memory because of memory limits?

2 Answers

When you use Keras, you can use a generator, which loads images in batches rather than all at once.

See this post for a discussion on how to use (and predict) with the data generator: https://stackoverflow.com/questions/52270177/how-to-use-predict-generator-on-new-images-keras/55991598#55991598

See this code snippet for a full implementation of binary image classification using a pre-trained model and incorporating a data generator function: https://github.com/Bixi81/Python-ml/blob/master/keras_pretrained_imagerec_binaryclass.py

More details can be found in the Keras docs: https://keras.io/api/preprocessing/image/
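As a minimal sketch of the idea (separate from the linked code), a plain Python generator can load only one batch of files per step. The `load_fn` parameter and file names below are hypothetical stand-ins; in the question’s setting, `load_fn` would wrap `sitk.GetArrayFromImage(sitk.ReadImage(path))`:

```python
import numpy as np

def batch_generator(file_paths, labels, load_fn, batch_size=4):
    """Yield (volumes, labels) batches, reading only batch_size files at a time.

    load_fn: maps a file path to a numpy array; for NIfTI files this would be
             lambda p: sitk.GetArrayFromImage(sitk.ReadImage(p))
    """
    n = len(file_paths)
    while True:  # Keras expects the generator to loop over the data indefinitely
        for start in range(0, n, batch_size):
            paths = file_paths[start:start + batch_size]
            ys = labels[start:start + batch_size]
            # Only this batch's volumes are ever held in memory at once.
            volumes = [load_fn(p) for p in paths]
            yield np.stack(volumes), np.array(ys)
```

Passing such a generator to `model.fit` means at most `batch_size` volumes are in memory at any time, instead of all 518 files.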

Correct answer by Peter on August 31, 2020

A couple of options:

  1. Rent a bigger computer on a cloud service.
  2. Move to a distributed computing framework (e.g., Spark).
  3. Use a data loading function that only loads the needed data.

Option #3 is the simplest. Most training does not need all of the data in memory at the same time.

Dask can be used for this type of lazy image loading.
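As a rough sketch of that approach (the file names and the loader below are hypothetical placeholders), each file can be wrapped in `dask.delayed` so the full 4-D array exists only lazily, with chunks read from disk on demand; in the question’s setting `load_volume` would wrap `sitk.GetArrayFromImage(sitk.ReadImage(path))`:

```python
import numpy as np
import dask
import dask.array as da

def load_volume(path):
    # Placeholder loader; a real one would read the NIfTI file at `path`
    # with SimpleITK and return its numpy array.
    return np.zeros((155, 240, 240), dtype=np.float32)

paths = [f"volume_{i}.nii.gz" for i in range(518)]  # hypothetical file names

# Each file becomes one lazy chunk; nothing is read until .compute() is called.
lazy = [dask.delayed(load_volume)(p) for p in paths]
volumes = da.stack([da.from_delayed(v, shape=(155, 240, 240), dtype=np.float32)
                    for v in lazy])
# volumes has shape (518, 155, 240, 240), but only the chunks a computation
# actually touches are ever loaded into memory.
```

Slicing into `volumes` (e.g. `volumes[0].compute()`) then loads only the files backing that slice.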

Answered by Brian Spiering on August 31, 2020
