
Python code execution speed problem

Geographic Information Systems · Asked on January 30, 2021

I am trying to retrieve data from 324 EOPatches that I sampled beforehand. A patch is essentially a container of numpy arrays, as explained in the EOPatch documentation. This is an example of the content of a patch after resampling:

EOPatch(
  data: {
    FEATURES: FeatureIO(/data/FEATURES.npy)
    FEATURES_SAMPLED: FeatureIO(/data/FEATURES_SAMPLED.npy)
  }
  mask: {}
  scalar: {}
  label: {}
  vector: {}
  data_timeless: {}
  mask_timeless: {
    LULC: FeatureIO(/mask_timeless/LULC.npy)
    LULC_ERODED: FeatureIO(/mask_timeless/LULC_ERODED.npy)
    LULC_ERODED_SAMPLED: FeatureIO(/mask_timeless/LULC_ERODED_SAMPLED.npy)
  }
  scalar_timeless: {}
  label_timeless: {}
  vector_timeless: {}
  meta_info: {}
  bbox: BBox(((500077.09501641133, 5095402.981379905), (501746.98615037295, 5097085.523204274)), crs=CRS('32633'))
  timestamp: [datetime.datetime(2017, 1, 1, 0, 0), ..., datetime.datetime(2017, 12, 19, 0, 0)], length=23
)

To retrieve certain data from the patches as my training and testing data, I first loaded the 324 patches like this:

# load sampled eopatches
eopatches = []
path_out_sampled = './eopatches_sampled_Slovenia'

for idx in range(len(patchIDs)):
    eopatches.append(EOPatch.load(f'{path_out_sampled}/eopatch_{idx}', lazy_loading=True))

eopatches = np.array(eopatches)

And then I used:

labels_train = np.array([eopatch.mask_timeless['LULC_ERODED_SAMPLED'] for eopatch in eopatches[patchIDs]])
features_train = np.array([eopatch.data['FEATURES_SAMPLED'] for eopatch in eopatches[patchIDs]])

However, it took forever to execute and consumed an enormous amount of RAM.

patchIDs is a list of numbers ranging from 0 to 323.
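
For context, this is a rough way to estimate how much RAM the stacked features array will need before building it (a minimal sketch; it assumes every patch has the same FEATURES_SAMPLED shape and reuses path_out_sampled and patchIDs from the code above):

# Rough size estimate; assumes all patches share one FEATURES_SAMPLED shape.
probe = EOPatch.load(f'{path_out_sampled}/eopatch_0', lazy_loading=True)
per_patch_bytes = probe.data['FEATURES_SAMPLED'].nbytes  # accessing it loads this one array
total_gb = per_patch_bytes * len(patchIDs) / 1024**3
print(f'~{total_gb:.1f} GB needed just for features_train')

Note that the list comprehension briefly holds both the per-patch arrays and the final stacked copy, so the peak memory use is roughly double that figure.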

I tried to use this instead:

def data_retrieval(eopatches):
    feature_data = []
    label_data = []
    for eopatch in eopatches:
        # collect the sampled features and labels from each patch
        feature_data.append(eopatch.data['FEATURES_SAMPLED'])
        label_data.append(eopatch.mask_timeless['LULC_ERODED_SAMPLED'])
    # stacking still copies everything into two large in-memory arrays
    feature_data = np.array(feature_data)
    label_data = np.array(label_data)
    return feature_data, label_data

I just ended up with the same result.


I have also reinitialized the environment multiple times, but nothing changed, and I am using a GPU.
Using h5py didn't help either: I was only able to load 20 of the 324 patches.

One Answer

I suspect that you really do have a large dataset underneath your patch objects, and that the lazy loading is masking this until you start to retrieve specific subsets. I suggest you use a tool better suited to large datasets, such as h5py. Hopefully that or a similar tool will still meet your requirements.
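
As an illustration, the sampled arrays could be streamed into a single HDF5 file patch by patch, so that only one patch is ever held in memory at a time. This is a minimal sketch, assuming all patches share the same per-patch shapes; the file name 'training_data.h5' and the dataset names 'features' and 'labels' are made up for the example:

import h5py
from eolearn.core import EOPatch

path_out_sampled = './eopatches_sampled_Slovenia'
n_patches = 324

with h5py.File('training_data.h5', 'w') as h5:
    features_ds = None
    labels_ds = None
    for idx in range(n_patches):
        # load one patch at a time; lazy loading defers reading the arrays
        eopatch = EOPatch.load(f'{path_out_sampled}/eopatch_{idx}', lazy_loading=True)
        feats = eopatch.data['FEATURES_SAMPLED']
        lbls = eopatch.mask_timeless['LULC_ERODED_SAMPLED']
        if features_ds is None:
            # create the datasets once the per-patch shapes are known,
            # chunked by patch so single patches can be read back cheaply
            features_ds = h5.create_dataset('features', shape=(n_patches, *feats.shape),
                                            dtype=feats.dtype, chunks=(1, *feats.shape))
            labels_ds = h5.create_dataset('labels', shape=(n_patches, *lbls.shape),
                                          dtype=lbls.dtype, chunks=(1, *lbls.shape))
        features_ds[idx] = feats
        labels_ds[idx] = lbls
        del eopatch, feats, lbls  # free this patch's arrays before loading the next

Training subsets can then be read back as slices, e.g. h5['features'][patchIDs], without ever loading the whole file into RAM.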

Answered by Modern geoSystems on January 30, 2021
