Geographic Information Systems Asked by tynowell on March 28, 2021
I need to compare the values of two raster layers at a set of given points. The raster layers are at a national scale, split into tiles and are at 1m resolution (2+TB). The points number in the 100s of millions.
I have an implementation which works but is slow and consumes a lot of resources. Two virtual rasters of the layers are created and the points are split into ~1 million point chunks (arbitrarily selected). Rasterio is then used to sample the rasters at each point and the results (in vector format) are compared. The process takes about 20 mins / million points and tops out at about 10GB of RAM. I’ve parallelised the process but the limiting factor seems to be the RAM requirement. The CPU load is relatively low and the disk I/O does not appear to be a limiting factor.
The raster layers are already loaded in a local GeoServer. Using this to interface with the data should mitigate the RAM requirement and enable more parallel processes to run concurrently. However, I have been unable to find a way to sample a raster at a given point rather than downloading an area GeoTIFF.
I can go down the GeoTIFF route if that is the only option but it seem counterproductive to speeding up the process due to the fact that it must be written to disk first (even if its just a tempfile). So, I have two questions:
NB This process is taking place on a scalable GCP VM. If required it can be scaled up to 96 vCPUs and/or 624GB of RAM but I’d prefer not to brute force it and find a more elegant way. The scaling could apply to the GeoServers specs too.
If you want to query a point from WCS you can use subsets with slice points. You can read more about that from the WCS 2.0 Core standard and here is a GetCoverage example. GML may not be optimal outputformat but take this as a demonstration.
https://demo.geo-solutions.it/geoserver/wcs?SERVICE=WCS&REQUEST=GetCoverage&VERSION=2.0.1&CoverageId=nurc__mosaic&subset=Long(8.0)&subset=Lat(38)&format=gml
The interesting part of the output is this:
<gml:rangeSet>
<gml:DataBlock>
<gml:rangeParameters/>
<tupleList>42,38,79,255 </tupleList>
</gml:DataBlock>
</gml:rangeSet>
The tupleList presents in this case the values of R, G, B, and alpha band at given coordinates. Some WCS 2.0 servers may prefer subsets with ranges where lower limit = upper limit instead of slice points.
Correct answer by user30184 on March 28, 2021
Turns out that the RAM requirement, and the time to compute, does not scale linearly with the number of points. Subsequently, I am processing smaller batches of points in more threads and getting much faster speeds (~10 mins / million points / thread). I'm running 30 threads so ~20 secs / million points.
If anyone is interested, the code is posted below.
import os
import numpy as np
import geopandas as gpd
import rasterio as rio
from joblib import Parallel, delayed
def validate_package(package_slice, part):
def check_valid(geom_list, hgt=1.6):
"""
Check that at the given points, the value of raster_1 is at
least hgt more than raster_2
"""
points = [geom.coords[:][0] for geom in geom_list]
raster_1 = np.vstack(list(ds_raster_1.sample(xy=points)))
raster_2 = np.vstack(list(ds_raster_2.sample(xy=points)))
return raster_1 < raster_2 + hgt
ds_raster_1 = rio.open("/data/raster_1.vrt")
ds_raster_2 = rio.open("/data/raster_2.vrt")
gdf_in = gpd.read_file("/data/source_points.gpkg", driver="GPKG", rows=package_slice)
gdf_in['valid'] = check_valid(gdf_in.geometry)
gdf_out = gdf_in[gdf_in.valid].drop('valid', axis=1)
gdf_out.to_file(f"/data/valid_points_{part}.gpkg", layer=f"points_{part}", driver="GPKG")
return
# Define slices
file_size = ########
slices = [slice(r[0], r[-1]) for r in np.array_split(range(file_size), 500)]
# Process slices in parallel
Parallel(n_jobs=30)(delayed(validate_package)(package_slice, part) for part, package_slice in enumerate(slices))
Answered by tynowell on March 28, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP