Geographic Information Systems: Asked by K Kochanski on December 17, 2020
I have a time-series of rasters (daily maps of global temperature for the past year), and a shape-file outlining state boundaries.
I’d like to find the average temperature in each state on each day of the previous year, for a result like:
Alabama 01-01-2019 : 5.2 degrees
Alabama 01-02-2019 : 3.6 degrees
...
Wyoming 12-30-2019 : -3.1 degrees
Wyoming 12-31-2019 : -2.0 degrees
How would I do this in R? (Other project requirements prevent me from using Python or QGIS.)
The approach I’ve tried is:
library(sf)             #For `st_read`
library(raster)         #For `brick`
library(exactextractr)  #For `exact_extract`
shape = st_read("./states_shp")
temp_raster = brick("temperature_data.nc")
mean_temps = exact_extract(temp_raster, shape, 'mean')
This successfully masks my raster data using the shape file, but the 'mean' operation performs a temporal mean.
Is there a way I can modify this to get the spatial mean across each state while keeping the time dimension?
Edit: following a suggestion below, here’s the data. The shapefile is from the US Census (https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html):
> library(sf)
> shape = st_read("cb_2018_us_state_500k/cb_2018_us_state_500k.shp")
> head(shape)
Simple feature collection with 6 features and 9 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: -103.0026 ymin: 28.92861 xmax: -75.24227 ymax: 40.6388
geographic CRS: NAD83
STATEFP STATENS AFFGEOID GEOID STUSPS NAME LSAD ALAND
1 28 01779790 0400000US28 28 MS Mississippi 00 121533519481
2 37 01027616 0400000US37 37 NC North Carolina 00 125923656064
3 40 01102857 0400000US40 40 OK Oklahoma 00 177662925723
AWATER geometry
1 3926919758 MULTIPOLYGON (((-88.50297 3...
2 13466071395 MULTIPOLYGON (((-75.72681 3...
3 3374587997 MULTIPOLYGON (((-103.0026 3...
(I accidentally wrote this about precipitation, but the same ideas will carry over.)
What you're looking for is called zonal statistics.
Many packages do zonal statistics slowly, checking each cell of your gridded data for overlap against the polygon. Because the data are gridded, these calculations can be done much more quickly by tracking which cells a polygon's boundary traverses (see here for more details). Many of the slower packages also fudge their results: they check whether a grid cell overlaps the polygon at all, and if it does, the cell's entire value is assigned to the polygon. A more accurate method calculates the percentage of each cell that overlaps.
The method I detail below is both performant and accurate.
To demonstrate how to do it, let's acquire some climate data from MACA, specifically MACAv2-LIVNEH monthly data from the bcc-csm1-1 model over the historical 1990-2005 period for pr (precipitation), tasmin (minimum temperature), and tasmax (maximum temperature).
This data gives monthly values for climate variables on gridded data covering the US.
We can acquire the boundaries of the US counties from the US Census. I'll use the 5m data for speed.
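If you want to script the county-boundary download as well, here is a minimal sketch (the exact Census URL is an assumption and may change):
#Download and unzip the 5m county boundaries (URL assumed; adjust if the Census moves the file)
download.file("https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_5m.zip",
              "cb_2018_us_county_5m.zip", mode = "wb")
unzip("cb_2018_us_county_5m.zip")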
Now, we can achieve your zonal statistics as follows:
#Installation command, if needed
#install.packages(c("raster", "ncdf4", "sf", "exactextractr", "matrixStats", "dplyr", "stringr", "reshape2"))
#Alternative installation in case the package is unavailable on CRAN (unlikely)
#devtools::install_github("isciences/exactextractr")
library(exactextractr) #For `exact_extract`
library(matrixStats)   #For `colWeightedMeans`
library(raster)        #For `brick`
library(sf)            #For `st_read`
library(stringr)       #For `str_sub`
library(dplyr)         #For `%>%`, `mutate`, `summarize`, `across`
#Read in climate data
precip_fname = "macav2livneh_pr_bcc-csm1-1_r1i1p1_historical_1990_2005_CONUS_monthly.nc"
tasmin_fname = "macav2livneh_tasmin_bcc-csm1-1_r1i1p1_historical_1990_2005_CONUS_monthly.nc"
tasmax_fname = "macav2livneh_tasmax_bcc-csm1-1_r1i1p1_historical_1990_2005_CONUS_monthly.nc"
#A brick is a stack of rasters all of which have the same spatial extent and CRS
precip = raster::brick(precip_fname)
tasmin = raster::brick(tasmin_fname)
tasmax = raster::brick(tasmax_fname)
#Move from [0,360) system to [-180,180) system.
#Ignore warning: "this does not look like an appropriate object for this function"
precip = raster::rotate(precip)
tasmin = raster::rotate(tasmin)
tasmax = raster::rotate(tasmax)
#Read in county outlines
shape_orig = st_read("cb_2018_us_county_5m.shp")
#Function for examining the output
sanity_check = function(values, coverage){
  print(length(values))
  print(head(values))
  print(head(coverage))
  print(sum(0.0625*0.0625*111*111*coverage))
  stop()
}
shape = head(shape_orig) #For speed, look at only some polygons
exact_extract(precip, shape, fun=sanity_check, stack_apply=FALSE)
As you can see, I've written the above so that the function we're using for the zonal statistics prints its inputs and then stops on its first call. This gives us an opportunity to sanity-check the data, and it can also be helpful when debugging or experimenting with functions as we go. Alternatively, we could capture the values passed to the function into globals:
bob = NA
sanity_check = function(values, coverage){
  bob <<- values # Capture `values` to global `bob` using the special `<<-` operator
  stop()
}
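If we run the extraction with this version, the call errors out at the stop() (which is the point), but bob has already been filled and can be inspected interactively. A quick sketch:
exact_extract(precip, shape, fun=sanity_check, stack_apply=FALSE) #Errors at stop(), but `bob` is already filled
dim(bob)        #Rows are the cells intersecting the first polygon, columns are the months
head(bob[,1:3]) #Peek at the first few months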
The output of values looks like this
X1990.01.15 X1990.02.15 X1990.03.15 X1990.04.15 X1990.05.15 X1990.06.15
1 40.08998 104.5751 64.49851 125.1040 77.43653 112.2190
2 37.99795 109.5864 65.67377 129.4187 73.84611 108.2572
3 36.30927 106.8414 61.52228 126.5196 70.75780 108.2304
4 35.39827 109.0072 61.63351 128.3016 72.66797 110.2836
5 32.59171 101.9070 60.04919 130.0065 69.71605 110.9107
6 43.30358 107.1016 63.17825 128.2818 78.63950 112.5901
and the output of coverage looks like this
0.029508008 0.054003537 0.047953043 0.041902550 0.001358363 0.015525309
values has 192 rows, while coverage has 192 entries. These correspond to the 192 cells in the raster dataset which intersect the first polygon.
That first polygon, obtained via head(shape), is from Highland County, OH. Google gives it an area of 1445.21 square kilometers.
The coverage values indicate what fraction of each cell overlaps the polygon of interest. Checking print(precip) we see
geospatial_lat_resolution: 0.0625
geospatial_lon_resolution: 0.0625
We can therefore estimate the total area of the cells covering the first polygon (assuming that one degree of latitude and one degree of longitude are both 111 km) as
print(sum(0.0625*0.0625*111*111*coverage))
which gives 1857.395 km^2. If we were using a better approximation of the length of a degree of longitude, say 86.6km, we'd get 1449.103 square kilometers, which is almost exactly right.
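For reference, here is the same area check with a latitude-corrected degree of longitude; this is just a sketch that would replace the last print inside sanity_check, and the roughly 39 degrees N latitude of Highland County is an assumption:
#Sketch: area check with a latitude-corrected degree of longitude (~39 N assumed)
lat = 39
km_per_deg_lat = 111
km_per_deg_lon = 111 * cos(lat * pi / 180) #About 86 km at this latitude
print(sum(0.0625 * 0.0625 * km_per_deg_lat * km_per_deg_lon * coverage))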
So, if we want the average monthly precipitation in the polygon, we assume water falls homogeneously within each cell of the precip raster and take the spatial average of precipitation across the whole polygon. Therefore, we need a weighted average of each column of values with the coverage vector:
get_precip = function(values, coverage){
  values %>% summarize(across(everything(), ~ weighted.mean(.x, coverage, na.rm=TRUE)))
}
shape = head(shape_orig) #For speed, look at only some polygons
out = exact_extract(precip, shape, fun=get_precip, stack_apply=FALSE)
out is now a dataframe with a column for each month in the raster dataset:
X1990.01.15 X1990.02.15 X1990.03.15 X1990.04.15 X1990.05.15 X1990.06.15
1 41.00624 119.80429 66.97151 134.63554 82.209904 97.742532
2 100.64820 30.94722 311.52611 68.78153 9.951976 2.895623
3 124.48859 269.42761 91.15898 111.15357 28.810363 42.215396
4 53.80008 110.29271 61.91393 102.35224 136.917751 113.676423
5 83.20906 323.28954 117.29841 134.12935 15.661511 59.689629
6 51.07177 241.49114 96.58443 195.22793 74.009000 21.333790
but each row corresponds to one of the polygons in shape.
We can join this to our spatial data with cbind:
shape_out = cbind(shape, out)
A lot happened in get_precip. Let's break it down slightly: values got piped via %>% into dplyr's summarize, where we asked for a summary value from each column (across(everything())), and that summary value was calculated as the weighted.mean of the values in the column .x and the coverage vector (dplyr always treats columns as vectors). We chose to drop missing data values from the dataset with na.rm=TRUE. All of that returned a data frame with one column for each month (because we didn't drop or combine any columns from the input data frame values). The combined output of all the polygons is the rbind of these data frames.
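Since the original question wants one row per region per time step, it can help to reshape this wide result into that long layout. A quick sketch (it assumes shape and out from above, and that shape$NAME holds the polygon names, as in the Census files):
#Sketch: reshape the wide result into long format, one row per polygon per month
long_out = cbind(NAME = shape$NAME, out) %>%
  reshape2::melt(id.vars = "NAME", variable.name = "date", value.name = "precip")
head(long_out)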
Now, what if we wanted the spatial average of the total yearly precipitation? For that we'd want to sum together all the months of the year and then get a weighted spatial average:
get_precip = function(values, coverage){
  values %>% mutate(id=1:n()) %>%              # Give each cell (row) a unique numeric id
    reshape2::melt(id.var='id') %>%            # Convert from wide format to long format. Table now has columns id, variable, value where variable is, e.g., X1990.01.15
    mutate(year=str_sub(variable,2,5)) %>%     # Extract year from the variable column
    select(-variable) %>%                      # Drop variable column
    group_by(year, id) %>%                     # Group by year and cell id
    summarize(avg=sum(value, na.rm=TRUE)) %>%  # For each year in each cell, get the total precipitation
    reshape2::dcast(id~year) %>%               # Convert back to wide format where each column is a year
    select(-id) %>%                            # Drop id column
    summarize(across(everything(), ~ weighted.mean(.x, coverage, na.rm=TRUE))) # Get spatially-weighted average
}
shape = head(shape_orig) #For speed, look at only some polygons
out = exact_extract(precip, shape, fun=get_precip, stack_apply=FALSE)
#Combine with shape
shape_out = cbind(head(shape_orig), out)
You might be wondering whether the output of exact_extract is in a different order than shape; fortunately, the package author says the output order is the same as the input order.
Correct answer by Richard on December 17, 2020
When asking an R question, please include some example data like this (mostly taken from ?raster::extract):
library(raster)
cds1 <- rbind(c(-180,-20), c(-160,5), c(-60, 0), c(-160,-60), c(-180,-20))
cds2 <- rbind(c(80,0), c(100,60), c(120,0), c(120,-55), c(80,0))
polys <- spPolygons(cds1, cds2)
b <- brick(nl=3)
values(b) <- 1:prod(ncell(b), nlayers(b))
Solution
v <- extract(b, polys, fun=mean, na.rm=TRUE)
v
# layer.1 layer.2 layer.3
#[1,] 39277.99 104077.99 168878.0
#[2,] 31904.29 96704.29 161504.3
Or more exact(r):
vv <- extract(b, polys, weights=TRUE, fun=mean, na.rm=TRUE)
vv
#      layer.1   layer.2  layer.3
#[1,] 39266.06 104066.06 168866.1
#[2,] 31903.33  96703.33 161503.3
With each row a polygon (state) and each column a time step.
Or with exactextractr
library(exactextractr)
p <- as(polys, "sf")
ev <- exact_extract(b, p, "mean")
# mean.layer.1 mean.layer.2 mean.layer.3
#1 39267.17 104067.16 168867.2
#2 31903.69 96703.69 161503.7
So that works fine, contrary to what you say.
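Tying this back to your temperature setup, the same call gives you one row per state and one column per day. A sketch using the file names from your question (the column names will depend on the layer names in your NetCDF file):
library(sf)             #For `st_read`
library(raster)         #For `brick`
library(exactextractr)  #For `exact_extract`
shape = st_read("cb_2018_us_state_500k/cb_2018_us_state_500k.shp")
temp_raster = brick("temperature_data.nc")
state_means = exact_extract(temp_raster, shape, "mean") #Coverage-weighted spatial mean per state, per layer (day)
state_means$state = shape$NAME                          #Attach state names from the shapefile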
You can also do this "by hand":
e <- exact_extract(b, p) #One data frame per polygon: a column per layer plus a coverage_fraction column
f <- function(x) {
  # Weight every column (including coverage_fraction itself) by the coverage fraction and take column sums
  v <- colSums(x * x$coverage_fraction)
  # Divide by the last entry and drop it
  v <- v / v[length(v)]
  v[-length(v)]
}
ve <- sapply(e, f)
t(ve)
# layer.1 layer.2 layer.3
#[1,] 39764.59 105385.46 171006.3
#[2,] 32358.81 98083.23 163807.6
Answered by Robert Hijmans on December 17, 2020