Geographic Information Systems Asked on June 13, 2021
The setting
I have a GeoPackage point layer in QGIS 3.18 with over 350.000 features for an area of about 1700 km² (extent ca. 50 × 60 km), representing centroids of buildings. The points contain an attribute with the construction year of the building, ranging from 1000 to 2020. A few statistical values, based on Basic statistics for fields, can be found below. The CRS is EPSG:2056 (projected CRS for Switzerland, units = m).
What I want to do
The idea now is to find, for each building, the nearest building that is older, and to create an attribute nearest_older with the fid of this next older building – see the visualization at the bottom.
In a conceptual sense, it is similar to the concept of Topographic isolation: for a summit, find the minimum distance to a point of equal/higher elevation.
What I tried
I found an expression that works in principle, see it below. For some additional ideas on how to proceed, see the bottom as well.
The problem
The expression will not be efficient for such a huge data set (at least not in reasonable time). For testing, I reduced the computation time by: A) letting it run only for the first 999 features (if (fid < 1000)) and B) limiting the number of nearest features taken into consideration to 1000 (limit:=1000). Even with these limitations, the calculation runs for about 15 minutes.
The Question
How could the expression be improved for performance (calculation time)?
There will probably be no ideal solution, just an optimal one. So I'm interested in a solution that significantly reduces the calculation time so that it becomes operational for my dataset.
The expression
if (
    fid < 1000,
    with_variable (
        'cy',
        "construction_year",
        array_first (
            array_filter (
                overlay_nearest (
                    @layer,
                    $id,
                    limit := 1000
                ),
                attribute (
                    get_feature_by_id (
                        @layer,
                        @element
                    ),
                    'construction_year'
                ) < @cy
            )
        )
    ),
    ''
)
Explanation how the expression works
overlay_nearest() gets an ordered array of the id values of the x nearest neighbours (x here is set to 1000).
array_filter(): filters this array so that only elements (@element) are retained whose attribute() named construction_year is smaller than the variable @cy – and this variable is defined as the construction year of the current feature. So the filter returns, for each feature, an array of ids whose construction year is older than that of the current feature.
array_first() returns the first id of this array. As overlay_nearest from step 1 orders the array by closeness to the current feature, this is the id of the nearest_older I am looking for.
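The filter-then-take-first logic of these three steps can be sketched in plain Python (illustrative only, with made-up fid/year pairs – this is not PyQGIS):

```python
# Plain-Python sketch of the expression's logic: given neighbours already
# ordered by distance, return the fid of the first one whose construction
# year is older than the current feature's.

def nearest_older(current_year, neighbours_by_distance):
    """neighbours_by_distance: (fid, construction_year) tuples, nearest first."""
    for fid, year in neighbours_by_distance:
        if year < current_year:   # same test as the array_filter() condition
            return fid            # same as array_first(): first survivor wins
    return None                   # no older neighbour within the limit

# Made-up data mirroring the visualization: four newer neighbours come first
neighbours = [(7, 1986), (3, 1969), (9, 1996), (4, 1960), (2, 1940)]
print(nearest_older(1958, neighbours))  # -> 2 (the 1940 building)
```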
Further ideas how to improve performance
Here are a few ideas I had how to make the expression more efficient. All of these approaches have some shortcomings, however. And particularly, I’m stuck how to combine them in the most efficient way. And probably there are other approaches, not considered yet.
Run it sequentially, like in the if-clause above: for features 1 to 999, then for 1000 to 1999 and so on. Not very efficient, however.
limit:=1000, which caps the number of elements in the array created by overlay_nearest() at 1000, is inflexible: the newer a building is, the higher the chances that you'll find an older one very close to it. Thus, for the majority of the buildings (constructed in the last few decades), you won't need to identify a fixed number of 1000 nearest neighbors – 50 or 100 or so would be OK. So the fixed value 1000 could be replaced by a formula that returns a value inversely proportional to the construction year. However, how to get an "optimum" formula, based on the distribution of the values in my field construction_year? For this, compare the statistical values below.
Not for every feature does a "match" have to be found in one pass. For some features, no matching id of an older building could be found in the first round using the limits defined. These NULL cases could then be calculated (based on a condition if NULL) in a next iteration. So an iterative approach could be used – but how to set it up?
A few statistical values
These values are based on Basic statistics for fields for the field construction_year:
Count: 336856
Unique values: 629
Minimum: 1000
Maximum: 2020
Range: 1020
Mean: 1954
Median: 1971
Standard deviation: 64.5
Coefficient of Variation: 0.033
Minority (rarest occurring value): 1004
Majority (most frequently occurring value): 1980
First quartile: 1935
Third quartile: 1994
Interquartile Range (IQR): 59
Histogram, created with the DataPlotly plugin: distribution of construction years, with the highest peak at 1980/81, lower peaks at 1928–1931 etc.:
Visualization
This shows the principle: each point is labeled with its construction year and connected by a red arrow to the nearest point with an older construction year. The point labeled 1958 at the very bottom is connected to a point labeled 1940, even though it has four neighboring points at a nearer distance, but with newer construction dates: 1986, 1969, 1996 and 1960 – so it goes on until the first (nearest) point with an older construction date is found:
Here is a more universal processing tool for this task with some advantages over my basic script:
# V1.3.2
from PyQt5.QtCore import QCoreApplication, QVariant
from qgis.core import (QgsSpatialIndex, QgsProcessingParameterFeatureSink, QgsFeatureSink, QgsField, QgsFields, QgsFeature, QgsGeometry, QgsPoint, QgsWkbTypes,
                       QgsProcessingAlgorithm, QgsProcessingParameterField, QgsProcessingParameterBoolean, QgsProcessingParameterVectorLayer, QgsProcessingOutputVectorLayer, QgsProcessingParameterEnum, QgsProcessingParameterNumber)
import operator

class NearNeighborAttributeByAttributeComparison(QgsProcessingAlgorithm):
    SOURCE_LYR = 'SOURCE_LYR'
    SOURCE_FIELD = 'ID_FIELD'
    ATTRIBUTE_FIELD = 'ATTRIBUTE_FIELD'
    MAX_NEIGHBORS = 'MAX_NEIGHBORS'
    MAX_DISTANCE = 'MAX_DISTANCE'
    DONOT_COMPARE_VALUE = 'DONOT_COMPARE_VALUE'
    DONOT_COMPARE_BOOL = 'DONOT_COMPARE_BOOL'
    OPERATOR = 'OPERATOR'
    OUTPUT = 'OUTPUT'

    def initAlgorithm(self, config=None):
        self.addParameter(
            QgsProcessingParameterVectorLayer(
                self.SOURCE_LYR, self.tr('Source Layer')))  # take any source layer
        self.addParameter(
            QgsProcessingParameterField(
                self.SOURCE_FIELD, self.tr('Attribute field containing unique IDs'), 'id', 'SOURCE_LYR'))
        self.addParameter(
            QgsProcessingParameterField(
                self.ATTRIBUTE_FIELD, self.tr('Attribute field for comparison'), 'year', 'SOURCE_LYR'))
        self.addParameter(
            QgsProcessingParameterNumber(
                self.MAX_NEIGHBORS, self.tr('Maximum number of nearest neighbors to compare (use -1 to compare all features of the layer)'), defaultValue=1000, minValue=-1))
        self.addParameter(
            QgsProcessingParameterNumber(
                self.MAX_DISTANCE, self.tr('Only take nearest neighbors within this maximum distance into account for comparison'), defaultValue=10000, minValue=0))
        self.addParameter(
            QgsProcessingParameterEnum(
                self.OPERATOR, self.tr('Operator to compare the attribute value (if the attribute is of type string, only == and != do work)'),
                ['<','<=','==','!=','>=','>'], defaultValue=0))
        self.addParameter(
            QgsProcessingParameterNumber(
                self.DONOT_COMPARE_VALUE, self.tr('Do not search for matches on features having a value (insert chosen operator here) x \n Only works for numerical values and depends on the chosen operator. \n Typically this should be the max or min value available in the attributes, so there cannot be a match.'), defaultValue=0))
        self.addParameter(
            QgsProcessingParameterBoolean(
                self.DONOT_COMPARE_BOOL, self.tr('Check this box to actually use the previous option ("Do not search for matches on ...")'), defaultValue=0))
        self.addParameter(
            QgsProcessingParameterFeatureSink(
                self.OUTPUT, self.tr('Near Neighbor Attributes')))  # output

    def processAlgorithm(self, parameters, context, feedback):
        # Get the parameters and assign them to variables to work with
        layer = self.parameterAsLayer(parameters, self.SOURCE_LYR, context)
        idfield = self.parameterAsString(parameters, self.SOURCE_FIELD, context)
        idfield_index = layer.fields().indexFromName(idfield)  # get the field index of the id field
        idfield_type = layer.fields()[idfield_index].type()  # get the field type of this field
        attrfield = self.parameterAsString(parameters, self.ATTRIBUTE_FIELD, context)
        attrfield_index = layer.fields().indexFromName(attrfield)  # get the field index of the attribute field
        attrfield_type = layer.fields()[attrfield_index].type()  # get the field type of this field
        maxneighbors = self.parameterAsDouble(parameters, self.MAX_NEIGHBORS, context)
        maxdistance = self.parameterAsDouble(parameters, self.MAX_DISTANCE, context)
        donotcomparevalue = self.parameterAsDouble(parameters, self.DONOT_COMPARE_VALUE, context)
        donotcomparebool = self.parameterAsBool(parameters, self.DONOT_COMPARE_BOOL, context)
        op = self.parameterAsString(parameters, self.OPERATOR, context)
        op = int(op[0])  # get the index of the chosen operator
        ops = {  # get the operator by this index
            0: operator.lt,
            1: operator.le,
            2: operator.eq,
            3: operator.ne,
            4: operator.ge,
            5: operator.gt
        }
        op_func = ops[op]  # create the operator function
        total = 100.0 / layer.featureCount() if layer.featureCount() else 0  # initialize progress for the progress bar
        # If -1 has been chosen for the maximum features to compare, use the feature count of the layer, else use the given input
        if maxneighbors == -1:
            maxneighbors = layer.featureCount()
        fields = layer.fields()  # get all fields of the input layer
        fields.append(QgsField("near_id", idfield_type))  # create a new field of the same type as the input field
        fields.append(QgsField("near_attr", attrfield_type))  # same here for the attribute field
        fields.append(QgsField("near_dist", QVariant.Double, len=20, prec=5))  # add a new field of type double
        idx = QgsSpatialIndex(layer.getFeatures())  # create a spatial index
        (sink, dest_id) = self.parameterAsSink(parameters, self.OUTPUT, context,
                                               fields, layer.wkbType(),
                                               layer.sourceCrs())
        for current, feat in enumerate(layer.getFeatures()):  # iterate over the source
            new_feat = QgsFeature(fields)  # copy the source fields + the appended ones
            attridx = 0  # reset the attribute field index
            for attr in feat.attributes():  # iterate over the attributes of the current source feature
                new_feat[attridx] = attr  # copy the attribute values over to the new layer
                attridx += 1  # go to the next field
            new_feat.setGeometry(feat.geometry())  # copy over the geometry of the source feature
            if ((not(op_func(feat[attrfield], donotcomparevalue))) or (not donotcomparebool)):  # only search for matches if not told otherwise
                nearestneighbors = idx.nearestNeighbor(feat.geometry(), neighbors=maxneighbors, maxDistance=maxdistance)  # get the feature ids of the specified maximum number of near neighbors within the maximum distance
                try:
                    nearestneighbors.remove(feat.id())  # remove the current feature from this list (otherwise the nearest feature by the == operator would always be itself...)
                except:
                    pass  # ignore on error
                for near in nearestneighbors:  # iterate over the nearest features (the index is already sorted by distance, so the first match will be the nearest match)
                    if op_func(layer.getFeature(near)[attrfield], feat[attrfield]):  # if the near feature's attribute compares (by the chosen operator) to the current feature's one, then
                        new_feat['near_id'] = layer.getFeature(near)[idfield]  # get the near match's id value and fill the current feature with it
                        new_feat['near_attr'] = layer.getFeature(near)[attrfield]  # also get the attribute value of this near feature
                        new_feat['near_dist'] = feat.geometry().distance(layer.getFeature(near).geometry())  # and finally calculate the distance between the current feature and the nearest matching feature
                        break  # break the for loop over near features and continue with the next feat
            else:  # do not search for near neighbor matches if the given value compares (by the chosen operator) to x
                pass  # do nothing and continue adding the feature
            sink.addFeature(new_feat, QgsFeatureSink.FastInsert)  # add the feature to the output
            feedback.setProgress(int(current * total))  # set the progress in the progress bar
            if feedback.isCanceled():  # cancel the algorithm if the button is pressed
                break
        return {self.OUTPUT: dest_id}  # return the result of the algorithm

    def tr(self, string):
        return QCoreApplication.translate('Processing', string)

    def createInstance(self):
        return NearNeighborAttributeByAttributeComparison()

    def name(self):
        return 'NearNeighborAttributeByAttributeComparison'

    def displayName(self):
        return self.tr('Add near neighbor attribute by comparing attribute values')

    def group(self):
        return self.tr('FROM GISSE')

    def groupId(self):
        return 'from_gisse'

    def shortHelpString(self):
        return self.tr(
            'This algorithm searches the \n'
            '- x nearest neighbors \n'
            '- within a given maximum distance \n'
            'of the current feature and compares a given attribute. \n'
            'If this comparison returns true, it adds the id and the attribute of this neighbor to the current feature, as well as the distance to this neighbor. \n\n'
            'Further explanations available on https://gis.stackexchange.com/a/396856/107424'
        )
To use it, just copy-paste the code without any changes and place the Python file in C:\Users\yourusername\AppData\Roaming\QGIS\QGIS3\profiles\default\processing\scripts.
See the docs for more information. Once saved, you will find it in your processing toolbox under "Scripts" -> "FROM GISSE".
Only tested in QGIS 3.18.2. I can already tell this won't work in QGIS 3.2 (as intended), as this version has no option to limit the maximum search distance. I don't know when this feature was introduced, though. However, feel free to just test it in your version; the worst thing that can happen is an error or a crash. In that case, upgrade your QGIS or edit the script accordingly.
Some further explanations on this script and the used method:
First, a spatial index is built on the layer. Then we iterate over the whole layer. For each "actual feature" (naming it this way to avoid confusion with my term "near feature") we get the nearest neighbors of the current actual feature by using the QgsSpatialIndex().nearestNeighbor(<current_point>, <maximum_neighbors>, <maximum_distance>) method. This method returns a list of the nearest feature ids, ordered ascendingly by distance. Then, to prevent the feature from finding itself when using the == operator, we remove the current actual feature's id from this list.
Now we iterate over this ordered list of near feature ids (unless limited by the "Do not search for matches on ..." option). So first we get the very closest other feature by its id and check whether it fulfills our given attribute comparison criterion. If not, we go to the second nearest feature. Check again. Go to the third nearest... and so on. When we finally find the first one fulfilling the criterion, we grab its two attributes by a focused layer.getFeature(<near_matching_feat_id>)[<field>] request and calculate the distance using <current_actual_feat>.geometry().distance(layer.getFeature(<near_matching_feat_id>).geometry()). These two operations should not add any calculation time worth mentioning; they are also done only once per actual feature, as we do not iterate here. As soon as this is done, we stop (break command) the iteration over the ordered list of near features and go on to the next actual feature.
Runtime and optimal settings:
So if I am not completely mistaken, the calculation time in the best case should be t = time_needed_for((number_of_actual_features-1)*1) + time_needed_for(creating_index). The best case is when, for every actual feature, the very closest other (near) feature fulfills the given attribute comparison criterion. In the worst case it should be t = time_needed_for((number_of_actual_features-1)^2) + time_needed_for(creating_index). The worst case is when, for every actual feature, only the very farthest other (near) feature fulfills the criterion. So we have three options to prevent the worst case from happening. First, limit the maximum number of near features to compare: when limiting to max. 1000 comparisons, the worst case becomes t = time_needed_for((number_of_actual_features-1)*1000) + time_needed_for(creating_index), if there are at least 1000 actual features. Second, less helpful but still, limit the maximum distance of near features to compare; this is less helpful because, theoretically, all other near features could still lie within that distance. As a third option, we can limit the search by an attribute value. This makes sense when using the available minimum attribute with the < or <= operator, or the available maximum attribute with the > or >= operator. This option prevents searching where there cannot be a match anyway: e.g. when using < and the minimum year available, the algorithm is prevented from searching for features that have that minimum year.
So the optimal settings strongly depend on your individual layer and your individual desired result. On the individual layer, because the runtime depends on the spatial and attributional distribution of your features. The algorithm searches for the nearest feature matching the attribute condition, ordered by distance. If the very nearest feature is always a match, the runtime is short. If not, the more features need to be compared, the greater the runtime. Here your personal desired result comes into play: you can limit the number of features to compare and/or their maximum distance. Let's say you limit the number of features to compare to 100: if there is no attributional match within the 100 closest features, the loop ends and the current actual feature simply gets no result. So no result, but increased runtime. But if there is a match at, let's say, the 51st closest feature, this setting has absolutely no effect, as the loop is left anyway after the 51st comparison.
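The point that the neighbour limit only matters when no match exists within it can be illustrated with a toy count of comparisons (assumed data, not the actual layer):

```python
# Count how many comparisons a feature needs before the first older
# neighbour is found, given a list of neighbour years ordered by distance.

def comparisons_until_match(years_by_distance, current_year, limit):
    """Return (position_of_match or None, number_of_comparisons_done)."""
    for i, year in enumerate(years_by_distance[:limit], start=1):
        if year < current_year:
            return i, i          # match at the i-th neighbour: i comparisons
    return None, min(limit, len(years_by_distance))

years = [2000] * 50 + [1900] + [2000] * 949   # match sits at position 51
print(comparisons_until_match(years, 1958, 100))   # match found: 51 comparisons
print(comparisons_until_match(years, 1958, 1000))  # larger limit, same cost
print(comparisons_until_match(years, 1850, 100))   # no match: all 100 wasted
```

With a match at position 51, limits of 100 and 1000 cost exactly the same; only the no-match case pays the full limit.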
Here is an example I tested: 1000 randomly distributed points within a given extent of 10 km × 10 km, with random years between 1850 and 2020. No attribute search limitations were used in this test.
The newly implemented option in V1.3 ("Do not search for matches on ...") can prevent a bad case (like in the last example setting): when there can't be a match, it avoids searching at all. But this option can also be useful in other cases where one wants to limit the match searching.
Possible improvements:
So far, not being an expert in Python, I think the performance could maybe be improved by using an attribute index. This could allow directly filtering out (non-)matches. However, my question whether such a thing exists is still unanswered, and I am not sure whether my "workaround" would improve the performance a lot, or whether an attribute index helps here in combination with a spatial index. (But thinking about the whole thing again, the performance and result can be improved by adding another input where you manually specify a year below which no search at all is done, e.g. the minimum year, for which there can't be a result anyway. --> just implemented that one in V1.3)
Surely there are other things as well that a Python expert could optimize in this script. Processing tools run as background threads; that's why QGIS stays responsive while executing it.
Why not using expressions here?
Now let's come to the question of why expressions are a lot slower here. In my opinion, and to my knowledge, the reasons are: they do not use a spatial index and, especially, the calculation is feature based. So basically overlay_nearest() aggregates the whole layer as many times as the layer has features. The big difference to the Python script is that we build a spatial index only once and do not aggregate anything; we get the nearest neighbors by using this index on each iteration. As already said, I am not 100% sure here, so I welcome it if my statement can be confirmed or disproved.
Correct answer by MrXsquared on June 13, 2021
If you are open to a PyQGIS solution you can use this script. It takes about two seconds on my test layer with 1000 features.
layer = iface.activeLayer()  # get the layer
layer.dataProvider().addAttributes([QgsField('nearest_older_fid', QVariant.Int)])  # add a new field
layer.updateFields()  # update the fields so we can work with the new one
idx = QgsSpatialIndex(layer.getFeatures())  # create a spatial index
nfeats = layer.featureCount()  # count the total features of the layer

with edit(layer):  # edit the layer
    for feat in layer.getFeatures():  # iterate over the layer's features
        for near in idx.nearestNeighbor(feat.geometry(), nfeats):  # for each feature iterate over the nearest ones
            if layer.getFeature(near)['year'] < feat['year']:  # if the near feature's year is smaller than the current feature's
                feat['nearest_older_fid'] = layer.getFeature(near)['id']  # get the nearest feature's id attribute and fill the current feature with its value
                # Alternatively you could also use:
                # feat['nearest_older_fid'] = near  # this will not return the attribute "id" but directly the feature id of the nearest feature
                layer.updateFeature(feat)  # update the feature
                break  # break the for loop of near features and continue with the next feat
You may need to adjust the field names.
Answered by MrXsquared on June 13, 2021
Testing-Results: best settings for the script by MrXsquared
Finally, I can present the results of my testing series, based on the Python script (accepted answer) by MrXsquared. To anticipate the result: a setting of max_distance 500 meters and max_neighbors 50 was the most performant one. The shortest way to find all results is an iterative approach, with relatively low settings for the first run and then increased values for the remaining features where no "nearest older neighbor" was found in the first run.
It took just a bit over half an hour in three iterations to get a "nearest older" neighbor for all points in my dataset. Running the script did not block the use of Windows or even QGIS. While the script was running (with no other activity on the machine), the utilized capacity was about 32%. Details below.
Influence of the computer used on performance
However: the influence of the machine you use should not be underestimated and could even affect the performance more than the settings used. I ran all the tests on the same machine (8 CPUs, 2 GHz, 16 GB RAM). To compare, however, I ran the "winning setting" on another (older) machine as well (4 CPUs, 3.33 GHz, 8 GB RAM): there it took 2114 seconds (instead of 1772, thus more than 19% longer) to get the same result (test no. 8 below in the table).
The basics
My layer consists of 336.856 point features. The idea was to identify the maximum possible number of nearest neighbors in a minimum of time: to make the calculation as effective as possible. I tested different settings for maximum distance and maximum number of nearest neighbors to compare.
The tests
Below you see the table of the results, running the script on the same set of 336.856 points with 22 different combinations of values for max_distance and max_neighbors. The columns show: the settings, the calculation time in seconds as well as in min:sec, the number of features for which a "nearest older" was found, in absolute value and as a percentage of the whole data set, and the same for the number of features with no "nearest older neighbor" found (with the used settings). The last two columns show the mean number of items found per second and its inverse: the mean time in seconds it took to find 1000 items.
The following graphic (done in QGIS, by the way) shows the result for all but the last run. The number in the label corresponds to the no value (first column) of the table below, where you can see the details for each entry. The size of the symbols corresponds to the value of the max_distance setting (not linear!), the color corresponds to the number of max_neighbors: the darker, the higher the value for maximum neighbors to consider:
no  max_dist  max_neighbors   sec  min:sec   found  found %  not found  not found %  found/s  s/1000 found
 1       100             50  1619    26:59  304395    90.36      32461         9.64    188.0          5.32
 2       100            100  1676    27:56  306313    90.93      30543         9.07    182.8          5.47
 3       100            200  1727    28:47  306413    90.96      30443         9.04    177.4          5.64
 4       125             50  1722    28:42  311261    92.40      25595         7.60    180.8          5.53
 5       150             50  1734    28:54  314404    93.33      22452         6.67    181.3          5.52
 6       175             50  1752    29:12  316167    93.86      20689         6.14    180.5          5.54
 7       250             50  1753    29:13  319179    94.75      17677         5.25    182.1          5.49
 8       500             50  1772    29:32  322258    95.67      14598         4.33    181.9          5.50
 9      1000             50  1820    30:20  322906    95.86      13950         4.14    177.4          5.64
10       175            100  1929    32:09  322563    95.76      14293         4.24    167.2          5.98
11       200            100  1953    32:33  324417    96.31      12439         3.69    166.1          6.02
12       250            100  1986    33:06  326600    96.96      10256         3.04    164.5          6.08
13       500            100  2001    33:21  330164    98.01       6692         1.99    165.0          6.06
14      5000            100  2035    33:55  331107    98.29       5749         1.71    162.7          6.15
15      1000            100  2049    34:09  331055    98.30       5801         1.70    161.6          6.19
16      1000            200  2246    37:26  334165    99.20       2691         0.80    148.8          6.72
17      2000            200  2254    37:34  334261    99.23       2595         0.77    148.3          6.74
18      1000            500  2557    42:37  335686    99.65       1170         0.35    131.3          7.62
19      5000            500  2611    43:31  335888    99.71        968         0.29    128.6          7.77
20     10000           1000  2853    47:33  336379    99.86        477         0.14    117.9          8.48
21     50000           1000  2889    48:09  336379    99.86        477         0.14    116.4          8.59
22     10000           2000  3289    54:49  336615    99.93        241         0.07    102.3          9.77
Identifying the most efficient setting
So which one is the most efficient setting? It should lie as far to the upper left as possible (minimum time, maximum % of items found). Connecting points 1 to 10, you can see an S-shaped curve. Points 1 and 8 seem to be the most performant, as all the others lie to the right of the line connecting 1 and 8. Point 1 has the best overall found/s ratio, but setting 8 finds substantially more points (5 percentage points more) with not that much more calculation time: it increases by only two and a half minutes (9%).
Points 9 and 10 do not substantially improve the result; on the contrary, they just take longer. Points 12 to 21 (including 22, which is not on the graphic) describe a kind of asymptotic curve, so the effort to find additional results increases disproportionately. These points therefore can't be considered efficient solutions.
The winner and further iteration
So the most efficient solution is the one numbered 8 here: it finds a nearest older neighbor for 95.67% of the 336.856 features in just under half an hour (29:32 min). Of all the tested settings, it has the best time/found ratio.
First run: see the settings for test no. 8 in the table above.
Second run: for the remaining 14.598 "not found" features, I ran the script a second time with these settings/results: max. no. of nearest neighbors: 1000; max. distance: 100.000 (100 km); duration: 194 sec; result: a nearest neighbor found for all but 21 of the 14.598 features.
Third run: for the last 21 features, I ran the script a third time with the same settings as for the second run: max. no. of nearest neighbors: 1000; max. distance: 100.000 (100 km); duration: 0.12 sec; result: a nearest neighbor found for all but 2 of the 21 features.
So after this third run, all features that can have a match were matched. The remaining two are the two oldest buildings of the data set, so for them, no "nearest older" exists.
Conclusion
Thus, the calculation for the whole data set in three runs took 1772 + 194 + 0.12 ≈ 1966 sec, or 32:46 min, or a bit more than half an hour. For the whole original data set of 336.856 points, that means on average 171.3 features calculated per second, or 0.0058 sec for the calculation of one feature.
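The per-feature figures follow directly from the totals (a simple arithmetic check):

```python
# Verify the quoted throughput figures from the three run durations.
total_features = 336_856
total_seconds = 1772 + 194 + 0.12                # the three runs
print(round(total_features / total_seconds, 1))  # -> 171.3 features per second
print(round(total_seconds / total_features, 4))  # -> 0.0058 s per feature
```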
In my context, this is quite a good result. That's why I think the answer is well worth its bounty.
Answered by Babel on June 13, 2021