QC Adventures in QGIS
Last week my goal was to prototype one base feature layer and use the lessons learned to define a standard process for all other layers. Part of this task was to define quality control standards and identify candidate tools for their enforcement. This sub-task ended up consuming most of my time but was a valuable lesson in employing the tools available in QGIS. This post summarizes what I learned and evaluates the tools used.
What Drove Me To This Madness
One of my medium-to-long-term objectives is to create a series of standard layers which could act as the starting point for value-added datasets as well as supply base map features. Because I am drawing from a variety of sources, each using different data acquisition techniques, storage formats, specifications and standards, there will inevitably be inconsistencies between layers that need to be reconciled. For example, Natural Earth (http://www.naturalearthdata.com/) is the go-to resource for downloads if you need to display boundaries, coastlines and major water bodies at continental scales. This site also supplies administrative boundaries immediately below the national level (i.e. provinces or states) – but not anything as small as counties or municipalities. For that you typically have to go to a government sponsored website that distributes its own geographic data – or buy commercial data – or extract the layer from OpenStreetMap or another free provider. You now have two or more layers (e.g. Canadian Provinces from Natural Earth and Canadian Municipal Boundaries from government sources) that theoretically nest inside each other but probably won’t.
This imposes two tasks if you want layers to display correctly when superimposed and avoid introducing errors when engaging in some forms of analysis.
- Find the errors, if any, in your data.
- Fix the errors found – preferentially through automatic means.
A Short Definition of Topology as Used in GIS
Topology, within the context of geographic digital datasets, is the set of rules that related layers must follow if they are to be considered valid. Or to quote from the QGIS manual:
“Topology describes the relationships between points, lines and polygons that represent the features of a geographic region.”
For example, the boundaries of a Provinces layer must be entirely within the area of a Country boundaries layer; it may be permissible (and in some cases expected) that segments of the provincial boundaries be the same as the national boundaries but they should never cross.
Topology rules are human defined but can be enforced either by running verification algorithms and fixing deviations, or by storing the data in formats that disallow error conditions. More sophisticated data models may allow deviations within certain ranges or allow what otherwise might be considered errors to be flagged as exceptions to the rules.
Aside from ensuring that overlaid maps display correctly, topology prevents analysis errors from occurring. For example, assume you have a point layer that represents customers in Canada and a Provinces layer showing the boundaries of all provinces. You could implement what is called a spatial query – select all the customer points that fall within the Provinces polygon representing Ontario – and get a subset of customers living in Ontario. However, if a few points fall outside the boundaries of the Provinces layer they may be missed in the selection. This could be avoided by imposing a rule that all customer points need to fall within the boundaries of Provinces – if they don’t, an error is raised and corrective action needs to be taken.
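The customer example can be sketched in plain Python, without QGIS. This is only an illustration of the rule, not how QGIS implements spatial queries: the province rectangles, the customer ids and the function names are all made up for the example, and the point-in-polygon test is a basic ray-casting check on rings stored without a repeated closing vertex.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Ring = List[Point]


def point_in_ring(pt: Point, ring: Ring) -> bool:
    """Ray-casting test: True if pt lies strictly inside the ring."""
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt matter.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_provinces(customers: Dict[str, Point], provinces: Dict[str, Ring]):
    """Return (customer -> province, list of customers inside no province)."""
    assigned = {}
    orphans = []
    for cid, pt in customers.items():
        match = next(
            (name for name, ring in provinces.items() if point_in_ring(pt, ring)),
            None,
        )
        if match is None:
            orphans.append(cid)  # violates "every point must fall within Provinces"
        else:
            assigned[cid] = match
    return assigned, orphans


# Hypothetical toy data: two rectangular "provinces" and three customers.
provinces = {
    "West": [(0, 0), (5, 0), (5, 5), (0, 5)],
    "East": [(5, 0), (10, 0), (10, 5), (5, 5)],
}
customers = {"c1": (1, 1), "c2": (7, 2), "c3": (11, 3)}

assigned, orphans = assign_provinces(customers, provinces)
print(assigned)
print(orphans)
```

The orphans list is the useful part: a plain spatial query would silently drop those customers, whereas the rule turns them into an explicit list of errors to fix.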
Geometry Errors and their Consequences
Topology describes the valid relationships between layers. Within a layer itself, errors in geometry can also cause the layer to display incorrectly and impact analysis outputs or whether an algorithm can be run at all. An example of a geometry error is a polygon ring that does not close. A layer that defines areas on the ground – for example land parcels – is usually modelled as a collection of polygons, where a polygon is a series of points, connected by lines, where the last point connected is also the first point. If not all the points are connected, there is no polygon (and no area can be calculated, and perimeter length will be wrong … etc.).
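To make the unclosed ring example concrete, here is a small plain-Python sketch. It assumes the common convention that a ring repeats its first point as its last point; the parcel coordinates and function names are invented for illustration, and the area calculation is the standard shoelace formula rather than anything specific to QGIS.

```python
def is_closed(ring):
    """A ring is closed if it has at least 4 points and ends where it starts."""
    return len(ring) >= 4 and ring[0] == ring[-1]


def ring_area(ring):
    """Planar area of a closed ring using the shoelace formula."""
    if not is_closed(ring):
        raise ValueError("ring is not closed, so it encloses no area")
    twice_area = 0.0
    # Sum the cross products of consecutive vertex pairs.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


# Hypothetical land parcels: a 4 x 3 rectangle, with and without the closing point.
closed_parcel = [(0, 0), (4, 0), (4, 3), (0, 3), (0, 0)]
open_parcel = [(0, 0), (4, 0), (4, 3), (0, 3)]

print(is_closed(closed_parcel), ring_area(closed_parcel))
print(is_closed(open_parcel))
```

Calling ring_area on open_parcel raises an error instead of returning a number, which mirrors the point above: until the geometry is repaired there is no polygon to measure.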
Generally speaking you need to correct geometry errors in the individual layers first before topology rules can be applied. This is why a cleaning operation (v.clean or v.clean.advanced) is one of the early tasks in creating a new dataset – most of the geometry errors will be automatically fixed, and ambiguous cases flagged for review.
QGIS Data Quality Tools Investigated
The following are the tools I tried out. The list is not exhaustive – QGIS has many plugins, alternative algorithms and third party integrations that can do the same jobs in similar or very different ways. What is described below represents the most commonly used (and arguably, best practice) toolset.
Check Validity
|API exposed to python scripting.|
|Limited range of geometry errors checked for.|
|Replaced by newer tools.|
This in fact was not the first tool tested but came on the radar after I decided to broaden the investigation to find options that did not have the limitations of some of the more standard tools. Check Validity is one of the older tools in the QGIS quality control arsenal and as such frequently appears in Google searches. To a certain extent it is the precursor of Geometry Checker and has largely been replaced by it.
Check Validity is available from the Processing Toolbox > QGIS Algorithms > Vector Geometry Tools. Run from the GUI the following parameters are requested:
- Input Layer – The layer that is being validated.
- Method – Essentially this copies parameters from those already set in another processing environment; options are digitizer settings, QGIS and GEOS.
- Output Layers – Three layers are created; a Valid layer containing all features from the Input where no errors are found; an Invalid layer containing all features with an error and an attribute attached describing the error; and an Error point layer with the location of all detected errors.
Essentially the tool works by separating features with a perceived error from those that pass all tests as two output layers – named Invalid and Valid, respectively. A third point layer (named Error) is created where each point represents the position of an error.
You then load the Error layer to identify where the error is. The Invalid layer has an additional attribute added to each feature describing the type of error which informs you what kind of fix is needed.
Documentation, We Don’t Need No Stinkin’ Documentation
One of the most frustrating things about the Check Validity tool is that, despite its age, there is essentially no documentation. All that the QGIS manual has is a brief entry that defines the input parameters and has syntax for python scripting. It doesn’t even tell you what errors are being checked for.
After a lot of Googling and reading a few forum posts and bug reports I decided the only way I could know what the tool actually checks for was to test it empirically. I therefore created a layer from scratch, deliberately introducing some errors (a self-intersection, a duplicate node, a polygon overlap and a duplicate feature). I ran Check Validity with this as the input layer – it found the first two errors but not the others (although strictly speaking the last two represent topology and not geometry errors).
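For readers who want a feel for what the two detected checks involve, here is a rough plain-Python sketch. It is not the code Check Validity runs (that is exactly what is undocumented); it is a textbook orientation test for crossing segments plus a scan for consecutive repeated vertices, with invented function names and test rings. Like the tool, it only looks inside a single ring, which is why overlaps and duplicate features between separate polygons are out of its reach.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1 or 0."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def segments_cross(p1, p2, p3, p4):
    """True if segments p1-p2 and p3-p4 cross at a point interior to both."""
    return (
        _orient(p1, p2, p3) * _orient(p1, p2, p4) < 0
        and _orient(p3, p4, p1) * _orient(p3, p4, p2) < 0
    )


def find_duplicate_nodes(ring):
    """Indices of vertices that repeat the vertex immediately before them."""
    return [i for i in range(1, len(ring)) if ring[i] == ring[i - 1]]


def find_self_intersections(ring):
    """Pairs of edge indices (i, j) whose segments properly cross."""
    edges = list(zip(ring, ring[1:]))
    hits = []
    for i in range(len(edges)):
        # Start at i + 2: neighbouring edges share a vertex by construction.
        for j in range(i + 2, len(edges)):
            if segments_cross(*edges[i], *edges[j]):
                hits.append((i, j))
    return hits


# Hypothetical test rings: a "bowtie" and a triangle with a repeated vertex.
bowtie = [(0, 0), (2, 2), (2, 0), (0, 2), (0, 0)]
with_duplicate = [(0, 0), (4, 0), (4, 0), (4, 3), (0, 0)]

print(find_self_intersections(bowtie))
print(find_duplicate_nodes(with_duplicate))
```

The pairwise edge comparison is quadratic in the number of vertices, so it is only suitable for toy rings; production tools typically rely on spatial indexes or sweep-line methods to cope with large layers.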
GRASS Clean Algorithms
|API exposed to python scripting.|
|Eliminates most potential errors automatically.|
|Can run tools individually or in sequence.|
|Precise and complete documentation.|
|Potential to introduce new errors if tools run in wrong order or tolerance values inappropriate.|
QGIS includes a processing framework that allows algorithms from other GIS software libraries to be called from within QGIS. Among these is GRASS (Geographic Resources Analysis Support System), which normally works in a command line environment. The framework allows GRASS algorithms to be run via a user-friendly graphical interface and have parameters supplied from the QGIS project.
The v.clean algorithm is a suite of tools that not only finds errors but automatically tries to correct them. The user runs the algorithm, supplying the tool name as one of the parameters. Additional parameters typically include the input layer, the feature geometry type and a name for the output.
V.clean or v.clean.advanced should be the first tool you run upon receipt of a new dataset. This is not so much to find errors as to eliminate the low hanging fruit in one pass.
V.clean.advanced is mostly the same as v.clean. The only difference is that multiple tools can be chained together to run in sequence. You may do this to get rid of multiple types of error in one execution; another reason is that some tools can have the side effect of introducing new errors that have to be cleaned up. Also, certain geometry errors can only be removed after simpler errors have been eliminated. The documentation recommends the sequence of tools to run.
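Chaining can be pictured as an ordered list of (tool, threshold) pairs. The sketch below only assembles the argument list for a GRASS v.clean call; it does not run GRASS. The tool names and the input, output, tool and threshold options are taken from the GRASS v.clean manual as I understand it, so check them against your installed version. The map names, the step order and the helper function are hypothetical.

```python
from typing import List, Sequence, Tuple

# v.clean tool names as listed in the GRASS manual (verify for your version).
KNOWN_TOOLS = {
    "break", "snap", "rmdangle", "chdangle", "rmbridge", "chbridge",
    "rmdupl", "rmdac", "bpol", "prune", "rmarea", "rmline", "rmsa",
}


def build_vclean_command(
    input_map: str, output_map: str, steps: Sequence[Tuple[str, float]]
) -> List[str]:
    """Build a v.clean argument list that runs the given tools in order."""
    for name, _ in steps:
        if name not in KNOWN_TOOLS:
            raise ValueError(f"unknown v.clean tool: {name}")
    # Tools and thresholds are comma-separated lists matched by position.
    tools = ",".join(name for name, _ in steps)
    thresholds = ",".join(f"{value:g}" for _, value in steps)
    return [
        "v.clean",
        f"input={input_map}",
        f"output={output_map}",
        f"tool={tools}",
        f"threshold={thresholds}",
    ]


# Hypothetical sequence: snap nearby vertices, break crossing lines,
# then remove the duplicates that breaking can leave behind.
steps = [("snap", 0.01), ("break", 0.0), ("rmdupl", 0.0)]
cmd = build_vclean_command("provinces_raw", "provinces_clean", steps)
print(" ".join(cmd))
```

Keeping the steps in one ordered list makes the sequence explicit and easy to review, which matters because the order and the tolerance values are where new errors tend to creep in.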
Check Geometries
|Checks a large number of the most common potential geometry errors.|
|Use default or custom parameters.|
|Easy to understand, configure and run.|
|Runs in its own process thread allowing user to continue to work on the QGIS project.|
|Very slow (but perhaps unavoidable).|
|Apply automatic fixes at your own risk.|
The dialog for this tool allows you to select up to 14 different types of potential geometry errors to check for in a single layer. After running the checks it either tries to automatically fix any errors in the input layer or creates an output layer that contains all the features where errors were detected. In either case it also produces a report that identifies each error by type and the coordinates where it occurs. In addition the report allows you to:
- Select a subset of errors.
- Export the report (which really means create a shapefile with error features).
- Fix errors automatically (all or selected).
- Fix errors after user selection of resolution method (many types of error can be resolved in more than one way).
- Modify resolution parameters (otherwise defaults will be used).
My main complaint is that it can take a considerable amount of time to run – 40 minutes to hours is what I experienced. This is somewhat inevitable with anything other than a trivially sized layer – for instance I was mostly testing against a polygon layer with roughly 200 000 nodes (points). You can continue other work in the QGIS project while it’s running (although I wouldn’t recommend doing anything with the input file) as long as you shove the dialog box out of the way (it can’t be minimized). There is an abort button – but all that seems to happen is the dialog changes from a progress bar to a message “Waiting for running checks to finish …” – which takes as long as completing the uninterrupted algorithm, and the dialog still doesn’t close.
I do not recommend using the automatic fix option – for whatever reason these take a long time (best case is that the tool is re-running checks and/or doing some automatic cleanup after the fix – but the documentation doesn’t explicitly say). If you’ve already used v.clean the number of errors found should be limited and can be manually repaired quickly (with a recommended v.clean following edits).
Topology Checker
The Topology Checker can confirm whether two identified layers adhere to a specified topology rule. In some cases the rules enforced are not strictly topological and stray into the area of validating geometries (for instance the frustratingly vague Must not have invalid geometries). This is probably because the Topology Checker predates the Check Geometries plugin and tries to fill in the gap of previously insufficient geometry validation.
Rules are configured by:
- Opening the plugin.
- Clicking the configuration button.
- Specifying the target layer (the layer the rule will be enforced against).
- Specifying the rule (the selection of rules will depend on whether the layer is a point, line or polygon type).
- Possibly selecting a second layer.
- Possibly setting a tolerance (a proximity distance that flags an error condition).
- Clicking Add Rule.
You can only select from layers that are loaded in the project. After defining the rules you can validate them at any time by clicking Validate and the definitions are saved with the project. Note that validation can take hours for a large, complex layer and unlike Check Geometries you cannot do any further work in the current QGIS session.
Once finished the Topology Checker shows a report of all errors found, identified by type and the id of the feature causing the error. You can zoom to a feature by clicking on its report line. The feature itself will be highlighted although the exact location of the error may not be clear. For instance when I validated a polygon layer it highlighted a specific polygon that had several duplicate points. Unfortunately this particular polygon represented Baffin Island, a very large island in the Canadian arctic with a complex coastline which therefore consisted of 200 000 points. The only way I could figure out the exact points in question was to start an edit session, select the node tool and open the Vertex Editor dialog so I could identify the points from their id.
GRASS Command Line
If you are going to run quality control through interactive procedures rather than scripting, this is probably the best tool to use. This is an unproven assumption on my part based on the argument that any command line application can be expected to process requests faster than a GUI based tool, if for no other reason than the smaller overhead demanded by the interface. My intention had been to run some GRASS equivalents to the tools already investigated to find something that would process more quickly but not require any python scripting. However, an introductory reading of the GRASS manual quickly convinced me that using this software would be a project in its own right and could not be scoped to just learning enough to compare two or three algorithms.
The one take home lesson I retained was that to use the tool, I would have to import any layers I intended to work on into the native GRASS format and then export the results back to shapefiles (or some other more easily exchangeable format). The setup for this as well as learning the tool seemed too big an investment to pursue at the time but remains on my to-do list.