Address Geocoding and How It Works
Address geocoding is the process of obtaining geographical co-ordinates (latitude and longitude) for addresses and place names so that they can be mapped or subjected to location analysis. It works primarily by comparing the component elements of an address against a database of geometric points, lines and polygons that model the geography of the Earth, and identifying which of these geometries have descriptive properties that best match those of the input address. A set of co-ordinates is then returned that represents the system’s best estimate of the actual location of the address. Geocoding is offered as a paid or free service by a number of companies and not-for-profit organizations. It is an efficient tool for placing address locations on a map, but it has limitations in accuracy and precision.
This article provides a definition and brief overview of the process of geocoding. The expected audience is people who have never encountered the term before, or who have a casual familiarity with it and are seeking a more formal definition. The scope is limited to geocoding from addresses; other forms of geocoding exist, but Your Map Data at the time of writing only uses addresses as inputs. From this article you will learn:
- A short definition for address geocoding, already given in the introductory paragraph.
- An overview of the geocoding process by demonstrating one common implementation.
- The level of accuracy and precision that can be expected, and why.
- Some best practices for preparing your data.
How Geocoding Works
The general approach is to break up an address into its component parts such as street name, city and state and compare these against the attributes of a digital geographic dataset. Where the best match occurs it is concluded that the same real world location is being described by both the address and the dataset and that therefore the known dataset co-ordinates apply to the address as well.
Most geocoding is done through the use of computer software known as a Geographic Information System (GIS). Although there are many formats that geographic data can be stored in, the most common use connected points to form lines and polygons. The connections are not random – a direction is explicitly defined so that spatial relationships can be understood. For example if a closed loop of connected points is defined to create a polygon and the direction of the loop is clockwise, it can be further deduced that any points lying to the right of the line segments are within the polygon and those to the left are outside. This property of line direction is taken advantage of in geocoding processes, as the example below illustrates.
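The direction property described above can be checked with a little arithmetic. The following is a minimal sketch only, not how any particular GIS implements it: the function names `signed_area` and `is_right_of` are invented for this illustration, and it assumes flat x/y co-ordinates with the y axis pointing up.

```python
def signed_area(ring):
    """Shoelace formula: negative for a clockwise ring, positive for counter-clockwise."""
    total = 0.0
    # Pair each vertex with the next one, wrapping around to close the loop.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def is_right_of(start, end, point):
    """True if point lies to the right of the directed segment start -> end."""
    cross = ((end[0] - start[0]) * (point[1] - start[1])
             - (end[1] - start[1]) * (point[0] - start[0]))
    return cross < 0


# A unit square whose vertices are listed in clockwise order.
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
is_clockwise = signed_area(square) < 0

# For a convex clockwise shape, an interior point is to the right of every edge.
centre_is_right = all(
    is_right_of(a, b, (0.5, 0.5))
    for a, b in zip(square, square[1:] + square[:1])
)
```

The "right of every edge" check is only valid for convex shapes such as this square; polygons with indentations need a more general point-in-polygon test.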
Estimating Co-ordinates from a Line Property
Because the geographic dataset is often in the form of lines or polygon areas rather than points, some approximation has to occur to get a single pair of co-ordinates. For example, a single line segment (representing the highest resolution object in the dataset) may be associated with all the street numbers from 1 to 10. If the street number is known to be 3, the co-ordinates are estimated to be those that lie three-tenths of the way along the line segment from its starting point (and because of the property of direction, this starting point is explicitly defined).
To accommodate variations in spelling, incomplete address information or minor typing errors, the match does not have to be exact. Administrators operating the geocoding service usually set a tolerance for how close a match has to be to return a result; for example, 75% may be good enough. More complex matching algorithms may fall back to progressively coarser levels of granularity. For instance, if the address cannot be mapped to an exact street number but the postal code matches, the result returned might be the center co-ordinates of the region defining the postal code.
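The estimate described above amounts to linear interpolation along the segment. The sketch below is illustrative: the function name `interpolate_address` and its parameters are invented here, and it follows the same proportional rule as the example (number 3 in a range of 1 to 10 lands three-tenths of the way along). Real geocoders may use a different formula.

```python
def interpolate_address(start, end, first_number, last_number, street_number):
    """Estimate a point on the segment start -> end for a given street number.

    start and end are (x, y) pairs; the segment is assumed to carry the
    street numbers first_number..last_number in order of its direction.
    """
    if not first_number <= street_number <= last_number:
        raise ValueError("street number is outside the range of this segment")
    count = last_number - first_number + 1      # how many numbers the segment covers
    steps = street_number - first_number + 1    # position of this number in the range
    x = start[0] + (end[0] - start[0]) * steps / count
    y = start[1] + (end[1] - start[1]) * steps / count
    return (x, y)


# Street number 3 on a segment covering numbers 1 to 10.
estimate = interpolate_address((0.0, 0.0), (100.0, 0.0), 1, 10, 3)
```

With this rule the last number in the range sits at the end of the segment and the first number sits one step in from the start.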
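A tolerance-based match with a fallback to a coarser level might look like the sketch below. It is an illustration under stated assumptions, not the algorithm of any specific geocoder: the similarity score comes from Python's standard `difflib.SequenceMatcher`, the 0.75 threshold mirrors the 75% example, and the function names and lookup tables are hypothetical.

```python
from difflib import SequenceMatcher


def best_street_match(candidate, known_streets, tolerance=0.75):
    """Return (name, score) for the closest known street, or (None, score) if below tolerance."""
    best_name, best_score = None, 0.0
    for name in known_streets:
        score = SequenceMatcher(None, candidate.lower(), name.lower()).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= tolerance:
        return best_name, best_score
    return None, best_score


def geocode_with_fallback(street, postal_code, street_points, postal_centres):
    """Try a street-level match first, then fall back to the postal code centre."""
    name, _ = best_street_match(street, list(street_points))
    if name is not None:
        return street_points[name], "street"
    if postal_code in postal_centres:
        return postal_centres[postal_code], "postal code"
    return None, "no match"


# Hypothetical lookup tables standing in for the geographic dataset.
street_points = {"Pennsylvania Ave": (38.8977, -77.0365)}
postal_centres = {"20500": (38.8987, -77.0362)}

# A misspelled street still matches; an unrecognizable one falls back.
typo_result = geocode_with_fallback("Pensylvania Ave", "20500", street_points, postal_centres)
fallback_result = geocode_with_fallback("Zzzz Qqqq", "20500", street_points, postal_centres)
```

The second value in each result records which level the co-ordinates came from, so the caller can tell a street-level match from a postal code approximation.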
Geocoding Service Providers
The investment required to run an enterprise-scale GIS and maintain an up-to-date database is significant, so many application providers do not make it. Instead they subscribe to a mapping service such as Google Maps, which provides both background maps and analytical services including geocoding. Your Map Data at the time of writing uses MapQuest as its provider.
Requests for geocoding are made through what is known as an Application Programming Interface (API). The API exposes part of MapQuest’s database to its customers, who can then query it to geocode addresses, get travel routes and load street maps.
The MapQuest API needs to receive the address information in the form of input parameters. The requestor can either send the whole address as a single parameter or send up to five separate parameter values, each representing a specific address component. The advantage of the first method is that addresses do not have to be parsed into their separate elements before processing. The advantage of the second method is that data formats that already have address elements separated are more easily accommodated, and the returned co-ordinates tend to be more accurate (because there is no ambiguity about whether a particular value represents, for example, a country, province or city).
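The two ways of sending an address can be sketched as query strings built with Python's standard `urllib.parse.urlencode`. This is an illustration only and sends nothing over the network. The endpoint URL and the `key`, `location` and `country` parameter names are assumptions to be checked against the provider's current documentation; `street`, `city`, `state` and `postalCode` follow the query example given later in this article. The helper function names are invented for this sketch.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the provider's documentation before use.
BASE_URL = "https://www.mapquestapi.com/geocoding/v1/address"


def single_line_request(api_key, address):
    """Build a request URL that sends the whole address as one parameter."""
    return BASE_URL + "?" + urlencode({"key": api_key, "location": address})


def component_request(api_key, street=None, city=None, state=None,
                      postal_code=None, country=None):
    """Build a request URL that sends up to five separate address components."""
    params = {
        "key": api_key,
        "street": street,
        "city": city,
        "state": state,
        "postalCode": postal_code,
        "country": country,
    }
    # Leave out any component that was not supplied.
    params = {name: value for name, value in params.items() if value is not None}
    return BASE_URL + "?" + urlencode(params)


whole = single_line_request("YOUR_KEY", "1600 Pennsylvania Ave NW, Washington, DC 20500")
parts = component_request("YOUR_KEY", street="1600 Pennsylvania Ave NW",
                          city="Washington", state="DC", postal_code="20500")
```

`urlencode` replaces spaces with `+` and escapes commas, so the values do not need to be encoded by hand.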
Regardless of how the addresses are sent the API will return three things to the requestor:
- The co-ordinates in decimal degrees latitude and longitude (e.g. 43.12345, -79.12345 – or possibly nothing if there is an error).
- A success code, which indicates one of:
  - Success: a pair of co-ordinates could be associated with the address.
  - Input error: the parameter(s) sent could not be interpreted as an address.
  - Key related error: part of the parameter could be interpreted as an address but one specific component could not; for instance, the street name did not match closely enough with any street known to fall within the larger geographic area. In this case you would still receive a latitude and longitude, but it would be expected to be less accurate since the co-ordinates would relate to a higher geographic level.
  - Unknown error: any other type of error preventing a geocoding operation. For instance, if the MapQuest server happened to be overloaded with requests at that particular time, you might see this error.
- A quality code. This is an estimate of how accurate the returned results are. It is further broken down into two parts:
  - Granularity: the geographic level the co-ordinates are based on. For instance, it may be based on the postal code (because no street by the name specified could be matched). The finest granularity occurs if a street number and name can be matched to the database.
  - Confidence: the degree of certainty that the co-ordinates given correspond to the real world location for the identified level of granularity. For instance, the geocoding may only be able to resolve to the level of a postal code, but there may be high confidence that it is the correct postal code match.
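One way an application might act on these three return values is sketched below. Everything here is hypothetical: the `GeocodeResult` class, its field names and the string labels are invented for this illustration and are not the provider's actual response format, which should be taken from its documentation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GeocodeResult:
    """Hypothetical container for the three things a geocoder returns."""
    status: str                   # e.g. "success", "input_error", "unknown_error"
    latitude: Optional[float]
    longitude: Optional[float]
    granularity: str              # e.g. "address", "street", "postal_code"
    confidence: str               # e.g. "high", "medium", "low"


def usable_point(result, acceptable=("address", "street")) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) only when the result is precise enough to map."""
    if result.status != "success":
        return None
    if result.latitude is None or result.longitude is None:
        return None
    if result.granularity not in acceptable or result.confidence == "low":
        return None
    return (result.latitude, result.longitude)


precise = GeocodeResult("success", 43.12345, -79.12345, "address", "high")
coarse = GeocodeResult("success", 43.1, -79.1, "postal_code", "high")

precise_point = usable_point(precise)
coarse_point = usable_point(coarse)
```

Which levels of granularity count as acceptable is a choice for the application: a dot map of a whole country may tolerate postal code matches, while delivery routing probably would not.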
Data Preparation Best Practices
There are a number of data preparation practices that can be helpful in obtaining the best results from a geocoding operation. Some of these are dependent on the range of options a particular geocoding service provides.
- If possible, separate complete addresses into individual address elements, then configure your geocoding request so that each element is sent as its own parameter. This ensures that there is no ambiguity about what part of an address each value represents.
Example of part of a MapQuest Geocoding Query Parsed by Individual Address Elements
street: '1600+Pennsylvania+Ave+NW', city: 'Washington', state: 'DC', postalCode: '20500'
- Include as many address components as are known, so that the geocoder has as much information to work with as possible. Including the country, for example, distinguishes London, United Kingdom from London, Canada.
- If you do send your geocoding request in the form of a complete address, ensure you follow the syntax rules imposed by the geocoder. In general address elements are expected in a particular order, may need to be enclosed in quotes and are usually separated by commas.
- Take advantage of optional geocoding parameters. Many geocoders allow you to restrict your queries to a particular geographic extent by entering a country code or a bounding box defined by maximum and minimum latitudes and longitudes. This can filter out invalid results. Another common option is specifying language; this flags whether place names are using English or localized spellings, and may distinguish whether Latin or other alphabets are being used.
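Where a geocoder does not offer a bounding box option, a similar filter can be applied to the results afterwards. The sketch below is illustrative: the function name is invented, the box is only a rough rectangle around Great Britain, the city co-ordinates are approximate, and the check assumes the box does not cross the 180° meridian.

```python
def in_bounding_box(latitude, longitude, south, west, north, east):
    """True if the point falls inside the box defined by its min/max latitude and longitude."""
    return south <= latitude <= north and west <= longitude <= east


# Approximate candidate locations for the ambiguous place name "London".
candidates = {
    "London, United Kingdom": (51.5074, -0.1278),
    "London, Canada": (42.9849, -81.2453),
}

# Rough rectangle around Great Britain (south, west, north, east), for illustration only.
britain = (49.9, -8.2, 60.9, 1.8)

kept = [
    name for name, (lat, lng) in candidates.items()
    if in_bounding_box(lat, lng, *britain)
]
```

Here only the United Kingdom candidate survives the filter, which is the same effect as restricting the query itself to that extent.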
Limits of Geocoding
Address geocoding is a relatively easy way to acquire geographic co-ordinates and ultimately map locations that can be associated with an address. It does have limitations, including:
- You can’t map anything that doesn’t have an address. At best you can associate it with a high-level geographic area, such as a state or country.
- You are limited to representing a real world location as a single point on a map. If you need to map the actual area occupied by an object (e.g. the boundaries of a property or the footprint of a building) then a single pair of co-ordinates is insufficient.
- The geocoding service provider must have the address in their database. Thus newly developed areas may not be recognized yet. Similarly if streets are renamed or residences are renumbered there may be a discrepancy in the results that are returned or an outright error.
- The different address formats across jurisdictions need to be accommodated. Address practices in Europe can be different from those in North America, and those on other continents even more so. Errors can occur if the geocoding service provider is not properly set up to recognize the differences, or if a requestor provides address information in a format not appropriate for the target region.
- Since an address actually resolves to a property, the co-ordinates returned only approximate where the entrance to the property is. For residential houses and small businesses this is not an issue, since the entrance is usually obvious once you arrive at the location. But for properties covering a large extent (such as an industrial complex or military base) the actual entrance might be located far from the point co-ordinates provided.