Eight Steps for Enabling Location Analytics for Business Data
Part of the mission for my company is to provide the tools for non-experts to create professional quality geographic analysis products from their own data. Many business people and other organizational members can recognize the value of representing data with a location aspect on a map but have little knowledge on how to create the link between data and geography.
If you want to enable location analytics for business data your records need a reference to the physical place on Earth that they relate to. In general this is done by making business data an attribute of a geographic data set. What I intend to discuss in this article are the high-level steps required to achieve this.
1. Articulate the question and the information product sought.
The first step is to clearly define the question you are trying to answer. This will identify what business data you need to extract, any data preparation that might be necessary (such as field calculations, parsing or generalization) and an idea of the type of geographic data set that might be most suitable for your needs.
You should also consider what the analysis output should be – which does not necessarily mean a map. If you are interested in estimated travel times, for instance, a table with time in hours may be more intuitive. Similarly charts are superior for identifying trends and cycles. Note that the output format further informs your of what is required from the two data sets – travel time estimates require velocity information as well as distance; charts require fields that define time slices or categorize data.
Example – Show Canadian Customers Down to Sub-municipal Level
I have customers in Canada and in other parts of the English-speaking world. I think I can gain some insights for expanding my Canadian customer base by viewing existing customers as presented on a map. Ideally the map should segregate the distribution to the sub-municipal level, at least for large metropolitan areas.
2. Isolate the business records of interest.
Your data lives somewhere – in a spreadsheet, a database or invoicing system. You need to find the information relevant to the analysis question posed, and this will probably be only a small subset of everything you have. Moreover some parts of the data may need to be removed – to protect privacy or for security reasons.
Usually this is done through some kind of query that returns only information that matches certain criteria, and filtering may be applied as part of the query or implemented after the data arrives.
Example – Extract Canadian Customers from a Contacts List
From the definition of the question and selection of output format it can be inferred which business records need to be extracted. At a minimum each business needs to be uniquely identified and added to the list only once.
Continuing with the example above, Canadian customers need to be segregated from other countries. From a customer contact list a spreadsheet is created. Records are filtered to show only Canadian companies based on the country field. (NOTE: The spreadsheet shown below does not list any real companies, person names or contact information and consists of artificially created entities. Address fields other then street address are however valid).
Note that this table is structured such that each address component – street name and number, city, postal code, province/state – is in a separate field. These give us some candidates to use to link to a geographic data set.
3. Find a complimentary geographic data set.
Most digital geographic map data is in the form of themed layers. This means that a single data set will consist of a collection of geographic features of the same type (e.g. rivers, roads or municipal boundaries). Each feature will have geometric information that defines the space it occupies on the Earth’s surface and attribute information which describes its characteristics.
There are many factors to consider when obtaining a geographic data set. An incomplete list includes:
- Layer definition – the criteria that a feature has to meet to be included in the layer.
- Geometry type – most geographic data formats impose a limitation that all features must be points, lines or polygons. Each type is more or less suitable for specific types of analysis.
- Data format – there are many. Shapefiles (.shp) are among the oldest and a common exchange format. GeoJSON is common in web mapping and Google Maps uses KML. The most important deciding factor is whether you have tools that can accept the format or not.
- Cost – there are many free sources of data including from government agencies. Commercial data can also be purchased.
- Data use agreement – this can limit what you can do with the data. Many sources require (or at least request) attribution. Usage may be limited to non-commercial applications or you may be prevented from sharing outside of your own organization.
- Vintage – when was the data last updated ?
- Accuracy – what is the estimated accuracy, and what is good enough for your purposes ? Is this even known or can it be inferred ?
- Geographic extent – how much and which parts of the Earth’s surface is covered ? Often this will be described in terms of the largest encompassing geographic unit that falls entirely in the data set – such as “the continental United States” or “North America”. A more technical expression of this is bounding co-ordinates – the latitude and longitude pairs that describe the most bottom-left and top-right points that fall within the scope of the map.
Example – Postal Code Areas as an Approximation for Sub-municipal Levels
Applying this to our ongoing example, I already have a layer from a previous project that defines the forward sortation areas (FSA’s) for the Canadian postal system. These correspond directly to the postal codes I have in my contacts list (the first 3 characters represent the FSA – the last 3 characters are the local delivery unit).
4. Identify or create common fields to link business and geographic data.
To link business data with a geographic layer they must have a common field (or attribute, in the case of some geographic data formats). A common field is one where the value means the same thing in the both data sets. For example a list of property tax payers may include an id field that uniquely identifies each property they own. The land parcels geographic layer that the municipal land titles office maintains may have a field with exactly the same id, although it might not have the same name.
Even if there are two good candidate fields some impediments may need to be overcome. These can include:
- Different character encoding schemes.
- Different languages – for example one field uses English while the corresponding field uses a local language, with different spellings and possibly a different alphabet.
- Errors and omissions – in either or both data sets.
Example – Create a Common Field in Extracted Canadian Contacts List
Although the two data sets in my example scenario both have postal code fields they are not quite the same thing. My contacts list has the full six character postal code; my geographic data only has the higher level FSA. If I try to form a link between these two fields as they exist now there will be no matches. In this particular case there is a fairly straight-forward solution. Since the FSA is simply the first 3 characters of the complete postal code, all I need to do is truncate the values I already have in the contacts list and the two fields will have a common range of codes. This is simple in Excel which has a wide range of text manipulation functions; I add a new column, name it postal_fsa, and use a LEFT text function to define a formula that converts the existing postal codes to just the FSA portion.
5. Transform business data to an acceptable exchange format.
Linking data sets is usually done on the software platform that creates the maps or other analysis output. This means the business records need to be provided in a form the software can understand. If you are fortunate the format you already have is acceptable, otherwise you will have to export it.
Example – Export Extracted Canadian Contacts as CSV
In my example I will be using QGIS to do the analysis and create some simple maps. This open source mapping software does not natively support spreadsheets but can load text-delimited files, which is one of the export options for Excel. I therefore save my contacts list as comma separated values (CSV – file extension .csv), a particular type of text-delimited files that that separates values using commas.
6. Load both data sets into a mapping tool of choice.
Regardless of what software tool you use there are two common preliminary steps before you actually start to manipulate any data. These are:
i. Create a Workspace
A data analysis best practice to put all your information and outputs in a common location. This normally takes the form of a specific file folder or a segregated part of a database. Aside from organizational advantages putting data and supporting resources in a predictable location also makes automation easier.
ii. Create a New Instance
Just like creating a new Word document when a new report is begun it is necessary to create a new instance in your analysis software. Generally this takes the form of creating a new project, file or database object. At a minimum you will have to provide this instance with a name; other optional or mandatory configuration may be required, including identifying the location of the workspace.
Example – Create a QGIS Project and Load Data
In QGIS you create a project that defines the working environment and load your data into the project. My FSA postal layer is already in a format QGIS understands and therefore takes no special effort to load. To add the business data, now in CSV format, is a two step process. I browse to and open the file in the normal manner; at this point I need to supply some additional information.
- Confirm the file format as CSV.
- Identify the first row as defining column headers (field names).
If you upload a data file this way that consists only of records (the first row is also the first record) QGIS will simply name the fields “Field1”, “Field2” etc.. In general it is preferable to include meaningful custom field names with your imported data so that it is easier to identify which field you need to reference in queries and data visualization configurations.
- Indicate there is no geometry field associated with this data. This means that none of the fields represent co-ordinates such as latitude and longitude (this is why I have to link the business data to another data set that does have such information).
7. Link business data to geographic data.
An important consideration in linking two data sets is the relationship between the matching records. What this means is the number of records in data set A that can be expected to have matching records in data set B (defined by having the same value in their common field). The two most common scenarios are:
- 1:1 – One-to-one, where for every record in data set A there is exactly one corresponding record in data set B.
- 1:M – One-to-many, where for every record in data set A there can be multiple records in data set B.
Strictly speaking the ratios described above should include the possibility of no matching records on either side – in which case the one-to-one ratio might be more precisely written as 0-1:0-1 and one-to-many as 0-1:0-M.
A third possibility is a many-to-many relationship. Resolving such scenarios require a more complex data modelling approach and are outside the scope of this article.
Example – Define a Relation in the QGIS Project
Existing familiarity with my business data informs me that I have a one-to-many relationship between the FSA postal layer and customer contact list. I can also expect that in many cases there will be no matches.
In QGIS one-to-many relationships are defined at the project level through a relation. The relation has a parent (or referenced) layer which is the 1 (one) side of the relationship and a child (or referencing) layer which is the M (many).
In project properties I add a relation, give it a name, identifying the two participating data sets and their common fields. This is all I have to do to link the data sets.
8. Do the analysis.
Depending on what analysis you want to do, linking the data sets through their common fields may be sufficient to start running algorithms and generating outputs, or it might represent just a preliminary step to more data manipulation. The important fact is that your business data has now acquired a geographic reference and can be subject to a wide range of location analysis or displayed on a map.
Example – Display Canadian Market Penetration on a Map
Now that I have my own business data linked to a geographic layer there’s actually a lot I can do with it. To start with I can query the layer to show all areas (meaning FSA postal zones) that have at least one customer in them.
An alternative way of exploring the data is to select individual areas and view their properties; because my business data has now become part of the layer’s attributes any customers and their contact info will also be listed.