Project Description
The Problem
"addresses were originally created for human-to-human communication within a given context, not for data science. How can we solve addresses and more easily connect our datasets?"
- An address that is human-readable isn't understood by machines
- An address that is machine-readable isn't understood at all by humans
Existing models build a parallel world that mirrors ours, in which addresses are IDs associated with latitudes and longitudes. The user must then fuzzy-search for their address and select a pre-made entity, and when the address is not found the system breaks down. User input is not easily integrated into this 'non-matching' model of the real world. In other words, real-world address data does not match existing geolocation data.
The Solution
FunnelCat bridges the gap between human-readable and machine-readable locations. Any location can quickly be converted to a structured hierarchy that represents how close one address is to another. The more addresses that are fed into the system (regardless of their format), the more accurate the results become.
Because a street address is an entirely human construct, we need to treat it as an entity defined by its closeness to other entities, rather than imposing a narrow pinpoint lat/long model and requiring addresses to match it strictly. Human beings create new addresses all the time; the address comes first, then we distill a place down to a single point. Let's work with the data first as it's generated, rather than reverse engineering our pre-built models to match a string.
After all, an address is more than a point in space; it's a name needed for memorisation and directions. Have you ever given directions and said to start at Latitude: 37.818637° S, Longitude: 144.9637° E, then take a right at 37.8150° S, 144.9665° E, bearing 60 degrees north until you hit Town Hall?
FunnelCat highlights what can be achieved by "going back to basics": simple human-readable addresses are quickly converted to machine-readable addresses, which are then easily categorised and geolocated.
Data Story
Data Format
Addresses are entered simply as strings in the traditional human-readable formats such as:
- 1 Milky Road, Beautiful Town, Victoria, Australia
- 7 Sensational Avenue, Korma Creek, Victoria
They are then converted into their most basic parts:
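One way to picture this conversion is the minimal sketch below. It assumes comma-separated input with an optional leading street number; the function name `split_address`, the lower-casing, and the broadest-first ordering are illustrative choices, not a description of FunnelCat's actual parser.

```python
import re

def split_address(address: str) -> list[str]:
    """Split a comma-separated address into parts, broadest part first."""
    # Separate on commas, collapse extra whitespace, and lower-case each part.
    parts = [" ".join(p.split()).lower() for p in address.split(",") if p.strip()]
    # Peel a leading street number off the first part, e.g. "1 milky road".
    m = re.match(r"^(\d+[a-z]?)\s+(.+)$", parts[0]) if parts else None
    if m:
        parts = [m.group(1), m.group(2)] + parts[1:]
    # Reverse so the broadest part (country or state) comes first.
    return parts[::-1]

print(split_address("1 Milky Road, Beautiful Town, Victoria, Australia"))
# ['australia', 'victoria', 'beautiful town', 'milky road', '1']
```

Ordering the parts from broadest to narrowest is an assumption that makes them convenient to feed into a tree one level at a time.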
Data Structure
Internally FunnelCat is implemented as a search tree.
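As a rough illustration of what such a tree could look like, the sketch below uses nested dictionaries, with one level per address part. The names `insert` and `contains` and the dictionary representation are assumptions made for this example; the real structure may differ.

```python
def insert(tree: dict, parts: list[str]) -> None:
    """Add an address, given as parts ordered broadest first, to the tree."""
    node = tree
    for part in parts:
        # Reuse the existing branch for this part, or create a new one.
        node = node.setdefault(part, {})

def contains(tree: dict, parts: list[str]) -> bool:
    """Return True if the full path of parts exists in the tree."""
    node = tree
    for part in parts:
        if part not in node:
            return False
        node = node[part]
    return True

tree: dict = {}
insert(tree, ["australia", "victoria", "beautiful town", "milky road", "1"])
insert(tree, ["australia", "victoria", "korma creek", "sensational avenue", "7"])

print(contains(tree, ["australia", "victoria", "korma creek"]))  # True
print(contains(tree, ["australia", "queensland"]))               # False
```

Addresses that share a country, state, or town share the same branch, which is how the hierarchy captures closeness between addresses.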
What happens when multiple addresses match when funnelling data through the tree?
When multiple matches are possible, the address is flagged for manual intervention. Realistically, most datasets carry enough context, such as country and state, to keep manual intervention rare. Other techniques can improve accuracy on big datasets. For example, the unit numbers and street numbers already known for a street fall within a particular range. When a new address arrives without a country or state, and its number lies within the known range of one candidate match but outside the range of another, it most likely belongs to the first.