Dedupe me

Project Info

TinyIdeas 💡

Project Description


Machine learning based entity resolution to the rescue


Data Story


Databases somehow always end up with duplicate entries, but we can solve that using machine learning based entity resolution (a.k.a. record linkage, fuzzy matching, etc.).

Entity resolution typically requires:
1) Deduplication (removal of exact copies of records)
2) Record Linkage (identifying records that may reference the same business)
3) Canonicalization (ensuring data with more than one representation are in a standardised form)
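
The three steps above can be sketched in a few lines of Python. This is only an illustration, not what csvdedupe does internally: csvdedupe learns a matching model from labelled training pairs, whereas this sketch swaps in a plain string-similarity ratio from the standard library. The function names, the sample business names and the 0.8 threshold are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher


def canonicalize(name):
    """Step 3 (simplified): lowercase, drop punctuation, collapse whitespace."""
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(name.split())


def dedupe_exact(records):
    """Step 1: drop exact copies while keeping first-seen order."""
    return list(dict.fromkeys(records))


def link_records(records, threshold=0.8):
    """Step 2 (stand-in for a learned model): greedily group records whose
    canonical form is similar to the first member of an existing group."""
    clusters = []
    for rec in records:
        key = canonicalize(rec)
        for cluster in clusters:
            anchor = canonicalize(cluster[0])
            if SequenceMatcher(None, key, anchor).ratio() >= threshold:
                cluster.append(rec)
                break
        else:
            # No existing group was close enough, so start a new one.
            clusters.append([rec])
    return clusters


# Hypothetical sample data, not taken from the challenge data sets.
records = [
    "Acme Pty Ltd",
    "Acme Pty Ltd",
    "ACME PTY. LTD.",
    "Acme Pty Limited",
    "Widget Co",
]
unique = dedupe_exact(records)
clusters = link_records(unique)
for cluster in clusters:
    print(cluster)
```

A fixed threshold like this is brittle on real data, which is the main reason to prefer a trained matcher such as csvdedupe for step 2.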

Only steps 1 and 2 were addressed during this challenge. Out of 47404 records, 1920 unique businesses were identified using csvdedupe (https://github.com/dedupeio/csvdedupe).

Perhaps you can even use this during form filling and validation to reduce any further duplicates.
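
As a rough sketch of that form-validation idea, a form could look up near matches among known business names before accepting a new entry. This was not built during the challenge; the function name `suggest_existing`, the sample names and the 0.8 cutoff are assumptions, and the matching uses the standard library's `difflib` rather than a trained model.

```python
from difflib import get_close_matches


def suggest_existing(new_name, known_names, cutoff=0.8):
    """Return up to three known names that closely resemble new_name."""
    # Compare case-insensitively, but return the names as originally stored.
    lookup = {name.lower(): name for name in known_names}
    hits = get_close_matches(new_name.lower(), list(lookup), n=3, cutoff=cutoff)
    return [lookup[hit] for hit in hits]


# Hypothetical names already in the database.
known = ["Acme Pty Ltd", "Widget Co"]

print(suggest_existing("ACME Pty Ltd.", known))
print(suggest_existing("Brand New Business", known))
```

If the list of suggestions is non-empty, the form can ask the user whether they meant one of the existing entries instead of creating a new record.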

NB. Step 1 was done in Excel and step 2 with csvdedupe, which is simply a CLI program, so the only evidence of work is the training data generated by the program.


Evidence of Work

Video

Homepage

Team DataSets

IP Australia Govhack 2018 sample data

Data Set

ABR lookup

Data Set

Challenges

Bounty: Finding all the like needles in the haystack

Region: Australia

Challenge

Matching Applicants

Region: Australia

Challenge