Tax Help Helper
We are a team of data analysts in Brisbane: Brendan Sulivan, Mathew Taylor and Luke Ginn. Our project is called Tax Help Helper, and we believe it is the solution to ‘The Friendly ATO’ and ‘Tax Return Help Centers’ challenges.
Without prior experience or support, new users can struggle to understand how the tax system works. The Tax Help Center program provided by the Australian Taxation Office (ATO) offers vital assistance to those who need this support. However, the challenge facing the ATO is identifying where help is needed and who needs it most. This information is key to deploying ATO resources effectively so that they support as many people as possible.
Using a ‘Machine Learning’ approach, our vision is to give the ATO the key information it needs to direct its resources to where they best engage its clients.
You can access our solution:
Our Video Solution:
The information we used for this project was provided by the ATO, as well as additional information from the Australian Bureau of Statistics. Geolocation data was necessary for our project's visualization and user interface.
We joined the datasets via a left outer join against the full list of unique postcodes, then used a union to combine the years. This was done using a combination of R and SQL.
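The join-then-union step can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than the R and SQL tooling the team used, and every table name, column name and value here (postcodes, stats_2006, stats_2011, population) is a made-up placeholder, not the real ATO or ABS schema.

```python
import sqlite3

# Hypothetical toy tables; the real datasets and columns are not shown in this write-up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE postcodes (postcode TEXT PRIMARY KEY);
CREATE TABLE stats_2006 (postcode TEXT, population INTEGER);
CREATE TABLE stats_2011 (postcode TEXT, population INTEGER);
INSERT INTO postcodes VALUES ('4000'), ('4005'), ('4101');
INSERT INTO stats_2006 VALUES ('4000', 100), ('4005', 200);
INSERT INTO stats_2011 VALUES ('4000', 120), ('4101', 300);
""")

# Left outer join keeps every unique postcode for each year, even when that
# year has no matching row; the union then stacks the years into one table.
query = """
SELECT p.postcode AS postcode, 2006 AS year, s.population AS population
FROM postcodes AS p
LEFT OUTER JOIN stats_2006 AS s ON s.postcode = p.postcode
UNION
SELECT p.postcode AS postcode, 2011 AS year, s.population AS population
FROM postcodes AS p
LEFT OUTER JOIN stats_2011 AS s ON s.postcode = p.postcode
ORDER BY year, postcode
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Postcodes with no data for a given year still appear, with a missing (NULL) value, so each year contributes one row per postcode.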
The machine learning algorithm was trained on 2006 and 2011 data, and we then predicted results for the 2016 data. The model of choice was xgboost with Bayesian optimization. The objective of the model was to minimize the RMSE (root mean squared error). We chose the algorithm for its accuracy and its robustness against overfitting, and also because it can handle missing data, which may be a problem commonly faced by the ATO or ABS.
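For readers unfamiliar with the metric, RMSE can be computed as in the short sketch below. This is only an illustration of the formula in plain Python, not the team's xgboost training code, and the function name and sample numbers are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Average the squared differences, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values: observed figures for three postcodes versus predictions.
print(rmse([10, 20, 30], [12, 18, 33]))
```

Because the errors are squared before averaging, large misses on individual postcodes are penalised more heavily than many small ones.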
Through the hypothesis process we tried several variations. We looked into removing data columns as a form of feature selection. We compared training on only 2011 data with training on the combination of 2006 and 2011. We also looked into aggregating postal codes by region in Australia for the model to learn from. After all these iterations, we decided to train on 2006 and 2011 data without aggregating postal codes, for both model accuracy and usability for the ATO.
We were satisfied with the results once accuracy reached around 90%. We recognized that the remaining error may actually be a result of misuse of tax help center resources, and that the model may be approaching the theoretical limit for this data science problem.
To make the product usable, we published it online to a Tableau server so that it can be used by the general public or by taxation officers. The link could be sent to all tax help center volunteers across Australia, or used to help identify which postal codes may require tax help centers in the following years.
Evidence of Work
Gov Hack 2018
Description of Use: All the sheets were used
Description of Use: The longitudes and latitudes are joined to the GovHack2018 data for visualisation purposes