Data Science for Everyone
In the last decade, terabytes of information have been collected from various industries, not to mention historical data that has been collected over many decades. However, organizations seem unsure of how they wish to use this data, both in the questions they wish to ask and the visualizations they wish to draw. There is a vast amount of untapped potential in this data, including more effective handling of project finances, more effective planning of projects, and a better understanding of the social impacts of particular decisions.
There has been a lot of work in the last couple of years to explore the data that has been gathered, using a wide array of technologies including Machine Learning, dashboards through tools like Power BI, and many others. Many of these tools require specialist training or access to expensive subscriptions, and generally benefit only the organization itself. This is where our project comes in.
We believe that there is a lack of tools that help the average user understand their environment and use data to inform their everyday decisions. There is also a lack of cross-referenced data that could allow novel correlations to be drawn. People like city planners could use a tool that merges data sets to make decisions that take into account not only geographical feasibility and similar factors, but also the mental well-being impacts on the community associated with the outcome of such decisions.
Drawing value from data
Data in a raw format is not useful for making decisions because it is so difficult to consume. The crash data for Victoria alone has over 78,000 records for the last 5 years. Without significant time spent on analysing, formatting and filtering the data (assuming an individual has those skills in the first place), the data can't be used to assist in decision making. This becomes orders of magnitude more difficult when trying to merge disparate data sets, which often come from different sources.
There are three main requirements that need to be addressed to allow for the most effective comparison of data.
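As a rough illustration of the filtering and summarising step described above, the sketch below reduces raw crash rows to a count per day of the week. It is a minimal example under stated assumptions: the sample rows and the column names (ACCIDENT_NO, DAY_OF_WEEK, SEVERITY) are hypothetical and may not match the real crash export.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real export is far larger and its
# column names may differ from the ones assumed here.
SAMPLE_CSV = """ACCIDENT_NO,DAY_OF_WEEK,SEVERITY
A1,Monday,Serious injury
A2,Monday,Other injury
A3,Friday,Fatal
A4,,Other injury
"""


def count_by_column(csv_text, column):
    """Count records per value of `column`, grouping blanks under 'UNKNOWN'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter()
    for row in reader:
        value = (row.get(column) or "").strip()
        counts[value or "UNKNOWN"] += 1
    return counts


day_counts = count_by_column(SAMPLE_CSV, "DAY_OF_WEEK")
for day, number_of_crashes in day_counts.most_common():
    print(day, number_of_crashes)
```

Even a summary this small turns thousands of rows into a handful of numbers that a non-specialist can read, and it surfaces blank values instead of silently dropping them.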
Data visualisation - The key to more effective planning
Having tools that can display large sets of data in a way that the majority of people can read is an important step in making 'Data Science for Everyone' a reality. Even datasets with millions of records can be interpreted by the everyday person if they are presented in graphical formats. Colors, interactive displays and similar techniques are great ways of getting consumers of the information engaged.
Key columns - Correlations between numerous datasets
In one of our case studies, we attempted to merge AIWH 2015 Local Government Area Profiles data with the Victorian crash data set. This proved to be impossible with the data that we had due to the lack of similar columns between the data sets. Our objective was to determine if there was a correlation between mental health data and traffic collision data, including:
*The duration a person had been driving and the likelihood of being involved in an accident.
*The state of mind a person was in before having an accident.
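Whether two data sets can be merged at all can be checked up front by comparing their column names. The sketch below is a minimal illustration: every column name and value in it is hypothetical, and it is not the actual schema of either data set.

```python
def shared_columns(columns_a, columns_b):
    """Return the columns of A that also appear in B (case-insensitive)."""
    normalised_b = {name.strip().lower() for name in columns_b}
    return [name for name in columns_a if name.strip().lower() in normalised_b]


def join_on_key(rows_a, rows_b, key):
    """Inner-join two lists of dicts on `key`.

    Assumes rows_b holds at most one row per key value.
    """
    index_b = {row[key]: row for row in rows_b}
    joined = []
    for row in rows_a:
        match = index_b.get(row[key])
        if match is not None:
            joined.append({**match, **row})
    return joined


# Hypothetical column names, for illustration only.
crash_columns = ["ACCIDENT_NO", "ACCIDENT_DATE", "SEVERITY"]
profile_columns = ["LGA_NAME", "TRAVEL_TIME_INDICATOR"]
no_overlap = shared_columns(crash_columns, profile_columns)  # nothing to join on

# If both sources carried an area name, a join would become possible.
crashes = [
    {"ACCIDENT_NO": "A1", "LGA_NAME": "CASEY"},
    {"ACCIDENT_NO": "A2", "LGA_NAME": "OTHER"},
]
profiles = [{"LGA_NAME": "CASEY", "TRAVEL_TIME_INDICATOR": "example value"}]
merged = join_on_key(crashes, profiles, "LGA_NAME")
```

An empty result from shared_columns is an early signal that a key column needs to be added or derived before any correlation can be attempted.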
Complete data - Consistency, presence and accuracy
Ensuring that records have accurate data is paramount to finding relevant trends. In the Victorian Crash Data dataset, we found that some columns in particular rows had not been filled in with a valid value, for instance the day that the collision occurred. Fortunately the number of records without a recorded day was minimal, so the impact on any conclusions drawn would be small. It does however lead to trust issues with the data, and a requirement to perform data wrangling to get it into a workable state. Making key fields mandatory when the data is entered will help to protect the records from missing critical data.
Often when trying to make use of large datasets, it turns out that particular information simply hasn't been recorded. Having mechanisms that allow for feedback on desired data will help the parties gathering the data add the necessary fields to whatever forms are being submitted to the database. An example of a feedback form could be:
*Field name: What would the title of the column be?
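A completeness check of this kind can be sketched in a few lines. The example below is an assumption-laden illustration: the required field names are hypothetical, and a real entry form would enforce the rule at submission time instead of after the fact.

```python
# Hypothetical list of fields that should never be blank.
REQUIRED_FIELDS = ("ACCIDENT_NO", "ACCIDENT_DATE", "DAY_OF_WEEK")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in `record`."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


def missing_rate(records, field):
    """Fraction of records in which `field` is absent or blank."""
    if not records:
        return 0.0
    blanks = sum(1 for record in records if missing_fields(record, (field,)))
    return blanks / len(records)


# Hypothetical records: the second one has no day recorded.
records = [
    {"ACCIDENT_NO": "A1", "ACCIDENT_DATE": "2018-01-02", "DAY_OF_WEEK": "Tuesday"},
    {"ACCIDENT_NO": "A2", "ACCIDENT_DATE": "2018-01-03", "DAY_OF_WEEK": " "},
]
day_missing_rate = missing_rate(records, "DAY_OF_WEEK")
```

Reporting the missing rate per field makes it possible to judge whether gaps are small enough to ignore or large enough to undermine a conclusion.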
*Field data: A description of the type of data that would be captured.
*Reason: Why does the person requesting the data want it? What benefits would it bring?
*Suggestion for acquisition: If the suggester has ideas as to how the data could be gathered, this should be included.
Using the format suggested above, we could request that driver awareness prior to the crash be added with the following:
*Field name: DRIVER_AWARENESS
*Field data: A numerical value that indicates how aware the driver was. Could be based on how open their eyes were, how many blinks per minute, etc.
*Reason: To determine if driver fatigue was possibly a contributor to the collision.
*Suggestion for acquisition: Sensor in vehicle, or wearable tech.
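The feedback form above maps naturally onto a small record type. The sketch below is one possible representation, not an existing system: the FieldRequest class and its validate method are hypothetical names introduced for illustration, populated with the DRIVER_AWARENESS request from the example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FieldRequest:
    """One request to add a new column to a data set (hypothetical type)."""

    field_name: str
    field_data: str
    reason: str
    suggestion_for_acquisition: str = ""  # optional on the form

    def validate(self):
        """Return a list of problems; an empty list means the request is usable."""
        problems = []
        if not self.field_name.strip():
            problems.append("field_name is required")
        if not self.field_data.strip():
            problems.append("field_data is required")
        if not self.reason.strip():
            problems.append("reason is required")
        return problems


request = FieldRequest(
    field_name="DRIVER_AWARENESS",
    field_data="A numerical value that indicates how aware the driver was.",
    reason="To determine if driver fatigue was possibly a contributor to the collision.",
    suggestion_for_acquisition="Sensor in vehicle, or wearable tech.",
)
problems = request.validate()
as_row = asdict(request)  # plain dict, ready to store or export
```

Keeping the acquisition suggestion optional mirrors the form, where it is only filled in if the suggester has an idea of how the data could be gathered.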
Evidence of Work
2015 Local Government Area Profiles
Description of Use: We wanted to compare this data with road statistics for the area, but were unable to due to the data formatting and the time constraints of this exercise. We used the travel times and related physical and mental health statistics, and compared them with the Victorian average to see where the City of Casey sits in the state.
Rainfall, temperature and wind forecast and observations - verification 2017-05 to 2018-04
Description of Use: Used in conjunction with traffic collision data to see if the weather had a major impact on the likelihood of a collision and on traffic flow. Weather data was not included in the collision statistics, which seems like an oversight. It could be used to make sure that unnecessary infrastructure upgrades are not carried out in the belief that an intersection is the problem, when the cause was simply the weather conditions at the time.
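One way such a comparison could be made is to match crashes to daily rainfall by date and compare the average number of crashes on wet and dry days. The sketch below is illustrative only: the dates, rainfall values and the 1.0 mm wet-day threshold are made-up assumptions, not figures from the data sets.

```python
from collections import Counter


def crashes_per_day_by_rain(crash_dates, rainfall_mm_by_date, wet_threshold_mm=1.0):
    """Average crashes per day on wet and dry days.

    Days with rainfall data but no crashes still count as days.
    Crashes on dates without a rainfall observation are ignored.
    """
    crash_counts = Counter(crash_dates)
    totals = {"wet": [0, 0], "dry": [0, 0]}  # label -> [crashes, days]
    for day, rain_mm in rainfall_mm_by_date.items():
        label = "wet" if rain_mm >= wet_threshold_mm else "dry"
        totals[label][0] += crash_counts.get(day, 0)
        totals[label][1] += 1
    return {
        label: (crashes / days if days else None)
        for label, (crashes, days) in totals.items()
    }


# Made-up toy values, for illustration only.
rainfall = {"2018-01-01": 0.0, "2018-01-02": 12.4, "2018-01-03": 0.2, "2018-01-04": 5.0}
crash_dates = ["2018-01-02", "2018-01-02", "2018-01-03", "2018-01-04"]
rates = crashes_per_day_by_rain(crash_dates, rainfall)
```

A difference between the two averages would only suggest a relationship; traffic volume on wet and dry days would need to be taken into account before drawing a conclusion.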
Crashes Last Five Years
Description of Use: Used to determine if there were intersections that were highly dangerous, and if there was a common theme among them that could indicate that their continued use in the future was not a good idea.
Traffic volume - VicRoads open data
Description of Use: Used to determine points of high congestion, and used in conjunction with other data sets (like travel duration) to better understand the stress on drivers.