Project Description
Insights from the Thaum AI team.
We ran our NLP model on over 150,000 transcripts of Australian politicians and generated visualisations that display the sentiment (emotional tone) of public discourse on marginal topics. We hope that, with this information and an AI-driven knowledge base, we can be more mindful of our language, how it is perceived, and the effect it has as it co-evolves over time with culture, disasters, protest, and media.
Ultimately, we hope this allows us to better understand and communicate with each other, building empathy, respect, and self-awareness. I truly believe we're going to have a good old age.
Data Story
Our primary focus was collecting and cleaning as much publicly available political text data as possible.
Our primary datasets were the following:
Australian Parliament - Record of Proceedings - Hansard API
https://researchdata.edu.au/search/#!/rows=15/sort=score%20desc/class=collection/q=hansard/p=1/
We found the Australian Parliament Hansard API to be very complete and easy to use. However, the Hansard XML files contain a lot of extraneous data that needed to be cleansed before we could apply NLP.
We had three use cases for this dataset:
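To illustrate the kind of cleansing involved, here is a minimal sketch that pulls speaker, party and plain text out of a Hansard-style XML fragment. The element names (`speech`, `talker`, `name`, `party`, `para`) and the rule for dropping bracketed procedural notes are assumptions made for this example; they are not a description of our actual pipeline or of the full Hansard schema.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; real Hansard XML carries far more markup.
SAMPLE = """<debate>
  <speech>
    <talker><name>Smith, John</name><party>ALP</party></talker>
    <para>Mr Speaker,   I rise to   support this bill.</para>
    <para>(Time expired)</para>
  </speech>
</debate>"""


def extract_speeches(xml_text):
    """Return one dict per <speech> with speaker, party and cleaned text."""
    root = ET.fromstring(xml_text)
    speeches = []
    for speech in root.iter("speech"):
        name = speech.findtext("talker/name", default="").strip()
        party = speech.findtext("talker/party", default="").strip()
        paragraphs = []
        for para in speech.iter("para"):
            # Flatten nested inline tags and collapse runs of whitespace.
            text = " ".join("".join(para.itertext()).split())
            # Assumed rule: drop purely procedural notes such as "(Time expired)".
            if text and not re.fullmatch(r"\(.*\)", text):
                paragraphs.append(text)
        speeches.append(
            {"speaker": name, "party": party, "text": " ".join(paragraphs)}
        )
    return speeches
```

Calling `extract_speeches(SAMPLE)` yields a list of plain records that can be passed on to any downstream NLP step.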
(1) general sentiment analysis conditioned on party affiliation,
(2) linguistic and semantic choices conditioned on party affiliation,
(3) language and sentiment expression depending on debate topic.
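As a rough sketch of use case (1), sentiment can be scored per speech and then averaged per party. The tiny word lists and the `lexicon_score` function below are stand-ins invented for this example; they are not the model we actually ran, and any real sentiment scorer could be swapped in.

```python
from collections import defaultdict

# Toy lexicon for illustration only; not the model used in the project.
POSITIVE = {"support", "good", "proud"}
NEGATIVE = {"fail", "crisis", "bad"}


def lexicon_score(text):
    """Score text in [-1, 1] as (positive - negative) / number of words."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


def mean_sentiment_by_party(speeches):
    """Average the per-speech score for each party.

    `speeches` is an iterable of dicts with "party" and "text" keys.
    """
    scores = defaultdict(list)
    for speech in speeches:
        scores[speech["party"]].append(lexicon_score(speech["text"]))
    return {party: sum(vals) / len(vals) for party, vals in scores.items()}
```

The same grouping works for use case (3) by keying on a debate topic field instead of the party.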
Commonwealth Parliamentary Debates (Hansard), 1901-1980
https://github.com/wragge/hansard-xml
We used this dataset to test our approach to the above while we obtained newer data.
PM Transcripts repository
https://github.com/wragge/pm-transcripts
This dataset had several significant issues that needed to be overcome before we could use it. In particular, the content is often in interview format, which means that analysis of PM language and sentiment would be mixed with that of the interviewer. Parsing this data proved difficult because the transcription formats varied considerably, so we had to drop a lot of data. Furthermore, much of the data was transcribed using OCR, which did a poor job. Our workaround was to apply an automatic spellchecker; however, this is far from a perfect solution.
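The two problems above can be sketched as follows: splitting an interview into speaker turns so that only the PM's words are kept, and a crude vocabulary-based correction for OCR errors. The upper-case `LABEL:` turn format, the label names, and the use of `difflib` as the spellchecker are all assumptions for this example; the real transcripts vary far more than this, and our actual spellchecker is not shown here.

```python
import difflib
import re

# Assumed convention: a turn starts with an upper-case label and a colon.
SPEAKER_RE = re.compile(r"^([A-Z][A-Z .'-]{1,40}):\s*(.*)$")


def split_turns(transcript):
    """Split a transcript into (speaker, text) turns."""
    turns = []
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SPEAKER_RE.match(line)
        if match:
            turns.append([match.group(1).strip(), match.group(2)])
        elif turns:
            # Continuation line: append to the current speaker's turn.
            turns[-1][1] = (turns[-1][1] + " " + line).strip()
    return [(speaker, text) for speaker, text in turns]


def pm_only(transcript, labels=("PM", "PRIME MINISTER")):
    """Keep only the text spoken under the given (hypothetical) PM labels."""
    return " ".join(t for s, t in split_turns(transcript) if s in labels)


def correct_ocr(text, vocabulary, cutoff=0.8):
    """Replace unknown lowercase words with their closest vocabulary entry."""
    vocab = sorted(vocabulary)
    known = set(vocab)
    fixed = []
    for word in text.split():
        if word.isalpha() and word.islower() and word not in known:
            candidates = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
            fixed.append(candidates[0] if candidates else word)
        else:
            fixed.append(word)
    return " ".join(fixed)
```

Transcripts that do not follow the assumed label format produce no turns at all, which mirrors why a lot of data had to be dropped.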
We used this for many different analyses: (1) PM sentiment, emotion and fine-emotion over time, (2) identifying PM rhetoric, (3) analysis of crisis response and language.
Federal Election speeches
https://electionspeeches.moadoph.gov.au/explore
Parliamentary press releases relating to immigrants and refugees
https://glam-workbench.github.io/trove-journals/#politicians-talking-about-immigrants-and-refugees
We used this to determine keywords for searching for crisis response language.
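One simple way to derive such keywords is to rank terms by how many press releases they appear in. The sketch below does this with a small stopword list; the function name, the stopwords and the ranking rule are assumptions for illustration and do not document the exact procedure we followed.

```python
import re
from collections import Counter

# Minimal stopword list for the example; a real list would be much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "by", "in", "on", "for"}


def top_keywords(documents, k=10):
    """Return the k terms that occur in the most documents.

    Ties are broken alphabetically so the result is deterministic.
    """
    doc_freq = Counter()
    for doc in documents:
        terms = set(re.findall(r"[a-z]+", doc.lower())) - STOPWORDS
        doc_freq.update(terms)
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]
```

The resulting terms can then be used as search queries against the other datasets.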