Classification of Covid-19 Misinformation: A Novel Transfer Learning Framework for the Detection of COVID-19 Misinformation on US Twitter Data

Sophia Wallgren & Jakob Axel Agaton Nilsson

Student thesis: Master thesis

Abstract

As governments around the world struggled to mitigate the spread of COVID-19, their efforts were undermined by the online propagation of misinformation about the virus, that is, false claims presented as fact. Social media platforms facilitated the diffusion of misinformation, exacerbating the severity of the pandemic and highlighting the need for Machine Learning systems that detect such content automatically. Research within this domain has primarily utilized supervised models on small datasets, and no studies have leveraged raw tweet data to analyze the characteristics of misinformation and its geographical variation across the US. This study addresses that research gap using a novel transfer learning framework for misinformation detection on US Twitter data. We train Naive Bayes, Support Vector Machine, and Logistic Regression models for misinformation detection on a corpus of existing labeled datasets from the literature. With transfer and semi-supervised learning, we classify a dataset of unlabeled COVID-19 related tweets collected using the Twitter API and perform sentiment analysis with VADER. The misinformation predictions are manually evaluated on a sample to verify performance. We explore the relationships between misinformation, sentiment, geography, and COVID-19 cases and deaths by state. Our logistic regression misinformation classifier achieved an F1 score of 96% on the labeled dataset. The transfer learning and semi-supervised models achieved macro F1 scores of 58% and 65%, respectively, on a manually evaluated sample from the unlabeled Twitter data. We observed that misinformative tweets have a higher sentiment intensity and are more negative than informative tweets. Using linear regression, we identified a statistically significant relationship between latitude and the number of COVID-19 deaths. We found that the proportion of misinformation is highest in the South and lowest in the Midwest.
The misinformation detection framework presented in this thesis shows great potential for robust classification of large and unlabeled COVID-19 Twitter data and guides future work in this domain.
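The results above are reported as macro F1 scores. As a reminder of what that metric measures, the following is a minimal Python sketch of macro-averaged F1, in which the F1 score is computed per class and the classes are then weighted equally. The function name `macro_f1`, the label strings, and the toy data are illustrative assumptions and are not taken from the thesis.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of the per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Consider every class that appears in either the gold or predicted labels.
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))

    per_class_f1 = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)

    return sum(per_class_f1) / len(per_class_f1)


# Toy example with hypothetical labels (not data from the thesis).
gold = ["misinfo", "misinfo", "info", "info"]
pred = ["misinfo", "info", "info", "info"]
score = macro_f1(gold, pred)
```

Because each class contributes equally regardless of its size, macro F1 penalizes a classifier that performs poorly on a minority class, which is one reason it is commonly preferred when the classes in a manually evaluated sample are imbalanced.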

Educations: MSc in Business Administration and Data Science, (Graduate Programme) Final Thesis
Language: English
Publication date: 2022
Number of pages: 125
Supervisors: Daniel Hardt