Fake News Detection and Production Using Transformer-based NLP Models

Branislav Sándor, Frode Paaske & Martin Pajtás

Student thesis: Master thesis

Abstract

This paper studies fake news detection using the largest publicly available dataset of naturally occurring, expert fact-checked claims (Augenstein et al., 2019). Following existing theory that defines fake news detection as a binary classification problem, this paper carried out an extensive process of reducing the label space of the dataset. Traditional machine learning models and three different BERT-based models were then applied to the resulting binary classification task to investigate fake news detection performance. The RoBERTa model performed best, with an accuracy score of 0.7094. This suggests that the model is capable of capturing syntactic features from a claim without the use of external features. In addition, this paper investigated the feasibility and effects of expanding the existing training data with artificially produced claims generated by the GPT-2 language model. The results showed that adding artificially produced training data, whether fact-checked or not, generally led to worse performance of the BERT-based models while increasing the accuracy scores of the traditional machine learning models. When the human-produced training data was expanded with artificially produced claims, the Naïve Bayes model achieved the highest overall accuracy: 0.7058 with fact-checked claims and 0.7047 with non-fact-checked claims. These effects were hypothesized to be caused by differences in the underlying architectures of the models; in particular, the self-attention element of the Transformer architecture might have suffered from the stylistic and grammatical inconsistencies in the artificially produced text. The results of this paper suggest that the field of automatic fake news detection requires further research.
Specifically, future work should address the insufficient quality, size, and diversity of the available data, the increasing demand for computational resources, and the inadequate inference speed, all of which severely limit the application of BERT-based models in real-life scenarios.

Educations: MSc in Business Administration and Information Systems, (Graduate Programme) Final Thesis
Language: English
Publication date: 2020
Number of pages: 102
Supervisors: Daniel Hardt