Computational Analysis of the Language used by Twitter Users: Big Data Approach Predicting Social Media Addiction

Rima Brazinskaité & Rumyana Todorova

Student thesis: Master thesis


This research examines language patterns in tweets, such as combinations of words, sentences, and the sentiment of words. Its purpose is to identify two groups of social media users on the basis of the language they use. To this end, textual data is used to classify users as “Heavy/Addicted Users” or “Normal Users”. We also explore how big data brings value to addiction research and how health care services can benefit from it.
The CRISP-DM methodology for data mining was employed to ensure a systematic process across the steps involved. Two approaches were taken: automatic and manual analysis. For the automatic analysis, a regression model was applied to previously scraped Twitter data with a bag-of-words representation to investigate language patterns. For the manual analysis, example tweets were examined to identify specific language features, such as the use of interjections, exclamation marks, and tweets written in capital letters. The aim of both approaches is to find language patterns used by “Addicted/Heavy” users and by “Normal” users.
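As a minimal sketch of the two kinds of representation described above, the following Python snippet builds a bag-of-words count for a tweet and extracts the surface features named for the manual analysis. It is an illustration only, not the thesis's implementation: the function names, the tokenization rule, and the interjection list are all assumptions introduced here.

```python
import re
from collections import Counter

# Hypothetical, illustrative interjection list; not taken from the thesis.
INTERJECTIONS = {"wow", "omg", "ugh", "yay", "oh", "lol"}


def tokenize(tweet):
    """Lowercase the tweet and split it into simple word tokens."""
    return re.findall(r"[a-z']+", tweet.lower())


def bag_of_words(tweet):
    """Return word counts for one tweet (a bag-of-words representation)."""
    return Counter(tokenize(tweet))


def surface_features(tweet):
    """Count interjections and exclamation marks, and flag all-caps tweets."""
    tokens = tokenize(tweet)
    letters = [c for c in tweet if c.isalpha()]
    return {
        "interjections": sum(1 for t in tokens if t in INTERJECTIONS),
        "exclamation_marks": tweet.count("!"),
        # A tweet counts as all-caps only if it contains at least one letter.
        "all_caps": bool(letters) and all(c.isupper() for c in letters),
    }
```

Word counts of this kind could serve as input to a regression model, while the surface features approximate what a manual reader would look for; neither captures the context that a word is used in.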
Findings show that it is possible to identify language patterns in textual data in order to classify Twitter users into two groups. However, doing so requires context and additional information about the users' tweets, because some of the words they use can mean different things depending on the context in which they appear.
Additionally, we applied a machine learning model to predict social media addiction based on the tweets. Comparing the two approaches, machine learning performed well in terms of accuracy; however, when analysing language, the manual analysis resulted in better identification of language patterns. This process was applied to only three users; for future work, we suggest applying it to more users.

Educations: MSc in Business Administration and Information Systems, (Graduate Programme) Final Thesis
Publication date: 2020
Number of pages: 127
Supervisors: Daniel Hardt