Investigating Gender Bias in Job Advertisements with Word Embeddings

Krista Vágsheyg & Charlotte Sophie Wilhelmsen

Student thesis: Master thesis

Abstract

This thesis is an exploratory study of gender bias in Danish job advertisements from Jobindex, the largest Danish database of job advertisements. The methods combine computer science and linguistics, using natural language processing and, specifically, word embeddings. The embeddings used to investigate gender bias in the advertisements are provided by fastText, which was created by an AI research team at Facebook. The thesis attempts to automatically calculate a gender bias score from the words in a job advertisement by comparing the similarity of the advertisement to the male and female identifiers ‘han’ (he) and ‘hun’ (she). This is done by vectorising all terms in the advertisement, scoring each vector by cosine similarity, and averaging these scores over the advertisement. Scores range from -1 (extremely female-biased) to 1 (extremely male-biased); scores close to 0 are interpreted as neutral. The empirical data was collected from Jobindex and consists of job advertisements from four years: 2008, 2014, 2017 and 2019. We compare these with the gender distribution of Danish industries, using data provided by Statistics Denmark. We also manually annotated 100 advertisements to uncover subconscious bias and to judge whether the advertisements were directed towards a specific gender, and compared these annotations with the automatic scores. Our results show that word embeddings can be used to uncover bias; however, several aspects of the word embeddings are questionable. We found that the level of bias in function words is high and therefore affects the overall score. Furthermore, we found that certain occupations related to teaching and especially public services were strongly male-biased in the pretrained Wikipedia model.
In future work, we would like to develop our approach further by discounting function words and by building an automatic classifier on a substantially larger set of manual annotations.
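
The scoring idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not give the exact formula, so the sketch assumes one possible reading, in which each word is scored by its cosine similarity to the direction from ‘hun’ to ‘han’ and the per-word scores are averaged. The function names and the two-dimensional toy vectors are hypothetical and invented for illustration; in practice the vectors would come from a pretrained fastText model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def gender_bias_score(tokens, vectors, male="han", female="hun"):
    """Average per-word bias in [-1, 1]; positive leans male, negative leans female.

    Assumption: a word's bias is its cosine similarity to the direction
    (male vector minus female vector). Tokens without a vector are skipped.
    """
    direction = [m - f for m, f in zip(vectors[male], vectors[female])]
    scores = [cosine(vectors[t], direction) for t in tokens if t in vectors]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical toy vectors, made up for this example (not real fastText output).
TOY_VECTORS = {
    "han": [1.0, 0.0],
    "hun": [0.0, 1.0],
    "ingeniør": [0.9, 0.1],
    "sygeplejerske": [0.1, 0.9],
    "og": [0.5, 0.5],
}

if __name__ == "__main__":
    advertisement = ["ingeniør", "og", "sygeplejerske"]
    print(gender_bias_score(advertisement, TOY_VECTORS))
```

Under this reading, a function word such as ‘og’ only stays neutral if its vector is equidistant from both identifiers. The abstract reports that real function words carry high bias, which is why discounting them is proposed as future work.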

Educations: MSc in Business Administration and Information Systems, (Graduate Programme) Final Thesis
Language: English
Publication date: 2019
Number of pages: 118
Supervisors: Daniel Hardt