Abstract
This thesis consists of six chapters, including the introduction and the conclusions. The chapters are dedicated to enhancing the transparency of key models in Machine Learning. In this dissertation, I propose novel Mathematical Optimization models to trade off accuracy and transparency in Cluster Analysis, Supervised Classification, and Treatment Allocation.
In Chapter II, co-authored with Emilio Carrizosa, Alfredo Marín, and Dolores Romero Morales, we tackle the problem of enhancing the interpretability/explainability of the results of Cluster Analysis, which is one of the transparency criteria pursued in this dissertation. Our goal is to find an explanation for each cluster, such that clusters are characterized as precisely and distinctively as possible, i.e., the explanation is fulfilled by as many individuals as possible of the corresponding cluster (true positive cases) and by as few individuals as possible in the remaining clusters (false positive cases). We assume that a dissimilarity between the individuals is given, and propose distance-based explanations, namely those defined by individuals that are close to their so-called prototype. To find the set of prototypes, we address the bi-objective optimization problem that maximizes the total number of true positive cases across all clusters and minimizes the total number of false positive cases, while controlling the true positive rate as well as the false positive rate in each cluster. We develop two Mixed Integer Linear Programming (MILP) models, inspired by classic Location Analysis problems, that differ in the way individuals are allocated to prototypes. We illustrate the explanations provided by these models and their accuracy on both real-world and simulated data.
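A minimal sketch of this bi-objective problem, in notation assumed here rather than taken verbatim from the thesis: writing $P$ for a candidate set of prototypes, $TP_k(P)$ and $FP_k(P)$ for the true and false positive counts of cluster $k$, $n_k$ for its size, and $n$ for the total number of individuals, the models seek

$$
\max_{P} \sum_{k} TP_k(P) \qquad \text{and} \qquad \min_{P} \sum_{k} FP_k(P),
$$

subject to per-cluster rate controls of the form $TP_k(P) \ge \alpha\, n_k$ and $FP_k(P) \le \beta\,(n - n_k)$ for user-chosen thresholds $\alpha$ and $\beta$.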
In Chapter III, co-authored with Emilio Carrizosa, Alfredo Marín, and Dolores Romero Morales, we make Cluster Analysis more interpretable with a new approach that simultaneously allocates individuals to clusters and provides rule-based explanations for each cluster. The traditional homogeneity metric in clustering, namely the sum of the dissimilarities between individuals in the same cluster, is enriched by also considering, for each cluster and its associated explanation, two explainability criteria, namely the accuracy of the explanation, i.e., how many individuals within the cluster satisfy its explanation, and the distinctiveness of the explanation, i.e., how many individuals outside the cluster satisfy its explanation. Finding the clusters and the explanations optimizing a joint measure of homogeneity, accuracy, and distinctiveness is formulated as a multi-objective MILP problem, from which non-dominated solutions are generated. We illustrate the clusters and the accuracy of the corresponding explanations on real-world data.
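As an illustration of how such criteria can be combined, a weighted-sum scalarization (one standard way to generate non-dominated solutions; the weights $\lambda_1, \lambda_2 \ge 0$ are assumptions of this sketch, not the chapter's exact formulation) would read

$$
\min \; \sum_{k} \sum_{i,j \in C_k} d(i,j) \;-\; \lambda_1 \sum_{k} \big|\{ i \in C_k : i \text{ satisfies } E_k \}\big| \;+\; \lambda_2 \sum_{k} \big|\{ i \notin C_k : i \text{ satisfies } E_k \}\big|,
$$

where $C_k$ is cluster $k$ with explanation $E_k$ and $d$ is the given dissimilarity; sweeping the weights traces out the trade-off between homogeneity, accuracy, and distinctiveness.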
In Chapter IV, co-authored with Emilio Carrizosa and Dolores Romero Morales, we investigate how to make tree ensembles in Supervised Classification more transparent, incorporating explainability and fairness criteria by design. While explainability helps the user understand the key features that play a role in the classification task, with fairness we ensure that the ensemble does not discriminate against a group of observations that share a sensitive attribute. We propose an MILP formulation to train an ensemble of trees that, apart from minimizing the misclassification error, controls for sparsity as well as the accuracy in the sensitive group. Our formulation is scalable in the number of observations, since its number of binary decision variables is independent of the number of observations. In our numerical results, we show that for standard datasets used in the fairness literature, we can dramatically enhance the fairness of the benchmark, namely the popular Random Forest, while using only a few features, all without damaging the misclassification error.
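A hedged sketch of the structure of such a formulation (the symbols below are assumptions for illustration, not the chapter's actual variables): with $u_j \in \{0,1\}$ indicating whether feature $j$ is used anywhere in the ensemble, one trains

$$
\min_{\theta, u} \; \mathrm{err}(\theta) \quad \text{s.t.} \quad \sum_{j} u_j \le s, \qquad \mathrm{err}_S(\theta) \le \mathrm{err}(\theta) + \varepsilon,
$$

where $\mathrm{err}$ is the overall misclassification error, $\mathrm{err}_S$ the error on the group sharing the sensitive attribute, $s$ a sparsity budget, and $\varepsilon$ a fairness tolerance. Because the binary variables $u_j$ are indexed by features rather than observations, their count stays constant as the training set grows, which is the source of the scalability claim.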
In Chapter V, I investigate the Treatment Allocation problem, where one has to decide which individuals will receive treatment and which will not. If not carefully trained, the algorithm may produce unfair results, unequally allocating treatment to individuals in the sensitive (e.g., females) and non-sensitive (e.g., males) groups. To deal with this, I propose to measure unfairness as the difference between the average treatment effects in the sensitive group and the non-sensitive group. I introduce a Mathematical Optimization model that yields accurate heterogeneous treatment effect predictions together with a good level of fairness, which will be the basis for the treatment allocation in forthcoming individuals. I present results on simulated datasets, illustrating that my model provides fairer predictions of the treatment effect than the benchmark.
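Stated as a formula (with $\hat\tau_i$ the predicted individual treatment effect, $S$ the sensitive group, and $\bar S$ its complement; taking the absolute value of the difference is an assumption of this sketch):

$$
\text{unfairness} \;=\; \left| \frac{1}{|S|} \sum_{i \in S} \hat\tau_i \;-\; \frac{1}{|\bar S|} \sum_{i \in \bar S} \hat\tau_i \right|,
$$

and the optimization model trades off the accuracy of the predictions $\hat\tau_i$ against keeping this quantity small.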
Original language | English
---|---
Place of Publication | Frederiksberg
Publisher | Copenhagen Business School [Phd]
Number of pages | 127
ISBN (Print) | 9788775682737
ISBN (Electronic) | 9788775682744
DOIs |
Publication status | Published - 2024
Series | PhD Series
Number | 21.2024
ISSN | 0906-6934