TY - JOUR
T1 - The Tree Based Linear Regression Model for Hierarchical Categorical Variables
AU - Carrizosa, Emilio
AU - Mortensen, Laust Hvas
AU - Romero Morales, Dolores
AU - Sillero-Denamiel, M. Remedios
PY - 2022/10
Y1 - 2022/10
N2 - Many real-life applications consider nominal categorical predictor variables that have a hierarchical structure, e.g. economic activity data in Official Statistics. In this paper, we focus on linear regression models built in the presence of this type of nominal categorical predictor variables, and study the consolidation of their categories to have a better tradeoff between interpretability and fit of the model to the data. We propose the so-called Tree based Linear Regression (TLR) model that optimizes both the accuracy of the reduced linear regression model and its complexity, measured as a cost function of the level of granularity of the representation of the hierarchical categorical variables. We show that finding non-dominated outcomes for this problem boils down to solving Mixed Integer Convex Quadratic Problems with Linear Constraints, and small to medium size instances can be tackled using off-the-shelf solvers. We illustrate our approach in two real-world datasets, as well as a synthetic one, where our methodology finds a much less complex model with a very mild worsening of the accuracy.
KW - Hierarchical categorical variables
KW - Linear regression models
KW - Accuracy vs. model complexity
KW - Mixed integer convex quadratic problem with linear constraints
U2 - 10.1016/j.eswa.2022.117423
DO - 10.1016/j.eswa.2022.117423
M3 - Journal article
SN - 0957-4174
VL - 203
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 117423
ER -