Affordable Access

Variable selection and outlier detection viamixed integer programming

  • Jammal, Mahdi
Publication Date
Dec 17, 2020
External links


Two principles at the forefront of modern machine learning and statistics are sparse modeling and robustness. Sparse modeling enables the construction of simpler statistical models. At the same time, statistical models need to be robust they should perform well when data is noisy in order to make reliable decisions. While sparsity and robustness are often closely related, the exact relationship and subsequent trade-offs are not always transparent. For example, convex penalties like the Lasso are often motivated by sparsity considerations, yet the success of these methods is also driven by their robustness. In this thesis, we work at improving the quality of the estimators in supervised machine learning in terms of robustness and sparsity. We propose methods that, jointly, perform variable selection and outlier detection, and formulate the obtained optimization problems using mixed integer programming (MIP) benefiting from the significant improvement of MIP solvers. First we focus on proposing a robust and sparse method for linear regression. To solve the problem exactly, we recast it as a mixed integer programming problem. In addition, and in order to decrease the computational time, we propose a discrete first algorithm providing a near-optimal solution in a very short time. The obtained solution is used as a warm start for the MIP solver. However, the proposed problem suffered from overfitting for low signal-to-noise ratio values. Then, to fix the overfitting behavior of the proposed method, we use penalized regularization to improve its performance when the noise is high. We also propose a discrete first order algorithm to solve the regularized approach. Finally, we propose a robust and sparse classification method based on the classical hinge-loss classifier. The obtained problem is formulated using mixed integer programming and shown to be efficient on both real and synthetic data sets.

Report this publication


Seen <100 times