ICTM - Public PhD Defence - Mr Jérôme PAUL

For the degree of Doctor of Engineering Sciences
Feature Selection from Heterogeneous Biomedical Data
Monday 29 June 2015 at 2:00 p.m., auditorium SCES 01

Modern personalised medicine uses high-dimensional genomic data to
perform customised diagnosis and prognosis. In addition, physicians record
several medical parameters to assess a patient's clinical status. In this thesis
we are interested in jointly using these different but complementary kinds of
variables to perform classification tasks. Our main goal is to make predictive
models interpretable by reducing the number of variables they use, keeping
only the most relevant ones. Selecting a few variables that predict a clinical
outcome greatly helps medical doctors to better understand the biological
process under study.
Mixing gene expression data and clinical variables is challenging
because of their different nature: genomic measurements are expressed on a
continuous scale, while clinical variables can be continuous or categorical.
Although the biomedical domain is the original motivation for this work, we
tackle the more general problem of feature selection in the presence of
heterogeneous variables. Few variable selection methods handle both kinds of
features jointly and directly, which is why we focus on tree ensemble methods
and kernel approaches.
Tree ensemble methods, such as random forests, successfully perform
classification on data with heterogeneous variables. In addition, they provide a
feature importance index that ranks variables according to their importance in
the predictive model. Yet this index suffers from two main drawbacks. First, the
resulting feature rankings are highly sensitive to small variations of the dataset.
Second, while the variables are accurately ranked, it is very difficult to decide
which features actually play a role in the decision process. This work puts forward
solutions to both problems. In an analysis of the stability of tree ensemble
methods, we show that feature rankings become considerably more stable when
many more trees are grown than are needed for good predictive performance.
We also introduce a statistically interpretable feature selection index, which
assesses whether variables are important for predicting the class of unseen
samples. The resulting p-values offer a very natural threshold for deciding which
features are significant.

Apart from tree ensemble approaches, few feature selection methods
handle continuous and categorical variables in an embedded way. It is, however,
possible to build classifiers that profit from both kinds of data by using kernels.
In this thesis, we adapt those techniques to perform heterogeneous feature
selection.
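The tree-ensemble contributions described above can be illustrated with off-the-shelf tools. The sketch below estimates permutation-based p-values for random-forest importances by refitting on shuffled class labels; it is a generic permutation scheme in Python/scikit-learn, not the exact index proposed in the thesis, and all data and parameters are illustrative.

```python
# Hedged sketch: p-values for random-forest feature importances via
# label permutation (a generic scheme, not the thesis's exact index).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy heterogeneous data: two continuous "expression" features and one
# binary "clinical" feature (categorical variables are assumed encoded).
n = 200
X = np.column_stack([
    rng.normal(size=n),          # informative continuous feature
    rng.normal(size=n),          # pure-noise continuous feature
    rng.integers(0, 2, size=n),  # binary clinical feature
]).astype(float)
y = (X[:, 0] + 0.8 * X[:, 2] + rng.normal(scale=0.5, size=n) > 0.5).astype(int)

def importances(X, y, seed):
    # Growing many trees stabilises the ranking, as the abstract notes.
    rf = RandomForestClassifier(n_estimators=500, random_state=seed)
    return rf.fit(X, y).feature_importances_

obs = importances(X, y, seed=0)

# Null distribution of importances under shuffled class labels.
n_perm = 20
null = np.array([importances(X, rng.permutation(y), seed=i)
                 for i in range(n_perm)])

# Empirical p-value per feature, with the usual +1 correction.
pvals = (1 + (null >= obs).sum(axis=0)) / (n_perm + 1)
print(pvals)
```

In this toy setting the strongly informative first feature should receive a small p-value and the noise feature a large one, giving the natural significance threshold the abstract mentions.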
We propose two kernel-based algorithms that rely on a recursive feature
elimination procedure, extracting the importance of the variables either from a
non-linear SVM or from multiple kernel learning. These approaches are shown
to provide state-of-the-art results in terms of predictive performance and feature
selection stability.
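The recursive feature elimination idea can be sketched as follows, using a linear SVM for simplicity (the thesis extracts importances from a non-linear SVM or from multiple kernel learning; the data and parameters here are illustrative).

```python
# Hedged sketch: recursive feature elimination (RFE) driven by an SVM.
# A linear kernel is used for simplicity; the thesis works with a
# non-linear SVM or multiple kernel learning instead.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Toy data: an informative continuous feature, a noise feature,
# and an informative binary "clinical" feature.
n = 200
X = np.column_stack([
    rng.normal(size=n),          # informative continuous feature
    rng.normal(size=n),          # pure-noise feature
    rng.integers(0, 2, size=n),  # informative binary feature
]).astype(float)
y = (X[:, 0] + X[:, 2] + rng.normal(scale=0.3, size=n) > 0.5).astype(int)

# RFE repeatedly fits the SVM and drops the feature whose weight
# has the smallest magnitude, until the requested number remains.
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the retained features
```

On this toy data the noise feature should be the one eliminated, leaving the continuous and binary informative features in the final model.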

Jury members:
Prof. Pierre Dupont (UCL), Supervisor
Prof. Charles Pecheur (UCL), Chair
Prof. John Lee (UCL), Secretary
Prof. Michel Verleysen (UCL)
Prof. Yvan Saeys (Universiteit Gent)
Prof. Daniel Hernández Lobato (Universidad Autónoma de Madrid, Spain)

Published 15 June 2015