Thèse de doctorat en Bioinformatique
Sous la direction de Thérèse Malliavin.
Soutenue le 16-06-2015
Le jury était composé de Didier Rognan, Véronique Stoven, Andreas Bender, Raphael Guerois, Alessandra Carbone.
Applications de proteochemometrics : à partir de l'extrapolation des espèces à la modélisation de la sensibilité de la lignée cellulaire
Proteochemometrics (PCM) est une bioactivité prophétique la méthode posante de simultanément modeler la bioactivité de ligands multiple contre des objectifs multiples...
Proteochemometrics (PCM) is a predictive bioactivity modelling method to simultaneously model the bioactivity of multiple ligands against multiple targets. Therefore, PCM permits to explore the selectivity and promiscuity of ligands on biomolecular systems of different complexity, such proteins or even cell-line models. In practice, each ligand-target interaction is encoded by the concatenation of ligand and target descriptors. These descriptors are then used to train a single machine learning model. This simultaneous inclusion of both chemical and target information enables the extra- and interpolation to predict the bioactivity of compounds on targets, which can be not present in the training set. In this thesis, a methodological advance in the field is firstly introduced, namely how Bayesian inference (Gaussian Processes) can be successfully applied in the context of PCM for (i) the prediction of compounds bioactivity along with the error estimation of the prediction; (ii) the determination of the applicability domain of a PCM model; and (iii) the inclusion of experimental uncertainty of the bioactivity measurements. Additionally, the influence of noise in bioactivity models is benchmarked across a panel of 12 machine learning algorithms, showing that the noise in the input data has a marked and different influence on the predictive power of the considered algorithms. Subsequently, two R packages are presented. The first one, Chemically Aware Model Builder (camb), constitues an open source platform for the generation of predictive bioactivity models. The functionalities of camb include : (i) normalized chemical structure representation, (ii) calculation of 905 one- and two-dimensional physicochemical descriptors, and of 14 fingerprints for small molecules, (iii) 8 types of amino acid descriptors, (iv) 13 whole protein sequence descriptors, and (iv) training, validation and visualization of predictive models. The second package, conformal, permits the calculation of confidence intervals for individual predictions in the case of regression, and P values for classification settings. The usefulness of PCM to concomitantly optimize compounds selectivity and potency is subsequently illustrated in the context of two application scenarios, which are: (a) modelling isoform-selective cyclooxygenase inhibition; and (b) large-scale cancer cell-line drug sensitivity prediction, where the predictive signal of several cell-line profiling data is benchmarked (among others): basal gene expression, gene copy-number variation, exome sequencing, and protein abundance data. Overall, the application of PCM in these two case scenarios let us conclude that PCM is a suitable technique to model the activity of ligands exhibiting uncorrelated bioactivity profiles across a panel of targets, which can range from protein binding sites (a), to cancer cell-lines (b).