Modélisation de séquences Biologiques: du Machine Learning à la Physique Statistique

by Marco Molari

Thesis project in Physics

Supervised by Simona Cocco and Rémi Monasson.

Thesis in preparation at Paris Sciences et Lettres, within the doctoral school École doctorale Physique en Île-de-France (Paris), in partnership with Laboratoire de Physique Statistique de l'E.N.S. (laboratory) and École normale supérieure (Paris ; 1985-....) (institution where the thesis is prepared), since 01-09-2016.


  • Abstract

    The project focuses on applying Machine Learning techniques to distributions of biological sequences. In particular, we will employ Restricted Boltzmann Machines (RBMs) with non-linear activation functions. We will first test them on artificial lattice proteins, which constitute a good theoretical benchmark, focusing on the functional meaning of the features detected by the hidden variables of the RBM. We will also consider more complex architectures, with more than one layer of hidden units. As a second step, RBMs will be applied to experimental data obtained by the group of C. Nizak at ESPCI. Their high-throughput techniques allow the measurement of mutational effects for up to 10^7 mutants per day. We will use our RBM model to predict whether or not a mutation will damage the protein's function and structure. Moreover, we will study how a mutation influences binding to a ligand. This latter study will involve dynamical versions of RBMs, the so-called conditional RBMs.

  • Translated title

    Modelling of Biological Sequences: from Machine Learning to Statistical Physics

