Représentations pour l'apprentissage statistique à grande échelle en génomique

by Romain Menegaux

Thesis project in Bioinformatics

Supervised by Jean-Philippe Vert.

Thesis in preparation at Paris Sciences et Lettres, within the doctoral school Ingénierie des Systèmes, Matériaux, Mécanique, Énergétique, in partnership with the Centre de Bio-informatique (laboratory) and the École nationale supérieure des mines (Paris) (institution where the thesis is prepared), since 1 October 2017.


  • Abstract

    The cost of DNA sequencing has been divided by 100,000 in the last 10 years. It is now so cheap that it has quickly become a routine technique for characterizing the genomic content of biological samples, with numerous applications in health, food, and energy. The output of a typical DNA sequencing experiment is a set of billions of short sequences, called reads, each 100 to 300 characters long over the {A,C,G,T} alphabet; these reads are then automatically processed and analyzed by computers to extract biological information, such as the presence of particular bacterial species in a sample or of a specific mutation in a cancer. As the throughput of DNA sequencing continues to increase rapidly, computation is quickly becoming the major bottleneck in many applications involving DNA sequencing. The goal of this PhD project is to advance the state of the art and propose new solutions for efficiently storing and analyzing the billions of reads produced by each experiment.

  • Translated title

    String embeddings for large-scale machine learning in genomics