Detection and identification of non-human sequence in next generation sequencing data of human tumors : a bioinformatics approach.

by Alexis Robitaille

Thesis project in Molecular and Cellular Aspects of Biology

Supervised by Massimo Tommasino and Magali Olivier.

Thesis in preparation in Lyon, under joint supervision (cotutelle) with the Centre international de recherche sur le cancer, within the doctoral school Biologie Moléculaire Intégrative et Cellulaire (BMIC), in partnership with Unité de Recherche Hors Contrat (research team), since 17 November 2016.

  • Translated title

    Détection et identification des séquences non humaines dans les données de séquençage nouvelle génération de tumeurs humaines : une approche bioinformatique

  • Abstract

    It is now well established that some viruses are etiologic agents of human cancer and cause 15% to 20% of all human tumors worldwide [1,2]. Moreover, epidemiological studies indicate that the list of human oncogenic pathogens will grow in the future for a variety of cancer types [3]. Infection with these viruses seems to be an essential, but not sufficient, step in the multistage process of carcinogenesis. Other changes, induced for instance by chemical carcinogens or radiation, are also required to transform a virus-infected cell into a tumor cell [4]. The main infectious agents contributing to the cancer burden were H. pylori (gastric cancer), HPV (cervical cancer), HBV (liver cancer), and HCV (liver cancer), which together accounted for 92% of all infection-attributable cancers worldwide [3]. However, progress in identifying viruses as causative agents of human cancers has been slow, hampered by the lack of good methods to rigorously detect and evaluate these organisms.

    Next-generation sequencing technologies offer a unique opportunity to identify new tumor-associated human pathogens. The International Cancer Genome Consortium [5] and other growing cancer sequence databases such as The Cancer Genome Atlas [6] allow an in-depth analysis of the non-human sequences present in complete human tumor genomes and transcriptomes. These databases offer a valuable resource of thousands of complete human tumor genomes and transcriptomes, whose major advantages are the unbiased detection of all known pathogens and the ability to detect even minute traces of viral presence. However, the screening of whole tumor transcriptome and genome sequencing data can be fully realized only in conjunction with the development of new and powerful bioinformatics tools. The prediction of both exogenous and endogenous pathogen nucleotide sequences in high-throughput sequencing data requires computational techniques to align and compare human tumor sequences against known viral sequences [7,8].

    1. Javier, R. T. & Butel, J. S. The History of Tumor Virology. Cancer Res. 68, 7693–7706 (2008).
    2. Cerwenka, A. & Lanier, L. L. Natural killer cells, viruses and cancer. Nat. Rev. Immunol. 1, 41–49 (2001).
    3. Plummer, M. et al. Global burden of cancers attributable to infections in 2012: a synthetic analysis. Lancet Glob. Health 4, e609–e616 (2016).
    4. Mrl, K., Hj, van K., Cf, van K., Pa, S. & Am, van L. Viruses and cancer. Virussen en kanker (1992).
    5. ICGC. International Cancer Genome Consortium. Available at: (Accessed: 17th May 2016)
    6. TCGA. The Cancer Genome Atlas Home Page. The Cancer Genome Atlas - National Cancer Institute. Available at: (Accessed: 17th May 2016)
    7. Borozan, I., Watt, S. N. & Ferretti, V. Evaluation of Alignment Algorithms for Discovery and Identification of Pathogens Using RNA-Seq. PLOS ONE 8, e76935 (2013).
    8. Borozan, I., Watt, S. & Ferretti, V. Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification. Bioinformatics 31, 1396–1404 (2015).

  • No abstract available.