Defended thesis

Analysis of the Latent Representations of Neural Text-To-Speech Models for the Control of Expressive Audio-Visual Synthesis

Author: Martin Lenglet
Supervisors: Gérard Bailly, Olivier Perrotin
Type: Doctoral thesis
Discipline(s): Signal, image, speech, telecommunications
Date: Defended on 12/12/2023
Institution(s): Université Grenoble Alpes
Doctoral school(s): École doctorale électronique, électrotechnique, automatique, traitement du signal (Grenoble ; 199.-....)
Research partner(s): Laboratory: Grenoble Images parole signal automatique (2007-....)
Jury: President: Didier Schwab
Examiners: Gustav Eje Henter
Reviewers: Marie Tahon, Simon King

Abstract


In recent years, deep neural architectures have displayed groundbreaking performance in various speech processing areas, including Text-To-Speech (TTS). Models have grown bigger, with more layers and millions of trainable parameters, to achieve almost natural synthesis, at the expense of the interpretability of the intermediate representations they compute, called embeddings. However, the statistical learning performed by these neural models constitutes a valuable source of information about language. This presentation aims at opening this "black box" to explore the intermediate embeddings computed by state-of-the-art TTS models. By identifying phonetic and acoustic features in model representations, the proposed methods help to understand how neural TTS models are able to organize speech information in an unsupervised manner, and they provide new insights into phonetic regularities that are captured by statistical learning on massive data and lie beyond human expertise. This work opens the way toward designing more careful control architectures for neural TTS, without the need for additional data or training. These results led us to propose an auxiliary module for expressive synthesis called Local Style Tokens (LST), which models local variations in prosody with respect to the type of embeddings to bias.