Modélisation de l'acquisition d'une seconde langue

by Margot Lacour

Thesis project in Computer Science

Supervised by Alexandre Allauzen and Guillaume Wisniewski.

Thesis in preparation at Université Paris-Saclay, within the doctoral school École doctorale Sciences et technologies de l'information et de la communication (Orsay, Essonne; 2015-....), in partnership with LIMSI - Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (laboratory), TLP - Traitement du Langage Parlé (research team), and Université Paris-Sud (institution where the thesis is prepared), since 01-10-2018.

  • Translated title

    Second Language Acquisition Modeling


  • Abstract

    Educational applications have recently grown in popularity across many domains. These new educational tools rely on online platforms such as Duolingo or GymEnglish, which collect (and make available) large amounts of student learning data. Combined with recent developments in multilingual natural language processing and machine learning, this opens new research directions that use student trace data to support coursework and student progress from an educational perspective. Modeling second language acquisition from student trace data is a real challenge, since it involves the cross-lingual interaction of lexical knowledge, morpho-syntactic processing, and other skills. Moreover, most work in NLP for second language learners has focused on intermediate-to-advanced students of English, in assessment and monolingual settings. Much less work has involved beginners, learners of languages other than English, or the study of learning over time. These aspects nevertheless correspond to real societal needs. This PhD project proposes to explore machine learning models of second language acquisition in a multilingual setting. The project will investigate different aspects of second language acquisition: the design of models and features that generalize across languages; the exploration of personalized and adaptive modeling strategies; and the prediction of language learning (and forgetting) over time. By accurately modeling student mistake patterns, the project will explore effective strategies for building personalized adaptive learning models. The expected road map is the following. The bibliographic work will first focus on recent developments in second language acquisition and on the different models of learning and forgetting over time.
Then, the student will design and experiment with different models to better characterize the challenges in terms of machine learning and natural language processing: the trade-off between feature engineering and model complexity, and the choice of learning strategies and algorithms. The next part will be dedicated to one of the following: proposing a personalized adaptive learning model; integrating features related to the learner's native language into the model; or extending this work to pronunciation learning. As a starting point, a recent challenge has distributed student trace data from users of Duolingo (one of the world's most popular online language-learning platforms). The data contain transcripts of exercises completed by students over their first 30 days of learning. These transcripts are annotated for token (word) level mistakes, and the task is to predict which mistakes each learner will make in the future.
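
The prediction task described above can be read as binary classification at the token level: given a learner and a word, estimate the probability of a mistake. The following is a minimal sketch under that assumption, not the method of the thesis or of the challenge. The records, the feature set (bias, learner identity, token identity), and the function names `features`, `predict`, and `train` are all hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

# Hypothetical toy records: (learner id, token, 1 if the learner made a mistake).
# These values are invented for illustration; they are not Duolingo data.
TRAIN = [
    ("u1", "bonjour", 0),
    ("u1", "mangeons", 1),
    ("u1", "mangeons", 1),
    ("u2", "bonjour", 0),
    ("u2", "mangeons", 1),
    ("u2", "chat", 0),
]


def features(user, token):
    """Sparse indicator features: a bias term, the learner, and the token."""
    return ["bias", f"user={user}", f"token={token}"]


def predict(weights, user, token):
    """Estimated probability that this learner makes a mistake on this token."""
    score = sum(weights[f] for f in features(user, token))
    return 1.0 / (1.0 + math.exp(-score))


def train(records, epochs=200, lr=0.1):
    """Fit a logistic regression with plain stochastic gradient descent."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for user, token, label in records:
            # Gradient of the log loss with respect to each active indicator.
            error = predict(weights, user, token) - label
            for f in features(user, token):
                weights[f] -= lr * error
    return weights


weights = train(TRAIN)
p_hard = predict(weights, "u2", "mangeons")
p_easy = predict(weights, "u2", "bonjour")
print(f"P(mistake | u2, mangeons) = {p_hard:.2f}")
print(f"P(mistake | u2, bonjour)  = {p_easy:.2f}")
```

A baseline of this kind ignores time entirely. The learning-and-forgetting aspect mentioned in the abstract would require additional features, for example the time elapsed since the learner last saw the token, which this sketch does not attempt.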