Thesis in progress

Modeling and Industrialization of Artificial Intelligence Solutions Adapted to Evolving and Heterogeneous Data Streams from Complex Banking Infrastructure

Author: Mariam Barry
Supervisors: Albert Bifet, Raja Chiky
Type: Thesis project
Discipline(s): Computer science, data, AI
Date: Enrolled in doctoral program on 15/11/2019
Institution(s): Institut polytechnique de Paris
Doctoral school(s): École doctorale de l'Institut polytechnique de Paris
Research partner(s): Laboratory: Laboratoire de Traitement et Communication de l'Information
Research team: DIG – Data, Intelligence and Graphs

Abstract

Artificial intelligence (AI) can extract valuable insights and enable real-time decision-making from diverse and ever-evolving data sources. However, using AI models in industrial applications, especially on data from diverse sources, raises significant challenges because the distribution of industrial data changes over time. With the rise of the Internet of Things and Industry 4.0, the number of digital devices has increased considerably, generating more and more heterogeneous and unstructured data. There is a need for adaptive models that can learn continuously and cope with the dynamic nature of input data. Online machine learning, a sub-field of machine learning, allows models to adapt to evolving patterns and trends in the data without periodic downtime or complete model replacement. Critical applications in various domains such as cybersecurity, finance, and healthcare require this type of online analysis of data streams. The research conducted during this thesis aims to develop adaptive and scalable streaming machine learning solutions that learn from heterogeneous streaming data and can be operationalized on large-scale infrastructures, in particular in the banking sector. In this thesis, we address several AI challenges in the streaming setting: big data summarization, dynamic knowledge graph construction, online change detection, and the operationalization of streaming models in production. First, we propose an incremental algorithm and system (StreamFlow) for big data summarization that produces feature vectors suited to both batch and online machine learning tasks. These enriched features are used to train batch and online machine learning models, improving performance in terms of both time and accuracy. Second, we propose Stream2Graph, a stream-based solution for building and updating knowledge graphs dynamically and incrementally. Experiments demonstrate that graph features combined with online learning considerably improve ML results. Third, we propose StreamChange, an explainable online change detection model with desirable properties for big data streaming, such as constant space and time complexity. Experiments on real-world data show that it outperforms state-of-the-art models on both gradual and abrupt changes. Finally, we demonstrate how to operationalize online machine learning in production, allowing for horizontal scaling and incremental learning from streaming data in real time. Experiments on feature-evolving datasets with millions of dimensions demonstrate the effectiveness of our MLOps pipelines. Our design ensures model versioning, monitoring, auditability, and reproducibility, and the results confirm the efficiency of online learning models over batch methods in terms of both time and space complexity.