Defended thesis

Optimistic strategies in reinforcement learning (Stratégies optimistes en apprentissage par renforcement)

Author: Sarah Filippi
Supervisors: Olivier Cappé, Aurélien Garivier
Type: Doctoral thesis
Discipline(s): Signal and images
Date: Defended in 2010
Institution(s): Paris, Télécom ParisTech

Abstract


This thesis concerns model-based methods for solving reinforcement learning problems: these methods define a set of models that could explain the interaction between an agent and an environment. We consider different models of interaction: (partially observed) Markov decision processes and bandit models. We show that our novel algorithms perform well both in practice and in theory. The first algorithm consists of following an exploration policy, during which the model is estimated, and then an exploitation policy. The duration of the exploration phase is controlled in an adaptive way. We thus obtain a logarithmic regret for a parametric Markov decision problem even when the state is partially observed. This model is motivated by an application of interest in cognitive radio: the opportunistic access to a communication network by a secondary user. We are also interested in optimistic algorithms, in which the agent chooses the optimal actions for the best possible model. We construct such an algorithm in a parametric bandit setting for a generalized linear model, and consider an online advertisement application. We then use the Kullback-Leibler divergence to construct the set of plausible models in optimistic algorithms for finite Markov decision processes. This change of metric is studied in detail and leads to significant improvements in practice.
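The optimistic strategy for generalized linear bandit models mentioned in the abstract can be illustrated with a minimal sketch. The Python snippet below is not the thesis algorithm itself, only a hedged illustration of the general idea: fit a generalized linear model (here with a logistic link, an arbitrary choice) to past plays, then select the arm maximizing the estimated reward plus an exploration bonus. The function name `glm_ucb_round`, the Newton-step estimator, and the exploration coefficient `alpha` are illustrative assumptions, not elements of the thesis.

```python
import numpy as np

def glm_ucb_round(arms, X_hist, r_hist, t,
                  mu=lambda z: 1.0 / (1.0 + np.exp(-z)), alpha=1.0):
    """One optimistic (UCB-style) arm choice for a generalized linear bandit.

    arms   : (K, d) array of arm feature vectors
    X_hist : (n, d) array of previously played feature vectors
    r_hist : (n,) array of observed rewards in [0, 1]
    mu     : inverse link function (logistic here, purely for illustration)
    alpha  : exploration coefficient standing in for the theoretical rate
    """
    d = arms.shape[1]
    # Regularized design matrix of past plays, so it is invertible early on.
    M = np.eye(d) + X_hist.T @ X_hist
    # Crude ridge-penalized maximum-likelihood estimate via a few Newton steps.
    theta = np.zeros(d)
    for _ in range(25):
        p = mu(X_hist @ theta)
        grad = X_hist.T @ (r_hist - p) - theta
        hess = -(X_hist.T * (p * (1 - p))) @ X_hist - np.eye(d)
        theta = theta - np.linalg.solve(hess, grad)
    # Optimism: add a bonus proportional to the norm ||x||_{M^{-1}}.
    Minv = np.linalg.inv(M)
    bonus = alpha * np.sqrt(np.log(t + 1)) * np.sqrt(
        np.einsum("kd,de,ke->k", arms, Minv, arms))
    return int(np.argmax(mu(arms @ theta) + bonus))
```

For example, with `arms = np.eye(3)` and empty histories (`X_hist = np.empty((0, 3))`, `r_hist = np.empty(0)`), the estimate is zero and the choice is driven entirely by the exploration bonus.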
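Likewise, the use of the Kullback-Leibler divergence to build the set of plausible transition models in finite Markov decision processes can be sketched as follows. The snippet only illustrates the inner optimistic step: finding the most favourable transition vector within a KL ball around the empirical estimate. It relies on a generic constrained optimizer purely for illustration (the names `optimistic_value` and `radius` are assumptions); a purpose-built one-dimensional search would be far more efficient in practice.

```python
import numpy as np
from scipy.optimize import minimize

def kl_div(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

def optimistic_value(p_hat, V, radius):
    """Most favourable expected next-state value q . V over the KL ball
    {q : KL(p_hat || q) <= radius} around the empirical estimate p_hat."""
    n = len(p_hat)
    cons = (
        {"type": "eq",   "fun": lambda q: np.sum(q) - 1.0},
        {"type": "ineq", "fun": lambda q: radius - kl_div(p_hat, q)},
    )
    res = minimize(lambda q: -np.dot(q, V), x0=np.asarray(p_hat, float),
                   bounds=[(0.0, 1.0)] * n, constraints=cons)
    return float(-res.fun), res.x

# Illustrative call: empirical transition estimate, next-state values, KL radius.
p_hat = np.array([0.5, 0.3, 0.2])
V = np.array([1.0, 0.0, 2.0])
best_val, q_star = optimistic_value(p_hat, V, radius=0.1)
```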