Stratégies optimistes en apprentissage par renforcement (Optimistic strategies in reinforcement learning)
Author: | Sarah Filippi |
Supervisors: | Olivier Cappé, Aurélien Garivier |
Type: | Doctoral thesis |
Discipline(s): | Signal and image processing |
Date: | Defended in 2010 |
Institution(s): | Paris, Télécom ParisTech |
Abstract
This thesis concerns model-based methods for solving reinforcement learning problems: these methods define a set of models that could explain the interaction between an agent and an environment. We consider several models of interaction: (partially observed) Markov decision processes and bandit models, and we show that our novel algorithms perform well both in practice and in theory.

The first algorithm follows an exploration policy, during which the model is estimated, and then an exploitation policy; the duration of the exploration phase is controlled adaptively. This yields a logarithmic regret for a parametric Markov decision problem even when the state is only partially observed. The model is motivated by an application of interest in cognitive radio: the opportunistic access to a communication network by a secondary user.

We are also interested in optimistic algorithms, in which the agent chooses the actions that are optimal for the best possible model among the plausible ones. We construct such an algorithm for a parametric bandit problem with a generalized linear model, and consider an online-advertising application. Finally, we use the Kullback-Leibler divergence to construct the set of likely models in optimistic algorithms for finite Markov decision processes. This change of metric is studied in detail and leads to significant improvements in practice.
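To make the optimism principle concrete, here is a minimal sketch of an optimistic index policy for Bernoulli bandits in the spirit of the abstract: for each arm, the set of plausible models is a Kullback-Leibler confidence region around the empirical mean, and the agent acts according to the most favorable model in that region. This is only an illustration of the general idea, not the thesis's actual algorithms; the confidence level `log(t)` and the bisection tolerance are assumptions made for the sketch.

```python
import math
import random

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_upper_bound(mean, count, t, precision=1e-6):
    """Largest q in [mean, 1] with count * KL(mean, q) <= log(t):
    the most optimistic mean among the plausible models for this arm."""
    level = math.log(max(t, 2)) / count
    lo, hi = mean, 1.0
    while hi - lo > precision:          # bisection: KL is increasing in q >= mean
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

def optimistic_bandit(arm_means, horizon, seed=0):
    """Run the optimistic policy on Bernoulli arms; return pull counts per arm."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:                 # initialization: pull each arm once
            arm = t - 1
        else:                           # optimism: play the arm with the best
            arm = max(range(n_arms),    # plausible model
                      key=lambda a: kl_upper_bound(sums[a] / counts[a],
                                                   counts[a], t))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

Run on two arms with means 0.2 and 0.8, the policy quickly concentrates its pulls on the better arm while still sampling the worse one often enough to keep its confidence region tight, which is the mechanism behind logarithmic regret bounds.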