Détection des textes non-naturels
Author: | Thomas Lavergne |
Supervisor: | François Yvon |
Type: | Doctoral thesis |
Discipline(s): | Computer science and networks |
Date: | Defended in 2009 |
Institution(s): | Paris, Télécom ParisTech |
Abstract
This thesis concerns unnatural language detection, especially in the context of fighting Web spam. The main goal is to improve the quality of results produced by web search engines by automatically distinguishing between legitimate and fake content. The first part examines the various kinds of fake content found on the web, how they are used to generate Web spam, and the existing methods for detecting them. The second part studies the more general problem of what makes a text unnatural. Three definitions are proposed and illustrated through a taxonomy of such texts, the last being a pragmatic definition usable for the automatic detection of unnatural texts. The last part describes detection methods adapted to the different kinds of unnatural texts found in Web spam. These methods, based on statistical models, use the structure as well as the content of texts and are validated on both synthetic and real data.