Thesis defended

Détection des textes non-naturels (Detection of unnatural texts)

Author: Thomas Lavergne
Supervisor: François Yvon
Type: Doctoral thesis
Discipline(s): Computer science and networks
Date: Defended in 2009
Institution(s): Paris, Télécom ParisTech

Abstract

This thesis addresses the detection of unnatural language, particularly in the context of fighting web spam. The main goal is to improve the quality of web search engine results by automatically distinguishing legitimate content from fake content. The first part examines the various kinds of fake content found on the web, how they are used to generate web spam, and the existing methods for detecting them. The second part studies the more general question of what makes a text unnatural. Three definitions are proposed and illustrated through a taxonomy of such texts; the last is a pragmatic definition usable for the automatic detection of unnatural texts. The last part describes detection methods adapted to the different kinds of unnatural texts found in web spam. These methods, based on statistical models, use both the structure and the content of texts and are validated on synthetic and real data.