Étude des facteurs de pertinence dans la recherche de microblogs

by Firas Damak

Doctoral thesis in Image, information, hypermedia

Defended in 2014

at Toulouse 3.


  • Abstract

    Our work lies in the field of social information retrieval (IR) and focuses on microblog search. IR models must adapt to the specificities of microblogs: freshness, social aspects and syntactic particularities must all be taken into account. Our work aims to improve the quality of ad hoc information retrieval results in microblogs. Our contributions are at several levels:
    - We conducted a failure analysis of a standard retrieval model. We found that the main problem stems from the brevity of microblogs. This brevity limits the matching between microblog terms and query terms, even when they are semantically similar.
    - To compensate for the impact of this brevity, we proposed expanding queries (i) by exploiting news-type resources and (ii) by applying relevance feedback techniques. Finally, we expanded the microblogs using the links (URLs) they contain. Our experiments showed that the use of URLs and query expansion are essential for IR in microblogs.
    - We took the criteria frequently used in the state of the art and evaluated them. We showed that the URL-related criteria are the most discriminating.
    - To take the temporal aspect into account when returning relevant microblogs, we proposed three methods that integrate time into the relevance computation. However, this integration of time did not prove beneficial in our methods.
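
The term mismatch caused by brevity, and the effect of expanding a microblog with the page it links to, can be illustrated with a minimal sketch. The query, the tweet, the URL and the page text are invented for illustration, and `linked_pages` is a hypothetical dictionary of already-downloaded page text; this is not the retrieval model used in the thesis.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap(query, document):
    """Return the set of query terms that also occur in the document."""
    return set(tokenize(query)) & set(tokenize(document))

def expand_with_linked_pages(microblog, linked_pages):
    """Append the text of every linked page to the microblog text.

    linked_pages is a hypothetical dict mapping a URL to page text that
    has already been fetched; no network access happens here.
    """
    urls = re.findall(r"https?://\S+", microblog)
    extra = " ".join(linked_pages.get(url, "") for url in urls)
    return (microblog + " " + extra).strip()

# Invented example data.
query = "earthquake damage japan"
tweet = "Huge quake this morning, stay safe http://example.com/a"
pages = {"http://example.com/a": "Earthquake in Japan causes damage to buildings"}

# The short tweet shares no term with the query, although it is on topic.
before = overlap(query, tweet)
# After expansion with the linked page, all three query terms match.
after = overlap(query, expand_with_linked_pages(tweet, pages))
```

Here the tweet says "quake" while the query says "earthquake", so an exact term match finds nothing until the linked page text is added.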

  • Translated title

    Study of salient factors for microblog search


  • Abstract

    This work lies in the field of social information retrieval (IR) and focuses on the retrieval of microblogs. Microblogs are short messages that carry information on various topics: opinions, events, articles, and so on. They represent a significant part of the information generated on the Web. On Twitter, the most popular platform, the number of microblogs can reach 500 million per day. Microblogs differ in form from traditional documents. They are much shorter than traditional blogs and Web articles (only 140 characters in the case of Twitter). Moreover, microblogs can use specific syntax such as #hashtags, @mentions or shortened URLs. Microblogging platforms also follow a social network model that differs from other social networks: relationships between users are not necessarily reciprocal, and subscriptions between microbloggers are unrestricted. Users of microblogging platforms do not only produce information, they also search for it. The motivations for these searches are diverse. Some are inherited from Web search (e.g. the search for news) and others are specific to microblog search (e.g. real-time search or social information). On Twitter, 1.6 billion queries are issued every day. IR models must therefore adapt to the specificities of microblogs: freshness, social aspects and syntactic characteristics must all be taken into account. The aim of our work is to improve the quality of the results of ad hoc information retrieval in microblogs. Our contributions are at several levels:
    - To determine accurately the factors that limit the performance of conventional retrieval models on a corpus of microblogs, we conducted a failure analysis of a conventional retrieval model. We selected relevant microblogs that the retrieval model failed to return, then identified the factors preventing their retrieval. We found that the main problem is the shortness of microblogs.
    - To offset the impact of the shortness of microblogs, we proposed and tested several solutions. We expanded the queries by (i) exploiting news articles, (ii) using the WordNet lexical database, and (iii) applying state-of-the-art relevance feedback techniques that have often proved effective: Rocchio, to identify terms likely to improve relevance and to weight the terms of the new query, and the native query expansion mechanism of the BM25 model. With Rocchio, we tested different methods of calculating the weight of expansion terms. Finally, we expanded the microblogs using the links (URLs) they contain. Our experiments have shown that the use of URLs and query expansion are crucial for IR in microblogs. Most of these experiments (expansion of queries and of microblogs) were performed with the vector space model and the probabilistic model as retrieval models. This allowed us to compare the behavior of the two models on microblogs and with the two types of expansion. In general, we found that the vector space model is more effective than the probabilistic model at retrieving relevant microblogs (better recall). However, the probabilistic model returns a higher proportion of relevant microblogs among all returned microblogs (better precision).
    - A second part of our work concerns the study of the features used to identify relevant microblogs. We selected the features often used in the state of the art (content features, features on the importance of authors, URL features and quality features) and evaluated them. We conducted this analysis along three axes. In the first axis, (i) we studied the behavior of the features in relevant documents and compared it with their behavior in non-relevant documents. In the second axis, (ii) we analyzed the impact of combining the feature scores with the content score computed by a conventional IR model. In the third axis, (iii) we used learning techniques as well as feature selection algorithms that may be useful as input to the learning techniques. In general, we have shown that the features related to URLs posted in tweets are the most discriminating. The features related to the authors do not reflect relevance.
    - To take the temporal aspect into account when selecting relevant microblogs, we proposed three methods that incorporate time in the calculation of relevance. However, this integration of time did not show any positive impact in our methods.
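
As a rough illustration of the Rocchio-style query expansion mentioned in the abstract, the sketch below adds the strongest terms from a set of feedback microblogs to the query and reweights the original terms. It is a simplified, positive-feedback-only variant using raw term frequencies and no stopword removal. The function name `rocchio_expand`, the `alpha`, `beta` and `n_terms` values and the example texts are assumptions for illustration; the abstract does not specify the weighting schemes actually tested in the thesis.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rocchio_expand(query, feedback_docs, alpha=1.0, beta=0.75, n_terms=3):
    """Return an expanded query as a dict mapping term -> weight.

    Simplified Rocchio with positive feedback only:
        q' = alpha * q + beta * centroid(feedback_docs)
    Only the n_terms heaviest new terms of the centroid are added.
    """
    q_vec = Counter(tokenize(query))

    # Centroid of the feedback documents (mean raw term frequency).
    centroid = Counter()
    for doc in feedback_docs:
        for term, tf in Counter(tokenize(doc)).items():
            centroid[term] += tf / len(feedback_docs)

    # Original query terms keep their weight plus the feedback evidence.
    expanded = {t: alpha * tf + beta * centroid.get(t, 0.0)
                for t, tf in q_vec.items()}

    # Candidate expansion terms: in the centroid but not in the query,
    # sorted by decreasing weight, ties broken alphabetically.
    candidates = sorted(
        ((t, w) for t, w in centroid.items() if t not in q_vec),
        key=lambda tw: (-tw[1], tw[0]),
    )
    for term, weight in candidates[:n_terms]:
        expanded[term] = beta * weight
    return expanded

# Invented example: three microblogs assumed relevant to the query.
feedback = [
    "earthquake tsunami warning japan",
    "tsunami alert after earthquake",
    "japan tsunami relief",
]
expanded = rocchio_expand("japan earthquake", feedback, n_terms=1)
```

With `n_terms=1`, the only term added is "tsunami", which occurs in all three feedback microblogs; the original query terms keep a higher weight than the added one.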

Consult in library

The defense version exists in paper form

Information

  • Details: 1 vol. (136 p.)
  • Appendices: Bibliography p. 126-136

Where is this thesis located?

  • Library: Université Paul Sabatier. Bibliothèque universitaire de sciences.
  • Available for interlibrary loan (PEB)
  • Call number: 2014 TOU3 0106
See in Sudoc, the union catalog of higher education and research libraries.