A realistic approach to multi-view, multi-representation pedestrian detection for outdoor scenes

by Nicola Pellicanò

Thesis project in Signal and Image Processing

Supervised by Sylvie Le Hegarat and Emanuel Aldea.

Thesis in preparation at Paris Saclay, within the doctoral school Sciences et Technologies de l'Information et de la Communication, in partnership with SATIE - Systèmes et Applications des Technologies de l'Information et de l'Energie (laboratory), MOSS - Méthodes et outils pour les Signaux et Systèmes (research team) and Université Paris-Sud (institution where the thesis is prepared), since 01-10-2015.

  • Abstract

    The applicative context of our study is the analysis of the dynamics of a very dense crowd. The objective is to detect and track, as particles, the thousands of participants forming a very dense crowd, and, from these observations, to move towards proposing and validating a model of interaction between people. Our project is original in its focus on analysis at the particle level, as well as in the importance given to large-scale analysis. The prospective strategy is based on a feedback loop between detection and temporal tracking, which should allow us to handle the fundamental problem of this application: the uncertainty of associations. The interest of this study rests on the need for tools to understand the dynamics of very dense crowds, and thus to propose better solutions for the design of urban environments and transport infrastructures that are optimal in terms of the efficiency of the flows involved, but also with regard to improving citizens' quality of life. Another major interest of this research is its potential for preventing tragedies at large-scale social events. By the end of this project, we aim to propose a methodology for the analysis of very dense crowds that benefits from the latest advances in single-camera tracking algorithms, and that also proposes solutions for multi-camera association. Furthermore, we intend to support the research community by releasing a multi-camera dataset, which should foster a stronger involvement of the related communities that study crowds (physics, control, simulation, sociology).

  • Translated title

    Tackling pedestrian detection in large scenes with multiple views and representations

  • Abstract

    The main purpose of this research project is to enable us to understand some mechanisms of the evolution of human interactions in high-density crowds (more than 4-5 people per m²). The value of such a study rests on the need for better solutions for securing infrastructures, not only to improve the efficiency of the flows involved, but also to prevent fatalities during large-scale events and gatherings, or in areas which are regularly characterized by critical pedestrian densities. A well-documented example is the tragic incident at Makkah in 2006, when 363 people died in a large-scale stampede. By analyzing video sequences acquired during that event, abnormal patterns in the flow could have been identified as early as 30 minutes before the tragedy happened [4]. However, prevention requires a thorough understanding of the dynamics within dense crowds. Until now, their modelling using video data has been quite limited, and only specific phenomena have been documented, namely stop-and-go waves or laminar flow. The research contributions are generally based on simulations which rely on pedestrian interaction models. Studies that use real data are scarce, but they are of the utmost necessity for improving and validating pedestrian simulations. A shortcoming of simulation environments is that the interaction models are adequate for moderately crowded environments, but they do not scale realistically when pedestrian density is very high. In the case of real data, i.e. recordings of dense crowd movement, the extraction of pedestrian trajectories has been performed either by human operators, a process which is time-consuming and cumbersome, or in an unsupervised manner but only under specific conditions, i.e. overhead cameras and primitive methods. In both cases, a major hindrance is the strong occlusion among pedestrians, which makes extracting accurate trajectories or accurate local density information nearly impossible.
Some studies estimate the local density by exploiting optical flow, or by inferring the number of pedestrians over a wider area based on the estimation of an occlusion factor. However, such an analysis at a macroscopic level prevents the observation of the intricate interactions which are essential for proposing an accurate pedestrian model adapted to crowds. This limits the ability to extrapolate the behavior of a crowd accurately, either on a time scale or in a different setting. From a technical point of view, the problem of occlusion cannot be solved robustly with single-camera recordings. As the interest of the computer vision community extended gradually from single-pedestrian tracking in uncluttered scenes to crowd analysis, it has become clear that multiple-camera networks are required. A small-scale experiment proposed in [3] has proved the potential of multiple-camera tracking in occluded scenes. However, extending this type of solution to large-scale scenarios raises several scientific and technical questions, and relying on multiple-camera networks over wide, crowded areas is still considered challenging. A first deterrent is the joint calibration of cameras in a large crowded area, which is required in order to project image content across different views and thus perform object association. Calibration relies on specific targets introduced in the common field of view, or on salient features within the scene. The location of the ground plane is also generally required. In [3] for example, LED poles must be fully visible in the scene to solve the relative camera pose problem. No existing solutions apply to scenarios where the camera fields of view are entirely focused on a homogeneous moving mass 100 m away.
Recently, we have proposed an original solution to the calibration problem [1], based on synchronized stereo rigs, each of them featuring a long-focal camera which provides high-resolution images of the crowd and a wide-angle camera which is used for the relative pose estimation within the rig network. This solution allows us to align the long-focal cameras accurately and to constrain the pedestrian identification in different views along epipolar lines. The goal of the present proposal is to tackle the major challenge of simultaneously detecting and tracking, as particles, the thousands of pedestrians forming a high-density crowd, and, based on real data observations, to propose and validate a particle interaction model for crowd flow, which is a key element for predicting the dynamics of the crowd and for assisting security decision planners. The strategy we propose in order to analyse this type of system is based on a feedback mechanism involving particle segmentation and tracking, and it is intended to cope with the main difficulty of this problem: the uncertainty of data association. The uncertainty with regard to the identification of individuals in the crowd at a given time, i.e. association among views and segmentation, arises mainly from the significant variation of orientation among cameras, from the limited object resolution in pixels and from the homogeneity of the images. Whilst we can always adapt the hardware to tackle the first two problems, there are significant drawbacks. Besides the cost of increasing the number of rigs and their resolution, there are oftentimes logistical limitations as to where it is possible to place them, as in Makkah. For object association, we aim to propose a robust descriptor, based on the local topology, able to cope with significant variations of orientation.
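The epipolar constraint mentioned above can be illustrated with a minimal sketch: a detection in one long-focal camera restricts the search for its correspondence in the other view to a band around the epipolar line. The fundamental matrix `F`, the pixel coordinates and the tolerance below are hypothetical placeholders, not values from our calibration pipeline [1].

```python
import numpy as np

def epipolar_line(F, x):
    """Epipolar line l' = F @ x in the second view (homogeneous coordinates),
    normalized so that point-line distances are expressed in pixels."""
    x_h = np.array([x[0], x[1], 1.0])
    l = F @ x_h
    return l / np.linalg.norm(l[:2])

def point_line_distance(l, y):
    """Perpendicular distance (in pixels) of point y to the normalized line l."""
    return abs(l[0] * y[0] + l[1] * y[1] + l[2])

# Hypothetical fundamental matrix and detections (illustration only)
F = np.array([[0.0,  -1e-6,  1e-3],
              [1e-6,  0.0,  -2e-3],
              [-1e-3, 2e-3,  1.0]])
head_view1 = (640.0, 360.0)                       # head detection in camera 1
candidates_view2 = [(655.0, 350.0), (100.0, 500.0)]

l = epipolar_line(F, head_view1)
# Keep only candidates inside a tolerance band around the epipolar line;
# the band width absorbs the residual imprecision of the calibration.
matches = [y for y in candidates_view2 if point_line_distance(l, y) < 5.0]
```

The tolerance band, rather than an exact line test, is what makes the constraint usable in practice despite calibration noise.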
As a comparison, the method used in [3] relies entirely on the intensity level of pixels, which is not adapted given the variations in intensity among viewpoints and the inherent imprecision of the calibration process. Secondly, we intend to propose a descriptor with a graceful degradation of accuracy, and to estimate the expected performance as a function of orientation variation. Also, in order to cope with the relative homogeneity of the crowd images, we propose to introduce a feedback mechanism binding the object association to the following stages of the algorithm. The underlying idea is that some elements in the crowd (e.g. a distinctive white hat, a dark-haired pedestrian on a light background) exhibit better saliency, and they may be used iteratively as constraints in order to decrease the uncertainty level of surrounding hypotheses, based on spatial consistency. From the segmentation point of view, this means that we may restrict the potential association space from one image to another to convex areas determined by the closest salient features in the initial view. From the tracking (temporal association of particles) point of view, we may bootstrap the system using the salient features, thus exploiting the topological properties of the deformation of planar graphs. Afterwards, we may perform tracking of generic particles, again with extra constraints provided by the first step. Both segmentation and tracking thus contribute to decreasing the estimation uncertainty of the particle system, and they may be applied iteratively until the estimation reaches a stable state in space and time. Regular segmentation methods based on gradient information, mathematical morphology or, where applicable, learning of prior models are also powerful tools, already used in the research community, which we intend to employ jointly with the association among images to highlight individuals.
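The spatial-consistency idea can be sketched in a deliberately simplified form (the function name, the pure-translation model and all values are our assumptions; the actual descriptor and convex-region construction are more involved): once a few salient anchors have been matched confidently across views, a non-salient detection keeps only the cross-view candidates that are close to the position predicted by its nearest anchors.

```python
import numpy as np

def consistent_candidates(p, candidates, anchors_a, anchors_b, k=3, tol=20.0):
    """Prune cross-view candidates for point p (view A) using the mean
    displacement of its k nearest confidently matched salient anchors.
    anchors_a, anchors_b: (N, 2) arrays of corresponding anchor positions
    in view A and view B. Assumes local spatial consistency of the crowd."""
    dists = np.linalg.norm(anchors_a - p, axis=1)
    nearest = np.argsort(dists)[:k]
    # Predicted position of p in view B under a local-translation assumption
    expected = p + (anchors_b[nearest] - anchors_a[nearest]).mean(axis=0)
    return [c for c in candidates
            if np.linalg.norm(np.asarray(c) - expected) < tol]
```

Each resolved ambiguity can in turn be promoted to an anchor, which is the iterative uncertainty-reduction loop described above.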
We expect nevertheless to observe that the multiple-view aspect of our approach will be decisive in alleviating the fundamental problems of occlusion and homogeneity - see for example the results in [1]. We have started exploring the applicability of discriminative learning to pixel-level head detection in single camera views [2], and we intend to follow the strategy mentioned earlier by tracking detection cues in single cameras and by performing data fusion among multiple views. Regarding the tracking mechanism, the scientific challenge of our proposal arises from the size of the task, which has to be considered as a large-scale multiple-object problem, where occlusions might require associations among observations which are distant on the time scale. We thus expect to encounter tractability challenges when inferring the optimal global state of the system, which will drive us to adapt the current solutions used within the community. The validation of the results is a scientific challenge in itself; we intend to exploit the multi-camera character of our approach in order to determine the ground truth more reliably, and also to provide a high-quality dataset for other single-view or multiple-view studies.
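For concreteness, the frame-to-frame association baseline that a global formulation must improve upon can be sketched as a greedy nearest-neighbour assignment with gating (the function name, gate value and 2D point representation are our assumptions, not part of the proposal):

```python
import numpy as np

def greedy_associate(tracks, detections, gate=15.0):
    """Greedy frame-to-frame association: each track, most confident first,
    claims its nearest unclaimed detection within a gating radius.
    tracks, detections: (T, 2) and (D, 2) arrays of image positions.
    A naive sketch; long occlusions require associating observations that
    are distant in time, which this purely local scheme cannot handle."""
    cost = np.linalg.norm(tracks[:, None, :] - detections[None, :, :], axis=2)
    pairs, used = [], set()
    for t in np.argsort(cost.min(axis=1)):      # most confident tracks first
        for d in np.argsort(cost[t]):           # nearest detections first
            if d not in used and cost[t, d] < gate:
                pairs.append((int(t), int(d)))
                used.add(d)
                break
    return pairs
```

With thousands of particles and long-range temporal links, the assignment becomes a global optimization over the whole particle system, which is precisely where the tractability challenges mentioned above arise.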