Directed hypergraph extraction

by Loc Tran

Thesis project in Computer science, mathematics and applications

Supervised by Marc Bui and Michel Amandry.

Thesis in preparation at Université Paris sciences et lettres, within the École doctorale de l'École pratique des hautes études, in partnership with Cognition humaine et artificielle (laboratory) and EPHE PARIS (institution managing enrolment), since 01-10-2018.

• Abstract

In recent years, I have been especially interested in applying the graph Laplacian to dimensionality reduction, clustering, and semi-supervised learning (i.e. Laplacian Eigenmaps, spectral clustering, and graph-based semi-supervised learning). These methods have many applications. Specifically, Laplacian Eigenmaps have been applied to image retrieval [1], spectral clustering to speech separation [2], and graph-based semi-supervised learning to protein function prediction.

Dimensionality Reduction Methods

In 2003, Mikhail Belkin and Partha Niyogi introduced a theoretical framework for Laplacian Eigenmaps [3]. Unlike Principal Component Analysis, Laplacian Eigenmaps preserve the local structure of the data points after the mapping [3]. This is their main strength. However, the authors did not state whether their eigenmap is the random-walk normalized or the symmetric normalized Laplacian Eigenmaps when they solve the generalized eigenvalue problem Ly = λDy (*), where L is the un-normalized graph Laplacian matrix and D is the degree matrix. In particular, there are two ways to solve this generalized eigenvalue problem, and they lead to two completely different eigenmaps. To the best of my knowledge, no one has pointed this out. If I'm admitted, I will try to clarify this fact in depth. I will also develop the un-normalized Laplacian Eigenmaps, which have not been investigated up to now.

Laplacian Eigenmaps have many applications, such as image retrieval and some problems in bioinformatics [1, 4]. For problems in bioinformatics, to the best of my knowledge, the un-normalized Laplacian Eigenmaps and the random-walk Laplacian Eigenmaps have not been applied to microarray data or SNP array data.
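The two routes for solving the generalized eigenvalue problem (*) can be written out as a short derivation. This is only a sketch under stated assumptions: W denotes the weight (adjacency) matrix of the graph, with L = D - W, and every vertex is assumed to have positive degree so that D is invertible. The symbols W, z, L_rw, and L_sym do not appear in the text above and are introduced here for illustration.

```latex
\begin{align*}
L y &= \lambda D y, \qquad L = D - W, \\
\text{(random walk)}\quad D^{-1} L\, y &= \lambda y,
  \qquad L_{\mathrm{rw}} = D^{-1} L = I - D^{-1} W, \\
\text{(symmetric)}\quad D^{-1/2} L D^{-1/2}\, z &= \lambda z,
  \qquad z = D^{1/2} y,\quad
  L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}.
\end{align*}
```

Both routes give the same eigenvalues λ, but the eigenvectors differ by the degree-dependent scaling z = D^{1/2} y. The embedding built from y therefore generally differs from the one built from z, which is why it matters which of the two is meant.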
These Laplacian Eigenmaps not only overcome the curse of dimensionality but also preserve the local structure of the data points after the mapping. After applying them to microarray data and SNP data, I will compare the resulting mappings with other mappings such as Principal Component Analysis (PCA) and Locally Linear Embedding (LLE).

Possible Extension: In the past 20 years, multilinear algebra has attracted the interest of many prominent scientists, such as Gene Golub (Stanford) and Tamara Kolda (Sandia National Laboratories, USA). A lot of work has been done in this field, such as Higher-Order SVD [5], multilinear PCA [6], and multilinear LDA [7]. Orly Alter not only studies multilinear algebra in depth but also applies the Higher-Order Singular Value Decomposition to the integration of DNA microarray data from different studies [8]. It is known that there is a close relationship between SVD and PCA [9]. Hence multilinear PCA can also be applied to DNA microarray data. To the best of my knowledge, this has not been done by anyone. Moreover, I can try to develop the weighted hypergraph Laplacian Eigenmaps and apply these novel methods to the Zoo dataset available from the UCI repository. Finally, I can also develop hypergraph p-Laplacian Eigenmaps methods and apply them to the same dataset.

Spectral Clustering Methods

Over the last three decades, spectral graph clustering has been one of the most popular clustering methods. It often outperforms k-means clustering and has been investigated in depth by many computer scientists, such as Ulrike von Luxburg [10]. In particular, in 1992, Lars Hagen and Andrew Kahng developed un-normalized spectral graph clustering [11], while Shi and Malik and Ng, Jordan, and Weiss developed normalized spectral graph clustering [12, 13].
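To make the un-normalized versus normalized distinction concrete, the sketch below evaluates the two cut objectives that these methods are usually described as relaxing: the ratio cut for un-normalized spectral clustering and the normalized cut for the normalized variants. The toy graph, the function names, and the two-way definitions used here, cut(A,B)(1/|A| + 1/|B|) and cut(A,B)(1/vol(A) + 1/vol(B)), are illustrative assumptions; constant factors differ between papers.

```python
# Toy graph (an assumption for illustration): two triangles {0,1,2} and {3,4,5}
# joined by the single edge 2-3. W is a symmetric 0/1 weight matrix.
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]


def cut_weight(W, A, B):
    """Total weight of the edges running between vertex sets A and B."""
    return sum(W[i][j] for i in A for j in B)


def volume(W, A):
    """Sum of the degrees of the vertices in A."""
    return sum(sum(W[i]) for i in A)


def ratio_cut(W, A, B):
    """Two-way ratio cut: the cut weight balanced by the set sizes."""
    return cut_weight(W, A, B) * (1 / len(A) + 1 / len(B))


def normalized_cut(W, A, B):
    """Two-way normalized cut: the cut weight balanced by the set volumes."""
    c = cut_weight(W, A, B)
    return c / volume(W, A) + c / volume(W, B)


# Two candidate partitions: split at the bridge, or peel off a single vertex.
balanced = ([0, 1, 2], [3, 4, 5])
lopsided = ([0], [1, 2, 3, 4, 5])

for name, (A, B) in (("balanced", balanced), ("lopsided", lopsided)):
    print(name, ratio_cut(W, A, B), normalized_cut(W, A, B))
```

On this graph both objectives prefer the split at the bridge. In general they can rank partitions differently, because one balances by the number of vertices and the other by the sum of degrees.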
The properties of the graph Laplacians and the connection between spectral graph clustering and graph partitioning problems have also been developed by these authors [10]. Spectral clustering has many applications, such as circuit partitioning [11] and image segmentation [12]. Specifically, the normalized cut methods have not been applied to circuit partitioning problems, and the ratio cut has not been applied to image segmentation problems. If I'm admitted, I will do a literature survey of these spectral clustering methods. For bioinformatics problems, we can apply spectral clustering methods to protein-protein interaction networks. Finally, I can also apply spectral clustering techniques to financial networks.

Possible Extension: The application of spectral clustering methods to time-evolving graphs (tensor data) has been developed by Lars Elden. He presented his idea at the IMA but did not publish any paper related to it. I will try to understand his ideas and develop similar methods for tensor graph data. Moreover, I can try to develop combinatorial and random walk hypergraph clustering methods and apply these novel methods to the Zoo dataset available from the UCI repository. Finally, I can also develop hypergraph p-Laplacian based clustering methods and apply them to the same dataset.

Graph-Based Semi-Supervised Learning

In the past, Xiaojin Zhu and Dengyong Zhou successfully developed graph-based semi-supervised learning methods, which also make use of graph Laplacians. In detail, Xiaojin Zhu's method uses the random walk graph Laplacian [14], while Dengyong Zhou's methods use the symmetric normalized graph Laplacian [15]. These methods have many applications, such as digit recognition, text classification, and protein function prediction [15, 16].
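As a minimal sketch of how a symmetric normalized operator can drive label propagation, the code below iterates F <- alpha * S * F + (1 - alpha) * Y with S = D^{-1/2} W D^{-1/2}. This is the commonly cited "local and global consistency" update; it is an assumption that this matches the method meant by [15]. The toy graph, the function names, and the values alpha = 0.9 and 200 iterations are hypothetical choices for illustration.

```python
import math


def normalized_similarity(W):
    """Return S = D^{-1/2} W D^{-1/2} for a symmetric weight matrix W."""
    deg = [sum(row) for row in W]
    n = len(W)
    return [
        [W[i][j] / math.sqrt(deg[i] * deg[j]) if W[i][j] else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def propagate_labels(W, labels, alpha=0.9, iterations=200):
    """Spread the known labels over the graph and return one class per vertex.

    labels maps the index of each labeled vertex to its class.
    """
    n = len(W)
    classes = sorted(set(labels.values()))
    # Y holds a one-hot row for each labeled vertex and zeros elsewhere.
    Y = [[1.0 if labels.get(i) == c else 0.0 for c in classes] for i in range(n)]
    S = normalized_similarity(W)
    F = [row[:] for row in Y]
    for _ in range(iterations):
        F = [
            [alpha * sum(S[i][j] * F[j][k] for j in range(n))
             + (1 - alpha) * Y[i][k]
             for k in range(len(classes))]
            for i in range(n)
        ]
    # Each vertex takes the class with the largest propagated score.
    return [classes[max(range(len(classes)), key=lambda k: F[i][k])]
            for i in range(n)]


# Toy graph (an assumption): two triangles {0,1,2} and {3,4,5} joined by edge 2-3,
# with one labeled vertex in each triangle.
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
predicted = propagate_labels(W, {0: 0, 5: 1})
print(predicted)
```

With one labeled vertex per triangle, each label spreads through its own triangle. A random walk variant would replace S with D^{-1} W, which is where the choice of graph Laplacian enters.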
In [16], the authors applied a graph-based semi-supervised learning method that uses the un-normalized graph Laplacian to the protein function prediction problem. All the datasets are available on their webpage. To the best of my knowledge, Xiaojin Zhu's method and Dengyong Zhou's methods have not been applied to protein function prediction problems. Moreover, Dengyong Zhou has also developed a hypergraph-based semi-supervised learning method [17] (a hypergraph is a generalization of a graph). In a graph model, an edge connects exactly two vertices. In a hypergraph model, however, an edge (or hyperedge) can connect more than two vertices. A family of semi-supervised learning methods based on discrete graph operators has also been developed by Dengyong Zhou [18]. However, the discrete graph operators based on the combinatorial graph Laplacian have not been constructed and investigated. Moreover, to the best of my knowledge, the application of these methods to the protein function prediction problem and the cancer classification problem (i.e. bioinformatics problems) has not been investigated. If I'm admitted, I will do a literature survey of these methods, from graph-based methods to hypergraph-based methods. I will also extend the family of semi-supervised learning methods based on discrete graph operators and discrete hypergraph operators, and apply these methods to the protein function prediction problem (i.e. protein function classification) and to cancer classification.

References

[1] Xiaofei He. Laplacian Eigenmaps for Image Retrieval. Master's thesis, Computer Science Department, The University of Chicago, 2002.
[2] F. R. Bach and M. I. Jordan. Spectral clustering for speech separation. In J. Keshet and S. Bengio (Eds.), Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods. New York: John Wiley, 2008.
[3] M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6):1373-1396, June 2003.
[4] C. Bartenhagen, H. U. Klein, C. Ruckert, X. Jiang, and M. Dugas. Comparative study of unsupervised dimension reduction techniques for the visualization of microarray gene expression data. BMC Bioinformatics, 11:567, 2010.
[5] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl., 21:1253-1278, 2000.
[6] Haiping Lu, K. N. Plataniotis, and A. N. Venetsanopoulos. MPCA: multilinear principal component analysis of tensor objects. IEEE Trans. on Neural Networks, 19(1), January 2008.
[7] Shu Kong and Donghui Wang. A Report on Multilinear PCA Plus Multilinear LDA to Deal with Tensorial Data: Visual Classification as An Example. Technical report, arXiv:1203.0744, 2012.
[8] L. Omberg, G. H. Golub, and O. Alter. A Tensor Higher-Order Singular Value Decomposition for Integrative Analysis of DNA Microarray Data from Different Studies. Proceedings of the National Academy of Sciences (PNAS) USA, 104(47), November 2007.
[9] E. Kokiopoulou and Y. Saad. PCA and kernel PCA using polynomial filtering: a case study on face recognition. Report umsi-2004-213, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2004.
[10] Ulrike von Luxburg. A Tutorial on Spectral Clustering. Statistics and Computing, 17(4), December 2007.
[11] Lars Hagen and Andrew Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE Trans. on CAD of Integrated Circuits and Systems, 11(9):1074-1085, 1992.
[12] Jianbo Shi and Jitendra Malik. Normalized Cuts and Image Segmentation. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), June 1997.
[13] A. Y. Ng, M. I. Jordan, and Y. Weiss. On Spectral Clustering: Analysis and an algorithm. NIPS, 2001.
[14] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.