AARMS Scientific Machine Learning Seminar: Hamid Usefi (MUN)
November 9, 2021 @ 11:00 am - 12:00 pm
Multicollinearity, singular vectors, and dimensionality reduction for high-dimensional datasets
Single nucleotide polymorphisms (SNPs), as building blocks of our DNA, can determine the variations between people. SNPs in genes that regulate DNA mismatch repair, cell-cycle regulation, metabolism, and immunity are believed to be associated with genetic susceptibility to cancer, which makes SNPs potential diagnostic and therapeutic biomarkers in many cancer types. This has in part prompted rapid advances in DNA sequencing, which now make it feasible, in both cost and time, to genetically sequence a single suspect tissue. The number of SNPs in a disease dataset ranges from tens of thousands to several million, so one bottleneck in working with these genome datasets is their sheer size, which makes it difficult to render the data for meaningful analysis. Furthermore, in most diseases, at most a couple of hundred SNPs are associated with the disease. Simply put, we are looking for a needle in a haystack.
Machine learning algorithms are gaining increasing attention and are believed to have great potential for answering many questions in this respect. A common problem in machine learning and pattern recognition is identifying the most relevant features, especially when dealing with high-dimensional datasets in bioinformatics. In this talk, I will discuss some of our recent work on a new feature selection method, called Singular-Vectors Feature Selection (SVFS). Part of this work is joint with my recently graduated PhD student Majid Afshar. The method stems from identifying linearly dependent columns of a matrix A. This problem can also be viewed as multicollinearity and subset selection in statistical modelling, and it arises in many contexts, including regression, ecology, and machine learning.
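To make the connection between singular vectors and linearly dependent columns concrete, here is a small illustrative sketch (not the speaker's implementation): right singular vectors whose singular values are near zero span the null space of A, and their nonzero entries flag the columns involved in a linear dependency. The matrix A below is a made-up toy example.

```python
import numpy as np

# Toy matrix with a built-in dependency: the third column equals
# 2 * (first column) + (second column).
A = np.array([[1.0, 2.0, 4.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 2.0],
              [2.0, 1.0, 5.0]])

# Right singular vectors whose singular values are (near) zero form a
# basis of the null space of A.
U, s, Vt = np.linalg.svd(A)
tol = max(A.shape) * np.finfo(A.dtype).eps * s[0]
null_basis = Vt[s <= tol]  # here: a single vector proportional to (2, 1, -1)

# Columns with a nonzero entry in some null-space vector take part in
# a linear dependency; here that is all three columns.
dependent_cols = np.flatnonzero(np.abs(null_basis).sum(axis=0) > 1e-8)
```

On real SNP matrices with thousands of columns, the same idea applies, but the tolerance and the numerical rank have to be handled with more care than in this toy setting.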
Let D = [A | b] be a labeled dataset, where b is the class label and the features (attributes, or SNPs) are the columns of the matrix A; the rows of A can be viewed as samples. We show, with examples as well as a sketch of a proof, that the projector matrix P_A onto the null space of A can be used to partition the columns of A into clusters so that columns in a cluster correlate only with columns in the same cluster. In the first step, SVFS uses the projector P_D to find the cluster that contains b, and we reduce the size of A by discarding the features in the other clusters as irrelevant. In the next step, SVFS uses the projector P_A of the reduced A to partition the remaining features into clusters and chooses the most important features from each cluster. I will discuss the performance of SVFS on genomic datasets in comparison with state-of-the-art feature selection methods.
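The clustering step described above can be sketched as follows. This is my own minimal reconstruction, not the published SVFS code: it computes the null-space projector P_A = I - A⁺A and groups columns via the connected components of the nonzero pattern of P_A, so that features sharing a linear dependency land in the same cluster. The function names, the tolerance, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def null_space_projector(A):
    """Projector onto the null space of A, acting on feature space:
    P = I - pinv(A) @ A."""
    return np.eye(A.shape[1]) - np.linalg.pinv(A) @ A

def feature_clusters(A, tol=1e-8):
    """Group columns of A by the nonzero pattern of the null-space
    projector: features i and j share a cluster when |P[i, j]| > tol,
    i.e. they participate in a common linear dependency."""
    P = null_space_projector(A)
    adj = np.abs(P) > tol
    n = A.shape[1]
    labels = -np.ones(n, dtype=int)
    cluster = 0
    for start in range(n):
        if labels[start] >= 0:
            continue
        # Depth-first search over the dependency graph.
        stack = [start]
        labels[start] = cluster
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(adj[i]):
                if labels[j] < 0:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels

# Synthetic data with two independent dependency groups:
# columns 0-2 satisfy c2 = c0 + c1, columns 3-5 satisfy c5 = c3 - c4.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))
A = np.column_stack([X[:, 0], X[:, 1], X[:, 0] + X[:, 1],
                     X[:, 2], X[:, 3], X[:, 2] - X[:, 3]])
labels = feature_clusters(A)  # columns 0-2 form one cluster, 3-5 another
```

In the full method as described in the abstract, the same partition computed from P_D first isolates the cluster containing the label b; the sketch here only shows the clustering primitive itself.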
[ recording ]
- Meeting link: https://mun.webex.com/mun/j.php?MTID=m855b1e73549cf668f5b57190a7ef3eae
- Meeting number: 2633 483 6250
- Meeting password: W3FisMnJa86