Kernel-based Partial Sufficient Variable Screening and Dimension Reduction with Categorical Controls
DOI
https://doi.org/10.25772/CKCS-BY33
Defense Date
2025
Document Type
Dissertation
Degree Name
Doctor of Philosophy
Department
Systems Modeling and Analysis
First Advisor
Dr. Chenlu Ke
Abstract
Variable selection and dimension reduction are two fundamental components of modern statistical and machine learning methodologies for analyzing high-dimensional datasets, which have become increasingly prevalent in the era of Big Data. However, most existing methods focus primarily on continuous data, even though many practical datasets contain both continuous and categorical variables. To address this challenge, this dissertation develops novel approaches for variable screening and dimension reduction specifically tailored to high-dimensional data involving categorical predictors.
For regression analyses with mixed predictor types, we propose a unified framework that constrains sufficient reduction of continuous variables through subpopulations defined by categorical variables. Leveraging reproducing kernel-based ANOVA statistics, a model-free extension of the classical ANOVA methods used in linear models, we identify important individual predictors and linear combinations of predictors without imposing stringent modeling assumptions. Unlike traditional marginal screening methods, our screening approach evaluates each predictor in the presence of the others, which improves variable selection accuracy. After identifying candidate predictors, we introduce a kernel-based sequential least squares method that efficiently reduces dimensionality by extracting a few critical linear combinations from the selected predictors. Compared to existing partial sufficient dimension reduction methods, our technique offers greater flexibility because it requires neither predefined model structures nor strong assumptions about predictor distributions. Additionally, our method accommodates both continuous and categorical response variables and does not rely on slicing when dealing with continuous responses. We develop both the theoretical and computational aspects of the proposed methods. Comprehensive simulation studies demonstrate their effectiveness across various regression and classification scenarios, supported by illustrative real-data applications.
Rights
© The Author
Is Part Of
VCU University Archives
Is Part Of
VCU Theses and Dissertations
Date of Submission
5-9-2025