DOI
https://doi.org/10.25772/AA4E-RY91
Author ORCID Identifier
https://orcid.org/optimaldatasplittingmethods
Defense Date
2025
Document Type
Dissertation
Degree Name
Doctor of Philosophy
Department
Systems Modeling and Analysis
First Advisor
Anh T Bui
Abstract
In predictive modeling, effective data splitting is crucial for creating statistically representative training and validation sets. State-of-the-art data splitting methods minimize the energy distance between the split subsets. However, the existing methods have several limitations, which this dissertation aims to address. First, the existing methods are computationally inefficient. Chapter 2 therefore proposes scalable Twinning (s-Twinning), which scales these approaches to big data and significantly improves the execution speed of data splitting without sacrificing accuracy. Second, the existing methods do not consider the predictive relationship in the data. Chapter 3 therefore proposes Alyke, a novel supervised data splitting method that heuristically minimizes the energy distance while accounting for this relationship. Alyke not only improves predictive accuracy but also offers better scalability, making it suitable for big data applications and supervised data compression tasks. Third, the existing approaches are not designed to handle image data. Chapter 4 addresses this drawback by incorporating deep neural networks and transfer learning. There, we propose Imalyke, an extension of Alyke, to handle the difficult problem of identifying a representative subset of an image dataset. Image data often exhibit complex spatial patterns, which complicates the creation of representative subsets for predictive modeling. Imalyke combines the principles of Alyke with transfer learning from deep neural networks to optimize performance on image data. This method enables accurate predictions on image datasets and addresses the critical need for reliability in image data applications.
Through simulations and real-world examples, we demonstrate how all three methods, s-Twinning, Alyke, and Imalyke, improve the efficiency and accuracy of data splitting across various domains, offering scalable and specialized solutions for modern predictive modeling challenges.
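For readers unfamiliar with the energy distance criterion named in the abstract, the following is a minimal sketch of how it could be estimated between two subsets. It is not s-Twinning, Alyke, or Imalyke; it only scores a plain random split with the standard sample estimate 2E||X-Y|| - E||X-X'|| - E||Y-Y'||. The function name `energy_distance`, the synthetic data, and the 80/20 split ratio are assumptions made for illustration.

```python
import numpy as np


def energy_distance(x, y):
    """Sample (V-statistic) estimate of the energy distance between
    two samples x of shape (n, d) and y of shape (m, d).
    Hypothetical helper, not code from the dissertation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def mean_pairwise(a, b):
        # Mean Euclidean distance over all pairs (a_i, b_j).
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()

    return 2.0 * mean_pairwise(x, y) - mean_pairwise(x, x) - mean_pairwise(y, y)


# Synthetic data and a plain random 80/20 split (illustrative assumptions).
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))
idx = rng.permutation(len(data))
train, valid = data[idx[:160]], data[idx[160:]]

# A smaller score means the two subsets are distributed more alike.
split_score = energy_distance(train, valid)

# Shifting one subset makes it unrepresentative and raises the score.
shifted_score = energy_distance(train, valid + 2.0)
```

The pairwise computation above costs O(nm) time and memory, which hints at why scaling energy-distance-based splitting to big data is a concern.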
Rights
© The Author
Is Part Of
VCU University Archives
Is Part Of
VCU Theses and Dissertations
Date of Submission
8-7-2025