DOI

https://doi.org/10.25772/AA4E-RY91

Author ORCID Identifier

https://orcid.org/optimaldatasplittingmethods

Defense Date

2025

Document Type

Dissertation

Degree Name

Doctor of Philosophy

Department

Systems Modeling and Analysis

First Advisor

Anh T Bui

Abstract

In predictive modeling, effective data splitting is crucial for creating statistically representative training and validation sets. State-of-the-art data splitting methods are based on minimizing the energy distance between the split subsets. However, the existing methods have several limitations, which this dissertation aims to address. First, the existing methods are computationally inefficient. Chapter 2 therefore proposes scalable Twinning (s-Twinning), which scales these approaches up to big data and significantly improves the execution speed of data splitting without sacrificing accuracy. Second, the existing methods do not consider the predictive relationship in the data. Chapter 3 therefore proposes Alyke, a novel supervised data splitting method that heuristically minimizes energy distance while accounting for this relationship. Alyke not only improves predictive accuracy but also offers better scalability, making it suitable for big data applications and supervised data compression tasks. Third, the existing approaches are not designed to handle image data. Chapter 4 addresses this drawback by incorporating deep neural networks and transfer learning. There we propose Imalyke, an extension of Alyke, to handle the difficult problem of identifying a representative subset of an image dataset. Image data often exhibit complex spatial patterns, which complicates the creation of representative subsets for predictive modeling. Imalyke combines the principles of Alyke with transfer learning from deep neural networks to optimize performance on image data. This method enables accurate predictions on image datasets and addresses the critical need for reliability in image data applications.
Through simulations and real-world examples, we demonstrate how all three methods (s-Twinning, Alyke, and Imalyke) improve the efficiency and accuracy of data splitting across various domains, offering scalable and specialized solutions for modern predictive modeling challenges.
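
As a rough illustration of the criterion the abstract refers to, the sketch below estimates the energy distance between two subsets and uses it to score candidate train/validation splits. It is not s-Twinning, Alyke, or Imalyke: the helper names (`mean_pairwise_distance`, `energy_distance`, `best_random_split`) and the random-search strategy are assumptions made purely for illustration, and the O(n^2) pairwise computation is exactly the kind of cost that scalable methods aim to avoid.

```python
import numpy as np

def mean_pairwise_distance(a, b):
    # Mean Euclidean distance over all pairs (one row from a, one row from b).
    diffs = a[:, None, :] - b[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).mean()

def energy_distance(x, y):
    # Sample estimate of 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||.
    return (2.0 * mean_pairwise_distance(x, y)
            - mean_pairwise_distance(x, x)
            - mean_pairwise_distance(y, y))

def best_random_split(data, test_fraction=0.2, n_candidates=50, seed=0):
    # Hypothetical baseline: draw random splits and keep the one whose
    # two subsets have the smallest estimated energy distance.
    rng = np.random.default_rng(seed)
    n_test = int(round(test_fraction * len(data)))
    best = None
    for _ in range(n_candidates):
        perm = rng.permutation(len(data))
        test_idx, train_idx = perm[:n_test], perm[n_test:]
        score = energy_distance(data[train_idx], data[test_idx])
        if best is None or score < best[0]:
            best = (score, train_idx, test_idx)
    return best

# Usage: split 100 synthetic two-dimensional points 80/20.
points = np.random.default_rng(1).normal(size=(100, 2))
score, train_idx, test_idx = best_random_split(points)
```

A smaller score means the two subsets look more alike in distribution, which is the sense in which a split is called representative here.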

Rights

Is Part Of

VCU University Archives

Is Part Of

VCU Theses and Dissertations

Date of Submission

8-7-2025

SujayMudalgiVita.pdf (230 kB)

Included in

Applied Statistics Commons, Data Science Commons, Statistical Methodology Commons

Theses and Dissertations

Optimal Data Splitting Methods
