Author ORCID Identifier

0000-0003-4540-4239

Defense Date

2026

Document Type

Dissertation

Degree Name

Doctor of Philosophy

Department

Biostatistics

First Advisor

Jinze Liu

Second Advisor

Jasmohan Bajaj

Third Advisor

Le Kang

Fourth Advisor

Amy Olex

Fifth Advisor

Niloofar Ramezani

Abstract

This dissertation develops and validates an empirical framework for determining optimal sample sizes for machine learning (ML) models in biomedical research, with applications to both tabular clinical data and high-dimensional bulk RNA sequencing (RNA-Seq) data. Although ML methods are increasingly used for binary classification in healthcare, there is no unified approach for estimating the training sample size required to achieve stable predictive performance. Traditional power-based calculations are not directly applicable because ML emphasizes prediction rather than inference and relies on flexible, data-driven model structures.

To address this gap, a learning curve–based methodology was applied across 16 large public clinical datasets (n ≥ 50,000). Random Forest, XGBoost, and neural network classifiers were studied, as well as multivariable logistic regression. For each dataset and algorithm, the cross-validated area under the receiver operating characteristic curve (AUC) was evaluated at increasing training set sizes. The optimal sample size was defined as the smallest n at which performance was within a pre-specified margin (γ = 0.01, 0.02, or 0.05) of the full-dataset AUC, representing an optimal trade-off between discriminative performance and data collection requirements.
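The learning-curve stopping rule described above can be sketched in a few lines of Python. This is an illustrative sketch only: the synthetic dataset, Random Forest settings, and candidate sizes are assumptions for demonstration, not the dissertation's actual configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a large clinical dataset (illustrative only)
X, y = make_classification(n_samples=5000, n_features=20,
                           n_informative=5, random_state=0)
rng = np.random.default_rng(0)

def cv_auc(n):
    """Cross-validated AUC of a classifier trained on a random subsample of size n."""
    idx = rng.choice(len(y), size=n, replace=False)
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X[idx], y[idx], cv=5, scoring="roc_auc").mean()

full_auc = cv_auc(len(y))           # benchmark: AUC using all available data
gamma = 0.02                        # pre-specified tolerance margin
sizes = [250, 500, 1000, 2000, 4000]
# Optimal n: the smallest size whose AUC falls within gamma of the benchmark
optimal_n = next((n for n in sizes if full_auc - cv_auc(n) <= gamma), len(y))
print(f"full AUC={full_auc:.3f}, optimal n={optimal_n}")
```

In practice one would also repeat the subsampling at each size and average, so the curve is not driven by a single random draw.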

The dissertation further examined how dataset-level characteristics, including class imbalance, feature dimensionality, proportion of continuous predictors, strength of linear signal, and degree of nonlinearity, affected required sample sizes. Negative binomial regression models were developed to quantify these relationships and to generate predictive equations for estimating training set sizes in new datasets.

This framework was extended to bulk RNA-Seq data, where additional cost constraints become limiting. Twenty-seven datasets were collected from validated sources, and an extensive simulation approach was used to expand the data artificially, increasing the reliability and robustness of the learning curve analysis. Following feature selection via differential expression analysis (DESeq2 followed by the Boruta algorithm), ML classifiers were evaluated using a similar learning curve approach. Results demonstrated that the sample sizes required for stable ML prediction on bulk RNA-Seq data can differ substantially from those needed for differential gene expression testing alone.

For both the clinical and bulk RNA-Seq frameworks, validation was conducted using large cohorts from the Veterans Health Administration and the NCBI Gene Expression Omnibus, and findings were compared with previously published work where applicable.

Finally, these results were packaged into a user-friendly R Shiny application that allows future researchers to apply the methodology easily. Use cases and example workflows are demonstrated.

Overall, this work provides a novel data-driven and algorithm-specific methodology and implementation for machine learning sample size determination within two large domains of biomedical data analysis.

Rights

© The Author

Is Part Of

VCU University Archives

Is Part Of

VCU Theses and Dissertations

Date of Submission

5-7-2026

Available for download on Friday, May 07, 2027
