Degree Name

Doctor of Philosophy



First Advisor

Kellie J. Archer


The advent of high-throughput sequencing has produced an unprecedented amount of research data, and analytical methodology has not kept pace with it. This work considers two assays, ImmunoSEQ and the cytokinesis-block micronucleus (CBMN) assay, that both produce count data and have few methods available for their analysis.

ImmunoSEQ is a sequencing assay that measures the beta T-cell receptor (TCR) repertoire. The ImmunoSEQ assay was used to describe the TCR repertoires of patients who had undergone hematopoietic stem cell transplantation (HSCT). Several methods for spectratype analysis were extended to the TCR sequencing setting and then applied to these data to demonstrate different ways the data set can be analyzed. The methods include CDR3 distribution perturbation, Oligoscores, Simpson's diversity, Shannon diversity, Kullback-Leibler divergence, a non-parametric method, and a proportion logit transformation method. Herein we also demonstrate adapting compositional data analysis methods to the TCR sequencing setting. The methods were compared by analyzing a set of 13 subjects who underwent HSCT: the eight subjects who developed graft-versus-host disease were compared to the five who did not. There was little overlap in the results of the different methods, showing that researchers must choose the method appropriate to their research question.
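To make three of the listed summaries concrete, the following is a minimal Python sketch of Shannon diversity, a Simpson index, and Kullback-Leibler divergence computed from clonotype count vectors. It is an illustration under stated assumptions, not the dissertation's implementation: the function names, the natural-log base, the use of the Simpson concentration form (sum of squared proportions), and the pseudocount of 0.5 used to avoid zero proportions are all choices made here for the example.

```python
import math


def proportions(counts):
    """Convert a vector of clonotype counts to proportions that sum to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive total")
    return [c / total for c in counts]


def shannon_diversity(counts):
    """Shannon entropy (natural log); zero counts contribute nothing."""
    return -sum(p * math.log(p) for p in proportions(counts) if p > 0)


def simpson_index(counts):
    """Simpson concentration: probability two random reads share a clonotype.

    One minus this value is the Gini-Simpson diversity.
    """
    return sum(p * p for p in proportions(counts))


def kl_divergence(counts_p, counts_q, pseudocount=0.5):
    """KL divergence of repertoire P from repertoire Q over the same clonotypes.

    The pseudocount (an assumption of this sketch) keeps every proportion
    strictly positive so the log ratio is always defined.
    """
    if len(counts_p) != len(counts_q):
        raise ValueError("count vectors must have the same length")
    p = proportions([c + pseudocount for c in counts_p])
    q = proportions([c + pseudocount for c in counts_q])
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical counts for four clonotypes in two samples.
even_sample = [25, 25, 25, 25]
skewed_sample = [97, 1, 1, 1]
print(shannon_diversity(even_sample), shannon_diversity(skewed_sample))
print(simpson_index(even_sample), simpson_index(skewed_sample))
print(kl_divergence(skewed_sample, even_sample))
```

An evenly distributed repertoire gives the maximum Shannon value and the minimum Simpson concentration, while a repertoire dominated by one clonotype moves both in the opposite direction. Because each summary weights rare and dominant clonotypes differently, they need not rank samples the same way.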

The CBMN assay measures the rate of micronuclei (MN) formation in a sample of cells and can be paired with gene expression or methylation assays to determine the association between MN formation and other genetic markers. Herein we extended the generalized monotone incremental forward stagewise (GMIFS) method to the situation where the response is count data and there are more independent variables than samples. Our Poisson GMIFS method was compared to a popular alternative, glmpath, using simulations and by applying both methods to real data. Simulations showed that the two methods perform similarly in accurately choosing truly significant variables. However, glmpath appears to overfit compared to our GMIFS method. Finally, when both methods were applied to two data sets, GMIFS appeared to be more stable than glmpath.
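The general idea of incremental forward stagewise fitting for a Poisson response can be sketched as follows. This is a simplified illustration, not the Poisson GMIFS implementation evaluated in this work: the function name, the step size `epsilon`, the fixed number of steps, the signed coefficient update, and the toy data are all assumptions made for the example. At each step, the sketch refits the intercept in closed form, computes the gradient of the Poisson log-likelihood (log link) with respect to each coefficient, and nudges only the coefficient with the largest absolute gradient by a small amount.

```python
import math


def poisson_forward_stagewise(X, y, epsilon=0.01, n_steps=500):
    """Simplified incremental forward stagewise fit for Poisson counts.

    X is a list of rows (predictors assumed to be on comparable scales),
    y is a list of non-negative counts with a positive total.
    Returns (intercept, beta).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p

    def linear_part():
        return [sum(b * x for b, x in zip(beta, row)) for row in X]

    def fit_intercept(eta):
        # Closed-form intercept: fitted means must sum to the observed total.
        return math.log(sum(y) / sum(math.exp(e) for e in eta))

    for _ in range(n_steps):
        eta = linear_part()
        intercept = fit_intercept(eta)
        mu = [math.exp(intercept + e) for e in eta]
        resid = [yi - mi for yi, mi in zip(y, mu)]
        # Gradient of the log-likelihood with respect to each coefficient.
        grad = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        j_best = max(range(p), key=lambda j: abs(grad[j]))
        if grad[j_best] == 0.0:
            break  # no predictor can improve the fit
        beta[j_best] += epsilon * math.copysign(1.0, grad[j_best])

    return fit_intercept(linear_part()), beta


# Toy data (hypothetical): 4 samples, 5 predictors, so p > n.
# Only the first predictor drives the counts.
X = [
    [-1.0, -1.0, 1.0, 0.5, -0.5],
    [-1.0, 1.0, -1.0, -0.5, -0.5],
    [1.0, -1.0, -1.0, 0.5, 0.5],
    [1.0, 1.0, 1.0, -0.5, 0.5],
]
y = [1, 1, 7, 7]
intercept, beta = poisson_forward_stagewise(X, y, epsilon=0.01, n_steps=50)
print(intercept, beta)
```

Because coefficients start at zero and change by only `epsilon` per step, stopping early yields a sparse model even when predictors outnumber samples. The number of steps acts as the tuning parameter, which in practice would be chosen by a criterion such as AIC, BIC, or cross-validation.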


© The Author

Is Part Of

VCU University Archives

Is Part Of

VCU Theses and Dissertations

Date of Submission