DOI

https://doi.org/10.25772/HGH6-FN04

Defense Date

2021

Document Type

Dissertation

Degree Name

Doctor of Philosophy

Department

Biostatistics

First Advisor

Dr. Mikhail Dozmorov

Abstract

High-throughput chromosome conformation capture technology (Hi-C) has revealed extensive DNA looping and folding into discrete 3D domains. These include Topologically Associating Domains (TADs) and chromatin loops, the 3D domains critical for cellular processes like gene regulation and cell differentiation. The relatively low resolution of Hi-C data (regions of several kilobases in size) prevents precise mapping of domain boundaries by conventional TAD/loop-callers. However, high resolution genomic annotations associated with boundaries, such as CTCF and members of cohesin complex, suggest a computational approach for precise location of domain boundaries.

We developed preciseTAD, an optimized machine learning framework that leverages a random forest model to improve the location of domain boundaries. Our method introduces three concepts - shifted binning, distance-type predictors, and random under-sampling - which we use to build classification models for predicting boundary regions. The algorithm then uses density-based clustering (DBSCAN) and partitioning around medoids (PAM) to extract the most biologically meaningful domain boundary from models trained on high-resolution genome annotation data and boundaries from low-resolution Hi-C data. We benchmarked our method against a popular TAD-caller and a novel chromatin loop prediction algorithm.

Boundaries predicted by preciseTAD were more enriched for known molecular drivers of 3D chromatin including CTCF, RAD21, SMC3, and ZNF143. preciseTAD-predicted boundaries were more conserved across cell lines, highlighting their higher biological significance. Additionally, models pre-trained in one cell line accurately predict boundaries in another cell line. Using cell line-specific genomic annotations, the pre-trained models enable detecting domain boundaries in cells without Hi-C data.

The research presented provides a unified approach for precisely predicting domain boundaries. This improved precision will provide insight into the association between genomic regulators and the 3D genome organization. Furthermore, our methods will provide researchers with flexible and easy-to-use tools to continue to annotate the 3D structure of the human genome without relying on costly high resolution Hi-C data. The preciseTAD R package and supplementary ExperimentHub package, preciseTADhub, are available on Bioconductor (version 3.13; https://bioconductor.org/packages/preciseTAD/; https://bioconductor.org/packages/preciseTADhub/).

Rights

© The Author

Is Part Of

VCU University Archives

Is Part Of

VCU Theses and Dissertations

Date of Submission

5-10-2021

Included in

Biostatistics Commons

Share

COinS