DOI
https://doi.org/10.25772/CG3N-FZ56
Defense Date
2021
Document Type
Thesis
Degree Name
Master of Science
Department
Chemical and Life Science Engineering
First Advisor
Dr. James Ferri
Second Advisor
Dr. David Tyler McQuade
Third Advisor
Mr. William Glandorf
Abstract
Machine learning models for chemical property prediction are high-dimensional design challenges that span multiple disciplines. Free and open-source software libraries have streamlined model implementation, but the design complexity remains. To better navigate and understand the machine learning design space, model information needs to be organized and contextualized. In this work, instances of chemical property models and their associated parameters were stored in a Neo4j property graph database. Machine learning model instances were created with permutations of dataset, learning algorithm, molecular featurization, data scaling, data splitting, hyperparameters, and hyperparameter optimization techniques. The resulting graph contains over 83,000 nodes and 4 million edges and can be explored with interactive visualization software. The structure of the property graph is centered on models and molecules, which enables efficient and intuitive inter- and intra-model evaluation. We use a curated lipophilicity dataset to demonstrate graph use cases. Difficult-to-predict molecules were identified across multiple models simultaneously. Powerful and expressive graph queries were implemented to identify molecular fragments that were both prevalent and associated with high lipophilicity prediction error.
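As a rough illustration of two ideas in the abstract (enumerating model instances as permutations of design choices, and finding molecules that several models predict poorly), the following is a minimal in-memory sketch. It does not use Neo4j and is not the thesis's implementation: the design-space options, the function names enumerate_models and hard_molecules, the error threshold, and the example errors are all hypothetical.

```python
from collections import defaultdict
from itertools import product

# Hypothetical design-space options (illustrative only, not the thesis's lists).
DESIGN_SPACE = {
    "algorithm": ["random_forest", "neural_network"],
    "featurization": ["fingerprint", "descriptors"],
    "split": ["random", "scaffold"],
}


def enumerate_models(space):
    """Yield one configuration dict per combination of design choices."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def hard_molecules(edges, threshold, min_models):
    """Return molecules whose absolute error exceeds `threshold`
    in at least `min_models` distinct models.

    edges: iterable of (model_id, molecule, abs_error) tuples, standing in
    for model-to-molecule relationships in a property graph.
    """
    models_by_molecule = defaultdict(set)
    for model_id, molecule, abs_error in edges:
        if abs_error > threshold:
            models_by_molecule[molecule].add(model_id)
    return sorted(
        mol for mol, ids in models_by_molecule.items() if len(ids) >= min_models
    )


models = list(enumerate_models(DESIGN_SPACE))

# Made-up prediction errors linking model nodes to molecule nodes (SMILES strings).
edges = [
    (0, "CCO", 0.2), (1, "CCO", 0.3),
    (0, "c1ccccc1O", 1.4), (1, "c1ccccc1O", 1.1),
    (0, "CC(=O)O", 1.2), (1, "CC(=O)O", 0.1),
]
hard = hard_molecules(edges, threshold=1.0, min_models=2)
```

In a graph database, the same question would presumably be asked as a pattern-matching query over model and molecule nodes rather than a Python loop; the sketch only shows the shape of the aggregation.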
Rights
© The Author
Is Part Of
VCU University Archives
Is Part Of
VCU Theses and Dissertations
Date of Submission
5-6-2021
Included in
Databases and Information Systems Commons, Data Science Commons, Other Chemical Engineering Commons, Other Chemistry Commons