Degree Name

Doctor of Philosophy


Computer Science

First Advisor

Lukasz Kurgan


Computational prediction of compound-protein interactions has generated substantial interest in recent years owing to the importance of these interactions for drug discovery and drug repurposing. Research suggests that the currently known drug targets constitute only a fraction of the complete set of drug targets, limiting our ability to identify suitable targets for developing new drugs or repurposing current drugs for new diseases. These efforts are further hampered by our limited knowledge of protein-drug (and, more generally, protein-compound) interactions: typically only a subset of the targets is known for the drugs currently in use. This thesis focuses on the most populous category of drug targets, proteins, and addresses three main goals. The first goal is to computationally characterize the current drug targets among human proteins in order to identify a collection of markers that can be used to find novel or potential drug targets. We discover several useful markers that can accelerate the identification of previously unknown drug targets. The second goal is to investigate potential weaknesses in the computational prediction of interactions between proteins and compounds. We find that current predictors of compound-protein interactions often rely on similarity between drugs and compounds to make predictions, i.e., they predict interactions with compounds that are similar to the compounds known to interact with a given protein, and vice versa. We note that proteins are often composed of discernible units, called domains, some of which play a central role in binding compounds. However, the fact that a given domain interacts with a given compound does not guarantee that other proteins sharing that domain (and therefore similar to one another) also interact with this compound. We study this problem and find thousands of such cases.
We empirically investigate whether current computational predictors of compound-protein interactions can effectively differentiate these binding and non-binding cases. We show that while the existing methods achieve very high predictive performance on the typically used (easy) test datasets, only some of them achieve modest levels of predictive performance in this specific (difficult) scenario. Consequently, the third goal is to design, develop, test, and deploy a new solution that aims to improve predictive performance for this difficult scenario. We develop a consensus model that uses machine learning to combine predictions from several current, well-performing predictors with additional inputs that quantify the properties and similarity of compounds and proteins. Our ablation analysis shows that these additional inputs are crucial for the success of our new model, which outperforms the current solutions by a statistically significant margin. We deploy the resulting predictor (MetaBoostCPI) as a convenient webserver for public use.


© The Author

Is Part Of

VCU University Archives

Is Part Of

VCU Theses and Dissertations

Available for download on Saturday, December 16, 2023