Hao WenFollow


Defense Date


Document Type


Degree Name

Doctor of Philosophy


Electrical & Computer Engineering

First Advisor

Wei Zhang


Current heterogeneous CPU-GPU architectures integrate general purpose CPUs and highly thread-level parallelized GPUs (Graphic Processing Units) in the same die. This dissertation focuses on improving the energy efficiency and performance for the heterogeneous CPU-GPU system.

Leakage energy has become an increasingly large fraction of total energy consumption, making it important to reduce leakage energy for improving the overall energy efficiency. Cache occupies a large on-chip area, which are good targets for leakage energy reduction. For the CPU cache, we study how to reduce the cache leakage energy efficiently in a hybrid SPM (Scratch-Pad Memory) and cache architecture. For the GPU cache, the access pattern of GPU cache is different from the CPU, which usually has little locality and high miss rate. In addition, GPU can hide memory latency more effectively due to multi-threading. Because of the above reasons, we find it is possible to place the cache lines of the GPU data caches into the low power mode more aggressively than traditional leakage management for CPU caches, which can reduce more leakage energy without significant performance degradation.

The contention in shared resources between CPU and GPU, such as the last level cache (LLC), interconnection network and DRAM, may degrade both CPU and GPU performance. We propose a simple yet effective method based on probability to control the LLC replacement policy for reducing the CPU’s inter-core conflict misses caused by GPU without significantly impacting GPU performance. In addition, we develop two strategies to combine the probability based method for the LLC and an existing technique called virtual channel partition (VCP) for the interconnection network to further improve the CPU performance.

For a specific graph application of Breadth first search (BFS), which is a basis for graph search and a core building block for many higher-level graph analysis applications, it is a typical example of parallel computation that is inefficient on GPU architectures. In a graph, a small portion of nodes may have a large number of neighbors, which leads to irregular tasks on GPUs. These irregularities limit the parallelism of BFS executing on GPUs. Unlike the previous works focusing on fine-grained task management to address the irregularity, we propose Virtual-BFS (VBFS) to virtually change the graph itself. By adding virtual vertices, the high-degree nodes in the graph are divided into groups that have an equal number of neighbors, which increases the parallelism such that more GPU threads can work concurrently. This approach ensures correctness and can significantly improve both the performance and energy efficiency on GPUs.


© The Author

Is Part Of

VCU University Archives

Is Part Of

VCU Theses and Dissertations

Date of Submission


Available for download on Thursday, December 07, 2023