Weighted LLC Latency-Based Run-Time Cache Partitioning for Heterogeneous CPU-GPU Architecture
Date Issued
2014
Author(s)
Li, Cheng-Hsuan
Abstract
Integrating the CPU and GPU on the same chip has become the development trend in microprocessor design. In an integrated CPU-GPU architecture, utilizing the shared last-level cache (LLC) is a critical design issue because of the pressure on shared resources and the different characteristics of CPU and GPU applications. Because of the latency-hiding capability of the GPU and the huge discrepancy in the number of concurrently executing threads between the CPU and GPU, LLC partitioning can no longer be achieved by simply minimizing the overall cache misses, as in homogeneous CPUs. State-of-the-art cache partitioning mechanisms distinguish cache-insensitive GPU applications from cache-sensitive ones and optimize only the cache misses of CPU applications when the GPU is cache-insensitive. However, optimizing only the cache hit rate of CPU applications generates more cache misses from the GPU and leads to longer queuing delay in the underlying DRAM system. In terms of memory access latency, the loss due to the longer queuing delay may outweigh the benefit of the higher cache hit rate. Therefore, we find that even though the performance of a GPU application may not be sensitive to cache resources, the CPU applications' cache hit rate is not the only factor that should be considered in partitioning the LLC. The cache miss penalty, i.e., the off-chip latency, is also an important factor in designing an LLC partitioning mechanism for integrated CPU-GPU architectures.
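The tradeoff can be made concrete with the standard average-memory-access-latency decomposition; the form of the queuing term below is an illustrative assumption, not the thesis's exact model:

```latex
\mathrm{AMAT}_{\mathrm{CPU}} = t_{\mathrm{LLC}} + m_{\mathrm{CPU}} \cdot \bigl( t_{\mathrm{DRAM}} + t_{\mathrm{queue}}(M_{\mathrm{total}}) \bigr)
```

Here m_CPU is the CPU's LLC miss rate, t_DRAM is the unloaded DRAM latency, and t_queue grows with the total miss count M_total from both the CPU and GPU. A partition that lowers m_CPU by taking capacity from the GPU can raise M_total, and thus t_queue, by enough to lengthen the CPU's overall latency despite the higher hit rate.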
In this paper, we propose a weighted LLC latency-based run-time cache partitioning mechanism for integrated CPU-GPU architectures. To correlate the cache partition with overall performance more accurately, we develop a mechanism that predicts the off-chip latency from the total number of cache misses, and a GPU cache-sensitivity monitor that quantitatively profiles the GPU's performance sensitivity to memory access latency. The experimental results show that the proposed mechanism improves the overall throughput by 9.7% over TLP-aware cache partitioning (TAP), 6.2% over Utility-based Cache Partitioning (UCP), and 10.9% over LRU on 30 heterogeneous workloads.
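A minimal sketch of how such a weighted partition search might look, assuming per-way miss profiles and a sensitivity score are available; the function names, the linear queuing model, and the cost weighting are illustrative assumptions, not the thesis's implementation:

```python
def predict_offchip_latency(total_misses, base_latency=200.0, queue_factor=0.01):
    """Model off-chip latency (in cycles) as the unloaded DRAM latency plus a
    queuing-delay term that grows with total LLC misses per interval."""
    return base_latency + queue_factor * total_misses

def choose_partition(num_ways, cpu_misses_at, gpu_misses_at, gpu_sensitivity):
    """Pick the way split minimizing weighted memory-latency cost.

    cpu_misses_at[w] / gpu_misses_at[w]: profiled miss counts when that side
    is given w ways (lists of length num_ways + 1).
    gpu_sensitivity in [0, 1]: how strongly GPU throughput tracks memory
    latency, as reported by the cache-sensitivity monitor.
    """
    best_ways, best_cost = 1, float("inf")
    for w in range(1, num_ways):  # leave at least one way per side
        cpu_m = cpu_misses_at[w]
        gpu_m = gpu_misses_at[num_ways - w]
        latency = predict_offchip_latency(cpu_m + gpu_m)
        # The CPU pays the full off-chip latency per miss; the GPU's misses
        # are discounted by its measured sensitivity, since a latency-hiding
        # GPU loses little performance to longer memory latency.
        cost = (cpu_m + gpu_sensitivity * gpu_m) * latency
        if cost < best_cost:
            best_ways, best_cost = w, cost
    return best_ways  # number of LLC ways allocated to the CPU
```

With a cache-insensitive GPU (gpu_sensitivity near 0) the search favors the CPU, as a hit-rate-only policy would; as sensitivity or the queuing factor grows, the GPU's miss traffic increasingly pulls the split back, capturing the latency effect the abstract describes.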
Subjects
Cache partitioning
Heterogeneous platform
Main memory access latency
Central processing unit (CPU)
Graphics processing unit (GPU)
Type
thesis
File(s)
Name
ntu-103-R01922029-1.pdf
Size
23.32 KB
Format
Adobe PDF
Checksum (MD5)
a16590d33b69233fbebcff7b7a2d137c
