
The 3 C's of cache misses
• Cold start
• Capacity
• Conflict

Line size = 64 B (L1 and L2)

[Diagram: a two-socket, quad-core Intel Core2 system; each core has a 32K L1, each pair of cores shares a 4 MB L2, and both front-side buses feed a chipset with 4×64b memory controllers driving 667 MHz FBDIMMs at 21.3 GB/s (read) and 10.6 GB/s (write). Figure: Sam Williams et al.]
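To make the three miss categories concrete, here is a minimal C sketch (not from the slides; the 32 KB L1 and 64 B lines match the Core2 parts pictured, everything else is assumed):

```c
#include <stddef.h>

#define L1_BYTES (32 * 1024)               /* per-core L1 above       */
#define N (4 * L1_BYTES / sizeof(double))  /* array 4x larger than L1 */

double a[N], x[N], y[N];

double sweep(void) {
    double sum = 0.0;
    /* Cold start: the first pass touches every 64 B line for the first
       time, so one miss per line (every 8th double), however large the
       cache is. */
    for (size_t i = 0; i < N; i++) sum += a[i];
    /* Capacity: a[] is 4x the cache, so by the time this second pass
       returns to a[0] its line has long been evicted, and the misses
       repeat even though the data was recently used. */
    for (size_t i = 0; i < N; i++) sum += a[i];
    /* Conflict: if x[i] and y[i] happen to lie a multiple of the cache's
       way size apart, they map to the same set and evict each other on
       every iteration, even though most of the cache sits idle. */
    for (size_t i = 0; i < N; i++) sum += x[i] * y[i];
    return sum;
}
```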


Set associative cache
• Why use the middle bits for the index? (See the address-decoding sketch after these slides.)

[Diagram: a cache of sets 0 through S−1; each set holds several lines, and each line has 1 valid bit, t tag bits, and a block of B = 2^b bytes. An m-bit address splits into t tag bits, s set-index bits, and b block-offset bits. Figure: Randal E. Bryant.]

Accessing a direct mapped cache
(1) The valid bit of the selected line must be set.
(2) The tag bits stored in the cache line must match the tag bits in the address.
(3) If (1) and (2) are true, then we have a cache hit; the block offset selects the starting byte.

[Diagram: an example address with tag bits 0110, s line-index bits selecting line i, and block-offset bits 100 selecting the starting byte within the four-word block w0..w3. Figure: Randal E. Bryant.]

Simplest cache: direct mapped cache

[Diagram: lines 0 through S−1, each holding a valid bit, a tag, and a cache block; the address splits into t tag bits, s line-index bits, and b block-offset bits, with the index selecting the line and the tag confirming the match. Figure: Randal E. Bryant.]

Different types of caches
• Separate instruction (I) and data (D), or unified (I+D)
• Direct mapped / set associative
• Write through / write back
• Allocate on write / no allocate on write
• Last level cache (LLC)
• Translation lookaside buffer (TLB)

[Diagram: the same two-socket quad-core Core2 system as above, with a 32K L1 per core and a 4 MB shared L2 per core pair. Figure: Sam Williams et al.]

Sidebar
• If cache memory access time is 10 times faster than main memory, the cache hit time is T_cache = T_main / 10; T_main is the cache miss penalty.
• If we find what we are looking for a fraction f of the time (f × 100%, the cache hit rate):
  Access time = f·T_cache + (1 − f)·T_main = f·T_main/10 + (1 − f)·T_main = (1 − 9f/10)·T_main
• We are now 1/(1 − 9f/10) times faster.
• To simplify, we use T_cache = 1, T_main = 10. (Worked numerically in the sketch after these slides.)

The Benefits of Cache Memory
• Let's say that we have a small fast memory that is 10 times faster (in access time) than main memory.
• If we find what we are looking for 90% of the time (a hit), the access time approaches that of the fast memory:
  T_access = 0.9·1 + (1 − 0.9)·10 = 1.9
• Memory appears to be about 5 times faster.
• We organize the references by blocks.
• We can have multiple levels of cache.

An important principle: locality
• Memory accesses exhibit two forms of locality: temporal locality (time) and spatial locality (space).
• Often involves loops, with opportunities for reuse, e.g.
    for t = 0 to T−1
      for i = 1 to N−2
        u[i] = (u[i−1] + u[i+1]) / 2
  (fleshed out in the sketch after these slides)
• Idea: construct a small & fast memory to cache re-used data.

[Diagram: the memory hierarchy, smaller and faster toward the CPU: CPU, 1 CP (1 word); L1, 32 to 64 KB, 2–3 CP (10 to 100 B); L2, 256 KB to 4 MB, O(10) CP; DRAM, GBs, O(100) CP; disk, TB to PB, O(10^6) CP.]

[Figure: the processor–memory performance gap grows about 50% per year, while DRAM performance improves only about 7% per year.]

Today's lecture
• The memory hierarchy
• Cache coherence and consistency
• Implementing synchronization
• False sharing

Announcements
• SDSC tour on Friday 11/1

Using Bang: coming down the home stretch
• Do not use Bang's front end for running mergesort.
• Use batch, or interactive nodes via qlogin.
• Use the front end for editing & compiling only.
• 10% penalty for using the login nodes improperly; the penalty doubles with each incident!
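Both direct-mapped slides above split an m-bit address into t tag bits, s index bits, and b block-offset bits. Here is a minimal decoding sketch; the constants assume a 32 KB direct-mapped cache with 64 B lines (b = 6, s = 9), which is illustrative rather than taken from the slides:

```c
#include <stdint.h>
#include <stdio.h>

enum { B_BITS = 6,   /* 64 B blocks              -> b = 6 offset bits */
       S_BITS = 9 }; /* 32 KB / 64 B = 512 lines -> s = 9 index bits  */

int main(void) {
    uintptr_t addr = 0x7f1234c4;  /* arbitrary example address */

    uintptr_t offset = addr & ((1u << B_BITS) - 1);             /* low b bits    */
    uintptr_t index  = (addr >> B_BITS) & ((1u << S_BITS) - 1); /* middle s bits */
    uintptr_t tag    = addr >> (B_BITS + S_BITS);               /* high t bits   */

    /* A hit requires (1) the valid bit of line `index` to be set and (2) the
       stored tag to equal `tag`; the offset then selects the starting byte. */
    printf("tag = %#llx  index = %llu  offset = %llu\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}
```

This also suggests the answer to the set-associative slide's question: indexing with the middle bits maps consecutive blocks to consecutive sets, so a contiguous array spreads across the whole cache, whereas indexing with the high bits would pile a contiguous region into a single set.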
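A quick numeric check of the Sidebar formula, using its simplification T_cache = 1 and T_main = 10 (a sketch; the range of hit rates is arbitrary):

```c
#include <stdio.h>

int main(void) {
    const double t_cache = 1.0, t_main = 10.0;
    for (int i = 5; i <= 10; i++) {                   /* hit rates 0.5 .. 1.0   */
        double f = i / 10.0;
        double t = f * t_cache + (1.0 - f) * t_main;  /* = (1 - 9f/10) * t_main */
        printf("f = %.1f  T_access = %4.1f  speedup = %.2fx\n", f, t, t_main / t);
    }
    return 0;
}
```

At f = 0.9 this prints T_access = 1.9 and a speedup of about 5.3x, matching the "Benefits of Cache Memory" slide.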
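The locality slide's loop, fleshed out as a sketch; reading its flattened update as the three-point average u[i] = (u[i−1] + u[i+1]) / 2 is an assumption, as is the function name:

```c
#include <stddef.h>

/* T relaxation sweeps over a 1-D array of N doubles. */
void relax(double *u, size_t N, int T) {
    if (N < 3) return;                       /* need both neighbors             */
    for (int t = 0; t < T; t++)              /* temporal locality: the whole    */
        for (size_t i = 1; i <= N - 2; i++)  /* array is reused on every sweep; */
            u[i] = (u[i-1] + u[i+1]) / 2.0;  /* spatial locality: u[i-1], u[i],
                                                u[i+1] usually share a 64 B line */
}
```

While 8·N bytes fit in some level of the hierarchy, every sweep after the first hits in that level; once they don't, each sweep streams from the level below, which is the capacity effect from the 3 C's slide.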

CSE 160 Lecture 5
The Memory Hierarchy, False Sharing, Cache Coherence and Consistency
Scott B. Baden
