
Cache memory

You might be reading this post out of context; it is meant as a branch from other posts on this blog.
All posts labeled bits and pieces are rough writings, possibly direct copies of outdated logs, and they are subject to frequent changes.

So the CPU has a cache, which is a form of high-speed memory (SRAM). It is not necessary for the system programmer to manage it (with some exceptions), because the processor is designed to make the best use of it on its own.

Caches are introduced to the memory system of a processor to narrow the gap between the relatively slow main memory and the much faster processor; multiple levels of cache are added for the same reason.

Here is a scenario in which the processor would perform better by using the cache memory:

  • A loop is present in the code, which indicates that some memory addresses in the execution space of that code (in RAM) are likely to be accessed frequently within a very short time. For faster access to these locations, it pays to copy them into the cache, which can be read much more quickly.

The cache is significantly smaller than main memory, but the data used during an execution session is usually placed close together in RAM and is usually quite limited in size, which makes such a small cache practical. This is explained by the principle of locality: programs access small portions of the address space at any given instant of time.

It is worth mentioning that there are two types of locality. The first is temporal (locality in time): if a memory location is accessed by the processor at a given time, chances are it will be accessed again soon (as in loops). The second is spatial (locality in space): if a memory location is accessed, chances are its immediate neighbors will be accessed as well (as in array scanning).
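As a quick illustration (my own sketch, not taken from any of the references), a simple array sum shows both kinds of locality at once:

    #include <stddef.h>

    /* Temporal locality: 'total', 'i' and the loop instructions themselves are
     * touched on every iteration.
     * Spatial locality: buf[i] walks through consecutive addresses, so every
     * cache line fill brings the next few elements along "for free". */
    long sum(const int *buf, size_t len)
    {
        long total = 0;
        for (size_t i = 0; i < len; i++)
            total += buf[i];
        return total;
    }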

NOTE: The processor usually has a proper hardware-tailored address space section to deal with cache (SRAM).

In a split cache setup, code and data are separated: code is cached in a dedicated cache, which stores the decoded instructions to be executed, so that the processor does not have to wait for the instruction decoder to decode instructions at its relatively slow pace. Data is cached in its own cache as well.

NOTE: If the cache is enabled, all loads and stores (LDM/STM ARM instructions, for example) pass through the cache before being delivered from/to main memory (RAM).

There are usually multiple levels of cache (L1, L2, ...), again because of speed considerations: the gap between the speed of main memory and that of the processor was the reason for caching in the first place, and it remains the reason for adding more and more cache levels.

By default all data read or written by the CPU cores can be stored in the cache. But there are memory regions which cannot be cached.

Here's a scenario of a CPU making use of L1 data cache:

  1. The processor wants to fetch a data word from memory.
  2. The first thing it does is ask the L1d for it (by passing it the address).
  3. If it is available, the processor saves the trip and just uses it.
  4. If not, it fetches the word from main memory (assuming no L2 cache); a small sketch of this flow follows the list.
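The flow can be sketched in C. The names l1d_lookup() and ram_read() are made up for illustration; the real lookup happens entirely in the cache controller hardware, not in software:

    #include <stdint.h>
    #include <stdbool.h>

    /* hypothetical interfaces, for illustration only */
    bool     l1d_lookup(uint32_t addr, uint32_t *word);  /* step 2: ask the L1d */
    uint32_t ram_read(uint32_t addr);                    /* step 4: go to RAM   */

    uint32_t load_word(uint32_t addr)
    {
        uint32_t word;
        if (l1d_lookup(addr, &word))   /* step 3: hit -- just use the cached copy */
            return word;
        return ram_read(addr);         /* step 4: miss -- fetch from main memory
                                          (assuming no L2 cache)                  */
    }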

Cache structure

The main internal components of a cache are cache controller & cache memory.
The cache memory is structured as an array of rows as illustrated in the following figure:

[Figure: 80B cache memory]

Every row is a storage unit called a cache line or data block. The cache-tag part of the cache line contains (part of) the address of the actual data fetched from main memory, while the data part of the cache line contains the actual contents of the tagged memory address. The 'v' status bit tells the cache controller whether the cache line contains valid data (available to the processor on demand) or not, whereas the dirty bit 'd' indicates whether the data contained in the cache line has been modified and therefore no longer matches the copy stored in main memory.
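Conceptually, a single cache line from the figure can be pictured as a small record. This is only an illustration in C; the real thing is dedicated SRAM plus control logic, not a struct:

    #include <stdint.h>

    #define WORDS_PER_LINE 4            /* four words per line in the 80B example */

    struct cache_line {
        unsigned valid : 1;             /* 'v': line holds usable data            */
        unsigned dirty : 1;             /* 'd': data modified, differs from RAM   */
        uint32_t tag;                   /* (part of) the cached memory address    */
        uint32_t data[WORDS_PER_LINE];  /* the actual cached contents             */
    };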

The cache controller's function is to copy data from main memory into cache memory and to provide the processor with the data it requests from main memory. It performs this task by (transparently) intercepting the communication between the processor and the memory controller: it takes the address the processor requests from main memory and splits it into a tag field, a set index field and a block offset field.

An effective memory address is split into the following:

| tag | set index | block offset |

set index: selects which cache row (which cache line) the data has been, or is to be, placed in.
block offset: selects the desired word within the data stored in the cache line (four words per cache line in our 80-byte cache).
tag: used by the cache controller; for a cache hit, this tag must match the cache-tag of the selected cache line (read on).

After splitting the effective memory address requested by the processor, the cache controller uses the set index field to locate the cache line within the cache memory. It then checks whether the found cache line's valid bit is set and whether the tag field from the split address matches the cache-tag field in the directory part of the cache memory. If both checks succeed, it is a cache hit; if either fails, it is a cache miss.

On a cache miss, the cache controller performs a cache line fill: it copies a block of data (4 words in our 80-byte cache) into a cache line, updates the status bits, and supplies the processor with the requested data.
On a cache hit, the cache controller provides the requested data to the processor directly from cache memory, using the block offset field of the effective memory address.
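Here is a small, purely illustrative software model of that hit/miss handling. To keep it short it models a direct-mapped cache with made-up sizes (one line per set); the ARM caches discussed in the next section are set-associative, and of course the real controller does all of this in hardware:

    #include <stdint.h>
    #include <string.h>

    #define NUM_LINES      8                      /* made-up: 8 cache lines           */
    #define WORDS_PER_LINE 4                      /* 16-byte lines (four words)       */
    #define LINE_BYTES     (WORDS_PER_LINE * 4)

    struct line { int valid; uint32_t tag; uint32_t data[WORDS_PER_LINE]; };

    static struct line cache[NUM_LINES];
    static uint32_t ram[1024];                    /* stand-in for main memory         */

    /* read one word; addr is assumed word-aligned and within the 'ram' array */
    uint32_t cache_read(uint32_t addr)
    {
        uint32_t word_idx = addr / 4;
        uint32_t offset   = word_idx % WORDS_PER_LINE;        /* block offset        */
        uint32_t index    = (addr / LINE_BYTES) % NUM_LINES;  /* set index           */
        uint32_t tag      = addr / (LINE_BYTES * NUM_LINES);  /* the remaining bits  */

        struct line *l = &cache[index];
        if (!(l->valid && l->tag == tag)) {
            /* miss: perform a cache line fill from "main memory",
               then update the status bits */
            memcpy(l->data, &ram[word_idx - offset], LINE_BYTES);
            l->tag   = tag;
            l->valid = 1;
        }
        /* hit (possibly right after the fill): hand back the word
           selected by the block offset */
        return l->data[offset];
    }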

On power-up, some processors possess a hardware component that sets all the valid bits in all the caches to "invalid" (Cortex-A8). Some systems also set a valid bit to "invalid" at other times -- such as when multi-master bus snooping hardware in the cache of one processor hears an address broadcast from some other processor, and realizes that certain data blocks in the local cache are now stale and should be marked invalid.

Set associative cache implementation concepts

There are a number of cache implementations around, but I will address the common set-associative cache organization, because it is the implementation used for the main caches in ARM cores. If you are interested in studying the simpler direct-mapped caches, you may refer to section 8.4.2 Direct mapped caches of the Cortex-A Series Programmer's Guide, which gives a compact explanation of this type of cache and provides a simple example to demonstrate its shortcomings (thrashing).

From a conceptual perspective, we can view a set-associative cache as a two-dimensional space: this space has rows called SETS and columns called WAYS. The addressable 'intersections' of these rows and columns are called cache blocks (cache lines), which are the smallest units that can be stored in this two-dimensional space. Each cache block can be treated as an array whose elements are the cached data values.
It follows that, in order to retrieve a cached value, the cache controller has to break the address it receives as input down according to an agreed-upon scheme. This is done in hardware by the cache controller.

[Figure: basic breakdown of an address in a set-associative cache]

The size of the offset chunk is determined by the amount of data each cache block holds: if each cache block can be conceptually represented by an N-element array, meaning it can hold N 'retrievable' data units (i.e. words), then the offset chunk of the broken-down address is log2(N) bits wide. For example, if a cache block has 16 data units, then we need log2(16) = 4 bits to address the whole span of data within the cache block (because 2^4 = 16).

The sizes of the SET INDEX and TAG parts of the broken-down address can be determined from the number of sets available in the implemented cache, combined with the number of ways (columns) each set (row) contains, the size of the cache blocks, and the overall size of the cache. The technical sheet of the cache implemented in your system will tell you that the cache is an N-way set-associative cache, which basically means that each set has N ways.

Let's take the example of a 32KB 4-way set-associative cache, with 64-byte blocks and a 32-bit byte-addressable address space (you would find this information in the datasheet of your cache implementation). This means that there are 4 ways in each set, and each way hosts a cache block of size 64B. With that in mind, the total number of blocks is equal to the capacity divided by the block size (32768/64 = 512, or 2^9). Knowing the number of blocks, the number of sets is equal to the number of blocks divided by the number of ways in each set (2^9 / 2^2 = 2^7), so the width of the SET INDEX is 7 bits. Using log2(64), we find that the OFFSET part is 6 bits wide, which leaves (32 - 7 - 6 = 19) bits for the TAG part.
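The same numbers can be checked with a few lines of C. The constants below simply mirror the worked example (32KB, 4-way, 64B lines, 32-bit addresses); the address passed to split() is arbitrary:

    #include <stdint.h>
    #include <stdio.h>

    /* 32768 / 64 = 512 blocks; 512 / 4 ways = 128 sets -> 7 index bits;
     * log2(64) = 6 offset bits; 32 - 7 - 6 = 19 tag bits.                 */
    #define OFFSET_BITS 6
    #define INDEX_BITS  7

    static void split(uint32_t addr)
    {
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("addr 0x%08x -> tag 0x%05x, set %u, offset %u\n",
               (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
    }

    int main(void)
    {
        split(0x80001234);    /* arbitrary example address */
        return 0;
    }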

Knowing this calculation procedure is useful when writing cache cleaning/invalidation assembly routines; however, generic routines are usually provided in the core's documentation (at least for ARM).

Set associative cache implementation schematic

[Figure: 4-way set-associative cache schematic]

From a hardware perspective, a set-associative cache is wired up as illustrated in the figure above. Such an implementation brings several advantages, the most notable being that when the cache capacity is increased, more sets are added, and no additional comparators/gates are required to cope with the growth unless the associativity is increased (by moving from a 4-way to an 8-way set-associative cache, for example). Note that this schematic does not show the valid-bit handling logic.

Cache and MMU

If a processor core supports virtual addresses, the physical position of the cache relative to the MMU defines another aspect of cache implementation. If the cache is placed between the processor and the MMU, with the MMU connected to the memory controller on one side and to the cache controller on the other, then the addresses the cache stores are virtual addresses, and it has to depend on the MMU to perform the translation into physical addresses; in this sort of setup, the cache is called a logical/virtual cache.
The other setup is when the cache hardware is placed between the MMU and the memory controller, in which case the stored cache addresses are physical addresses; in this setup the cache is referred to as a physical cache, and it is the more common arrangement.

References

  • ARM System Developer’s Guide, Designing and Optimizing System Software.
  • What Every Programmer Should Know About Memory by Ulrich Drepper of Red Hat, Inc.
  • Cortex-A Series programmer's guide.