
MMU

You might be reading this post out of context; it is meant to be a branch from other posts on this blog.
All posts labeled bits and pieces are rough writings, possibly direct copies of outdated logs, and they are subject to frequent changes.

What it does

The memory management unit (MMU) is a hardware component that resides in the core processor. It is responsible for:

  • Virtual-to-physical address translation
  • Domain and memory access permission management

Why virtual memory?

Historically, utilizing main memory for program execution has gone through several stages:

  • Single user contiguous scheme

Only one program in execution (a process) could be in memory at a given time; only when this program had completely finished executing could another process be loaded into memory. The main issues with this scheme are sluggish execution and the physical memory size bound on programs, meaning that if a process's code + data are larger than the available memory, the process cannot be executed, period!

  • Fixed partition scheme

Multiple processes could be loaded into memory by partitioning the available physical memory into fixed-size blocks. Because of the static nature of the partitions, this scheme wasted a lot of memory in-between loaded processes, and suffered from efficiency issues, the biggest of which was the need to reboot the computer every time you needed to change the partition sizes. It was still better than the previous scheme, and gave the OS scheduler more work to do.

  • Dynamic partition scheme

Here, partitions are still contiguous regions of memory, but they are carved out dynamically: processes are given the memory size they ask for. Two algorithms were commonly used to perform this job. In the first (first-fit), the OS looks for the first unoccupied region that could fit the demanded process space and hands it over to the process; in the second (best-fit), the OS looks for the smallest available region that could fit the process and gives it to the demanding process (see the sketch after this list). You should see the efficiency vs. speed trade-off, I presume? This sort of memory management was better than fixed partitioning because it somewhat reduced memory waste, and discarded the need for a system reboot.

  • Paging scheme

This is where things start looking much like the memory management schemes we have today. By deciding not to load the whole process into memory, designers freed themselves from the size restrictions of physical memory. The available physical memory space is divided into equally sized chunks called page frames, and the process's code + data are divided into chunks that are each equal in size to a single page frame; the memory manager is then able to slot these chunks, each into an available page frame. The page frames used for a single process need not be slotted in consecutive order in physical memory, which gives the memory manager more room to use the available memory efficiently. But it adds complications: the memory manager has to keep track of where each and every chunk of code that belongs to each process is, and it uses multiple memory-maintained structures (tables) to achieve that.

  • Demand paging memory allocation scheme

As an extension to the paging scheme, the memory manager and the OS 'evolved' to work together and load page frame contents only when they are actually needed, and to discard page frame contents that are no longer, or not immediately, needed in order to free memory. This is done in accordance with a predefined page replacement policy.

  • Segmented memory allocation scheme

When the paging and demand paging memory allocation schemes proved to be rather efficient compared to the previous schemes, designers thought: "Why not break down program code into smaller logically contiguous segments?", and the segmented memory allocation scheme was born. In this scheme, programs were broken down into segments that represent small, uneven (differently sized) logical chunks, like functions in C. This meant an additional amount of relational tables and complex logic (overhead) to handle execution and memory.

  • A mixture of Demand paging memory allocation and Segmented memory allocation schemes

This is where designers thought: let's grab the advantages of both worlds by implementing BOTH schemes. They ended up adding a bunch of new tables and complex logic on top of the complexities each scheme naturally carries.
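To make the two allocation algorithms from the dynamic partition scheme concrete, here is a minimal sketch in C. Everything in it (the region structure, first_fit, best_fit) is invented for illustration; real allocators track free regions in more elaborate ways.

#include <stddef.h>
#include <stdint.h>

struct region { uintptr_t base; size_t size; int free; };

/* First-fit: return the first free region large enough. */
static struct region *first_fit(struct region *r, size_t n, size_t want)
{
    for (size_t i = 0; i < n; i++)
        if (r[i].free && r[i].size >= want)
            return &r[i];
    return NULL; /* nothing fits */
}

/* Best-fit: scan everything, return the smallest free region that still fits. */
static struct region *best_fit(struct region *r, size_t n, size_t want)
{
    struct region *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (r[i].free && r[i].size >= want &&
            (best == NULL || r[i].size < best->size))
            best = &r[i];
    return best; /* NULL if nothing fits */
}

First-fit stops at the first match (fast, but may fragment large regions), while best-fit scans the whole list to waste less space per allocation: that is the efficiency vs. speed trade-off mentioned above.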

Having gone through all these schemes, you can see that all this dancing around was done (in addition to other reasons) in order to make more efficient use of a finite, expensive, volatile memory. But what if we could 'virtually' extend this memory, and have processes, regardless of their size, think that the physical memory they need to execute is always enough, regardless of the actual size of available physical memory in the system on which they run?
This is achieved by means of invading secondary storage memories to store chunks and bits of processes that are queued for execution. This is the primary motivation for the use of virtual addresses; there are of course more benefits that come along, like simplifying memory management in multitasking operating systems, and ridding the user space programmer of the worries of memory limitations.

What's a virtual address?

In a system that supports virtual addresses, a virtual address is the address which is seen and used by processes. You see, in such a system, processes live in a virtual memory space: an imaginary array of memory blocks that starts at (usually) base address 0 and ends with an address called the bound address, which denotes the end of the available virtual memory space. Since these processes run on the CPU, the CPU is fooled into believing that this virtual memory space actually exists. Every running process also believes that it is the only process that lives in this virtual memory space!
The MMU, however, has a pretty clear understanding of the distinction between a virtual address and a physical address.
To the MMU (and the managing OS), a virtual address is an abstraction of a physical address that, from a functional point of view, refers to the same data that the abstracted physical address does, but lacks the meta-data that the physical address has, particularly the location in a physical device. The MMU is responsible for creating this 'illusion' of a virtual memory space the CPU is so convinced actually exists. It [the MMU] conspires with the operating system to make sure the CPU and running processes stay in the dark on this matter.
The MMU is exposed to a physical memory space, which is generally divided between main memory (DRAM) and secondary storage (disks or flash memory; think swap space in Linux). The virtual memory space is built on top of this physical memory space.

The most basic concept of today's MMUs

[Illustration: simplest MMU concept]

The CPU generates an address (because a running process references it) and passes it to the MMU. The MMU, using its internal components, translates it to a physical address with a known physical location, checks whether this physical location is in main memory or secondary storage, performs some magic to handle both situations, then checks the access permissions of this address location. Based on that, it either delivers the required information to the CPU, or raises an exception to be handled by the OS.
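A toy model of that flow in C, just to make the decision points explicit. Every name here (mmu_check, struct mapping, the fault codes) is invented for illustration and is not any real MMU's interface:

#include <stdint.h>
#include <stdbool.h>

enum fault { FAULT_NONE, FAULT_PAGE, FAULT_PERM };

struct mapping {
    bool     valid;    /* is there a translation at all?       */
    bool     in_ram;   /* resident in main memory, or on disk? */
    bool     writable; /* toy permission bit                   */
    uint32_t paddr;    /* translated physical address          */
};

/* The MMU either produces a physical address or reports a fault
 * for the OS to handle. */
enum fault mmu_check(const struct mapping *m, bool is_write,
                     uint32_t *paddr_out)
{
    if (!m->valid || !m->in_ram)
        return FAULT_PAGE;  /* OS must create or fetch the page    */
    if (is_write && !m->writable)
        return FAULT_PERM;  /* OS decides: signal the process, ... */
    *paddr_out = m->paddr;  /* deliver the translated address      */
    return FAULT_NONE;
}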

Segmentation based memory management

A basic MMU of the simplest form manages memory using this scheme. In a nutshell, every process (program in execution) has its own private view of memory. It sees the memory space as a set of segments: code segment, data segment, etc.
Every segment has a base address representing its first address, and a bound address representing the last address within this segment. This information is saved in a segment descriptor table (SDT) maintained by the MMU.

[Illustration: virtual to physical memory translation (segmentation based)]

Take the example of the following program:

0x00000000
0x00000004   int i = 20;
0x00010008   int main(){
0x0001000C    int sum;
0x0001000F    sum = i + 1;
0x0001001C    return 0;}

The processor would expect function main to reside in the code segment. To execute the contents of main, the CPU will, at some point, attempt to fetch the instructions corresponding to sum = i + 1 from the running process's code segment, so it will generate the address 0x0001000F. The MMU will receive this address and divide it into 0x0001 (segment selector) and 0x000F (logical address). The logical address part will be used as an offset from the base of the code segment to get to the actual physical location. Note that 0x0001 is not necessarily equal to the upper half of the actual physical address; the address provided by the SDT could be any location that is mapped to this process's code segment. It is as if the segment selector and the logical address from the generated address are telling the MMU: "get the segment's actual base address and add an offset of 0x000F to get to the required information." If the offset exceeds the bound of the segment, an access fault is flagged, and the operating system has to deal with it.
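Here is a minimal sketch of that translation in C. It assumes the 16-bit selector / 16-bit offset split used in this post's example (not any real architecture's format), and the names (sdt_entry, segment_translate) are made up:

#include <stdint.h>
#include <stdbool.h>

struct sdt_entry {
    uint32_t base;   /* segment's first physical address */
    uint32_t bound;  /* size of the segment in bytes     */
};

/* Returns true and fills *paddr on success; false means an access
 * fault that the OS has to deal with. */
bool segment_translate(const struct sdt_entry *sdt, uint32_t vaddr,
                       uint32_t *paddr)
{
    uint32_t selector = vaddr >> 16;     /* e.g. 0x0001 from 0x0001000F */
    uint32_t offset   = vaddr & 0xFFFF;  /* e.g. 0x000F                 */

    if (offset >= sdt[selector].bound)
        return false;                    /* offset exceeds the bound    */
    *paddr = sdt[selector].base + offset;/* base + offset               */
    return true;
}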

Paging based memory management

As discussed previously (under the paging scheme above), the OS and hardware (MMU) divide both the virtual and physical memory spaces into equally sized pages; the most common page size is 4KB, but that isn't a requirement for implementation. The operating system, with the help of the MMU, maintains a structure in memory that keeps track of the pages that belong to a certain process (its private virtual space) and the corresponding physical address space. This structure is called a page table; consequently, every process has its own page table. Despite the fact that the job of managing translations between virtual and physical addresses could very well be handled in software by the operating system, it wouldn't be the most practical, nor the most efficient, way to go about this business, so the act of translation is generally delegated to hardware, particularly the MMU. Paging based memory management gives the system the benefit of using secondary storage for multitasking, context switching, and accommodating large jobs in a small main memory space.

Paging

When the CPU demands an address that belongs to a virtual page mapped to a physical page residing in secondary storage, a page fault is signaled by the MMU, and the OS is made aware of the situation. Generally, this causes the execution of an exception handler that grabs the physical page from secondary memory and places it in main memory, introducing an unpredictable overhead. This exception handling can be implemented in various ways; the most popular is to utilize direct memory access (DMA) to move the data without hogging the processor.

Page table

A page table is the data structure that stores the mapping of virtual pages to physical pages. Every entry in a page table is called a page table entry (PTE). In its simplest form, a PTE contains control bits, such as a dirty bit, a valid bit, permissions, etc., in addition to the mapping of a virtual page base address to a physical page (frame) base address. Another structure that is often maintained is one that keeps track of which frames in physical memory are actually mapped (utilized); this structure is called the frame table.
The basic mechanism of virtual to physical address translation is called a table walk. Conceptually, table walks are in a way similar to translations in a segmentation based memory system, but there's a bit more to them. Before going into the translation process, let's extend our basic page table understanding.
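As a concrete, purely illustrative example, a 32-bit PTE for 4KB pages could be laid out like this in C. Real formats (x86, ARM, ...) place these bits differently; this is only one plausible arrangement:

#include <stdint.h>

struct pte {
    uint32_t present  : 1;  /* valid bit: is the page in main memory?  */
    uint32_t dirty    : 1;  /* has the page been written to?           */
    uint32_t accessed : 1;  /* referenced recently (helps replacement) */
    uint32_t writable : 1;  /* permission: may the process write?      */
    uint32_t user     : 1;  /* permission: accessible from user mode?  */
    uint32_t unused   : 7;  /* room for more control bits              */
    uint32_t frame    : 20; /* physical frame number (4KB pages)       */
};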

A more practical view of page tables

In a hierarchical page tables setup, there are multiple levels of page tables. To demonstrate why it is more practical to have multiple levels of tables to implement the paging scheme, consider the following one-level setup:

[Illustration: virtual address break-up, representing a table walk to fetch the virtual address (0x000080-0002)]
  • The directory part of a virtual address addresses a PTE in a table that holds the base addresses of memory pages that are typically 4KB in size; this table is called a page directory.
  • The page directory table is placed in RAM as a data structure by the OS at some point.
  • Each and every process has such a table.
  • The page directory table that is specific to a process is part of its state, which is saved during context switching.
  • The base address of the page directory data structure is stored in a special register in the processor (a control register, often c2 in ARM's coprocessor 15); this storing is also performed by the OS.
  • Now that the processor knows the base address, the directory bits of a virtual address are used as an index into the page directory, which is actually an array of page directory entries, while the offset part is used to seek into the actual page frame to reach the desired address.

The offset part is the actual address of the memory location within the addressed page (not entirely true; read on for an explanation!).
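Under these assumptions, a one-level translation boils down to an index plus an offset. This sketch assumes 4KB pages (12 offset bits), a PTE whose low 12 bits hold control bits, and an invented name (translate_1level):

#include <stdint.h>

/* One-level translation: the upper bits select a PTE, the lower
 * 12 bits seek into the page frame. Illustrative only. */
uint32_t translate_1level(const uint32_t *page_directory, uint32_t vaddr)
{
    uint32_t index  = vaddr >> 12;   /* directory bits select a PTE   */
    uint32_t offset = vaddr & 0xFFF; /* offset seeks into the frame   */
    uint32_t frame_base = page_directory[index] & ~0xFFFu;
    return frame_base | offset;      /* frame base + offset           */
}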

The previous explanation is only a simple view, but picture the following scenario to figure out why it isn't practical:
Let there be a physical page of 4KB (typical).
Main memory holds addressable bytes (1 byte per addressable memory location), so we have 4096 (4K) addressable memory locations per physical page. It takes 12 address bits to address 4096 (4KB) locations (because 2^12 = 4096). This means the offset part of the virtual address has to be 12 bits, and the remaining 20 bits are left for the directory part of the virtual address.
If the page directory entries are 4 bytes in size, the page directory table would be 4MB in size (2^20 x 4B = 4MB).
Now imagine a table with 2^20 entries, or an array with 2^20 locations:

uint32_t addressPageMemTable[1048576]; /* horrible! */

And since each process has its own distinct page directory, much of the RAM would be consumed just for saving where everything is!

The solution to this problem is to use multiple levels of page tables like the following:

[Illustration: virtual address break-up for multi-level tables, representing a table walk to fetch the virtual address (0x008-018-0002)]

If a directory entry is marked empty, it obviously need not point to any lower-level table. This way the page table tree can be sparse and compact. Again, the highest directory table's base address is saved in a CPU register.
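To see how that sparseness works mechanically, here is a hedged two-level walk in C. It assumes a common 32-bit split (10 directory bits, 10 table bits, 12 offset bits), a present bit in the entries' low bits, and table addresses that are directly usable as pointers; all names are invented:

#include <stdint.h>

#define PTE_PRESENT 0x1u

/* Returns 0 on a fault. An empty top-level entry means no lower
 * table exists at all, which is what keeps the tree compact. */
uint32_t translate_2level(const uint32_t *page_directory, uint32_t vaddr)
{
    uint32_t dir_idx = (vaddr >> 22) & 0x3FF; /* bits 31..22 */
    uint32_t tbl_idx = (vaddr >> 12) & 0x3FF; /* bits 21..12 */
    uint32_t offset  =  vaddr        & 0xFFF; /* bits 11..0  */

    uint32_t dir_entry = page_directory[dir_idx];
    if (!(dir_entry & PTE_PRESENT))
        return 0; /* page fault: no lower table allocated */

    /* The directory entry holds the base of a second-level table;
     * we assume that base is directly addressable here. */
    const uint32_t *page_table =
        (const uint32_t *)(uintptr_t)(dir_entry & ~0xFFFu);

    uint32_t pte = page_table[tbl_idx];
    if (!(pte & PTE_PRESENT))
        return 0; /* page fault: page not mapped */

    return (pte & ~0xFFFu) | offset; /* frame base + offset */
}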

TLB

Even with the use of multiple levels of directory pages, performing a translation (table walk) for every address reference throughout the execution of a process is obviously not a very good idea, so CPU designers resorted to a specialized type of cache called the translation look-aside buffer, or TLB, in which translation results (the fully computed physical address that corresponds to a given virtual address, disregarding the offset part) are saved. The offset part is excluded because it plays no part in the base address calculation of the physical page; it is used only to determine how far into the physical page the MMU should reach to fetch the required address. Another reason why TLBs are used is the fact that, once the cache is taken into account, the MMU's accesses to the page tables (to perform table walks), which are stored in memory, gracefully defeat the purpose of the cache.

[Illustration: defeating the cache's purpose]

As illustrated above, the CPU generates virtual addresses corresponding to the virtual address space of the running process. Before such an address is passed to the cache for a possible cache hit, it has to be translated into a physical address. Without a TLB, a table walk has to be performed, which requires access to the page tables in main memory. But hey, the whole purpose of using a cache is to avoid a memory access in case of a cache hit!
By caching the page table entries themselves, we can avoid this problem. The TLB (or cache) is where we would cache, according to some policy, the page table entries required for table walks, in addition to, as stated above, some translation results, also based on some policy.

[Illustration: cache purpose regained]

In the above illustration, the CPU generates logical addresses and passes them to the MMU, which looks into the TLB for an already translated physical address. If none is found, a table walk (translation) is performed using the hopefully-available-in-TLB (or cache) page table entries; if that is not possible, a fetch from main memory is unavoidable. Note that the TLB can reside in cache, but is often placed inside the MMU.
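A toy direct-mapped TLB makes this lookup order concrete. Real TLBs are usually fully or set associative, and these names (tlb_lookup, tlb_fill) are invented for illustration:

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64

struct tlb_entry {
    bool     valid;
    uint32_t vpn; /* virtual page number (the tag) */
    uint32_t pfn; /* physical frame number         */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns true on a TLB hit; on a miss the MMU must do a table
 * walk and then install the result with tlb_fill(). */
bool tlb_lookup(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> 12; /* offset plays no part in the tag */
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn) {
        *paddr = (e->pfn << 12) | (vaddr & 0xFFF);
        return true;  /* hit: no table walk needed */
    }
    return false;     /* miss: walk the tables     */
}

void tlb_fill(uint32_t vpn, uint32_t pfn)
{
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    e->valid = true;
    e->vpn = vpn;
    e->pfn = pfn;
}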

It is worth noting that handling page faults in this setup is a rather costly process, and optimizing it poses serious complications; some of these complications and their solutions will be addressed within a series on this blog.


References & Resources

  • What Every Programmer Should Know About Memory by Ulrich Drepper, Red Hat, Inc.
  • Lecture series on Embedded Systems by Dr. Santanu Chaudhury, Dept. of Electrical Engineering, IIT Delhi
  • Understanding the Linux Kernel, 3rd Edition, by Daniel P. Bovet and Marco Cesati

Extra:

This is a short video you might want to watch: memory manager Unit 3