Virtual Memory

Virtual memory is the system that an operating system uses to isolate the memory of different processes. Each process operates under the illusion that it is the only process in memory and has access to every possible memory address. This is not true, but the operating system tricks the process into believing otherwise.

Regions of Memory #

Each memory access needs to be checked to ensure it is permitted. Some common checks include:

  • Read-only vs. read-write memory (e.g. can you do mov %rax, address?)
  • Kernel-only vs. user-mode memory (e.g. can you run mov %rax, address in user mode?)
  • Executable vs. data-only memory (e.g. can you run jmp address?)
  • Valid vs. invalid addresses

A common internal representation of memory regions is a list of segments. Each segment has a starting address, an ending address, and a set of permissions. The operating system can use this list to check if a memory access is valid. The OS uses the segments to create page table entries, which the hardware can view. The hardware can’t directly see the segments themselves.

A segmentation fault technically has nothing to do with segments. The fault is generated by hardware, which doesn’t know about segments. The operating system reacts to the fault by throwing a “segmentation violation” signal.

In POSIX-compatible OSes, the functions mmap and munmap are provided for requesting that the OS adjust its segments. User applications typically use higher-level wrappers like malloc instead.

Hardware access to memory protections is pretty simple. Each address is split into two parts: a page number made up of high-order bits and a page offset made up of low-order bits. Each time memory is accessed in any way, the address is translated from a virtual address in code to a physical address on the memory chip. The steps for this are as follows:

  1. The virtual page number (remember, that’s the high-order bits of the virtual address) is used as the key of a page table, a map between virtual page numbers and page table entries.
  2. Page table entries contain a bunch of protection or flag bits, such as read/write/execute/kernel bits and a flag indicating whether the virtual page has a corresponding physical page. If the access cannot proceed, a fault is generated (if unmapped, a page fault; otherwise, a protection fault). The fault handler should either change the page table to make the address work correctly the second time or it should just kill the program.
  3. If the necessary permissions are present, the page table entry contains a physical page number. Combine the physical page number with the page offset (the low-order bits of the virtual address) to get the physical address.

This entire system is designed so that the hardware can guarantee a couple things:

  • Only kernel-mode code can change the page table.
  • All memory accesses are guarded by the page table.

This isn’t a perfect system – there are ways to create side channels by taking advantage of various optimizations in the system – but in general, this fulfills its goal of fully isolating processes from each other.

Page Tables #

A page table is basically just a map between page numbers (fixed-size integers) and page table entries (a physical page number plus some boolean flags).

Back in the old days, when there wasn’t a very large addressable space, a single array was good enough for a page table. A special kernel-only register called the page table base register or PTBR would store the address of the array. However, with 32- and (especially) 64-bit addresses, single-level page tables like this are no longer feasible. Instead, we use multi-level page tables. These are an instance of high-arity fixed-depth trees (or wide-node fixed-depth trees). The idea is that you make all your leaves the same depth away from the root, but you make each level have a very high arity (i.e. a very large number of children). This allows you to store a very large number of entries in a tree without having to make the tree very deep, cutting down on the number of pointers needed. Walking a multi-level table is generally slower than indexing a single-level one, since each level adds a memory access. The advantage comes if the majority of indices have no value, since you can leave out entire subtrees with some basic null checks and save a ton of space.

x86-64 processors handle 64-bit addresses as follows:

  • Ignore the highest-order 16 bits. 48 bits is a big enough address space.
  • Make 64-bit page table entries. Physical page numbers are larger than virtual page numbers. The OS needs to verify that it is actually using an address available in RAM.
  • Pick a page size such that every page table node fits in a single page. For example, let’s select 4 KiB – 4096 bytes – as our page size. We can make a four-level page table, where each level is indexed by 9 bits of the virtual page number, giving 512 entries per level. At 8 bytes per entry, each page table node is 512 × 8 = 4096 bytes, which is exactly one page. Ta-da!

An address translation does something like this, except compiled down to physical transistors.

size_t vpn[4] = split_bits(va>>12, 9);
size_t ppn = PTBR;
for(int i=0; i<4; i+=1) {
    PTE pte = ram[(ppn<<12) | (vpn[i]<<3)];
    if (pte.unmapped) page fault;
    if ((pte.flags & current_usage) != current_usage) protection fault;
    ppn = pte.ppn;
}
pa = (ppn<<12) | (va & ((1<<12)-1));

The programmable parts are the fault handlers, which are more like this:

void handle_page_fault(size_t va, int access_type) {
    // Get permitted actions for the virtual address
    int flags = permitted_actions(va, segment_list);
    // If the action we want isn't allowed...
    if ((access_type & flags) != access_type) {
        // ...then generate a SIGSEGV and stop
        send_signal(SIGSEGV);
        return;
    }
    // Otherwise, get an unused physical page
    ssize_t ppn = get_unused_physical_page();
    // If one doesn't exist, swap another process' page out
    // This means writing the page to disk and freeing the page
    if (ppn < 0) ppn = swap_page();
    // Create a page table entry for the new page
    create_page_table_entry(va, ppn, flags);
}

Given how often address translation needs to happen, making it fast is important to keep the computer efficient. Modern processors use a cache called the translation lookaside buffer (or TLB) to keep the most-used address translations in high-speed memory. Conceptually, the TLB is used as follows:

PTE pte;
size_t ppn;
size_t vpn_full = va>>12;
if (tlb_has(vpn_full)) {
    pte = tlb_lookup(vpn_full);
    if ((pte.flags & current_usage) != current_usage) protection fault;
    ppn = pte.ppn;
} else {
    size_t vpn[4] = split_bits(vpn_full, 9);
    ppn = PTBR;
    for(int i=0; i<4; i+=1) {
        pte = ram[(ppn<<12) | (vpn[i]<<3)];
        if (pte.unmapped) page fault;
        if ((pte.flags & current_usage) != current_usage) protection fault;
        ppn = pte.ppn;
    }
    tlb_store(vpn_full, pte);
}
pa = (ppn<<12) | (va & ((1<<12)-1));

Usage #

There are a variety of uses for virtual memory.

One use comes when a program is started. Instead of allocating all the memory the program needs at once, the OS can allocate a small amount of memory and then allocate more as the program needs it, a technique called lazy allocation.

Another use is that the OS can swap page data between RAM and disk to free up space for running processes. The processes don’t know the difference because all they see is the set of virtual addresses. However, disks are way slower than memory, so this should be avoided when possible. OSes will try their best to pick pages to swap that probably won’t be used anytime soon, but that’s technically undecidable, so it’s not perfect. If a process starts spending more time swapping than doing actual work, it’s considered thrashing.

You can also create shared memory by mapping the same physical memory region to multiple virtual memory regions. This is useful for inter-process communication and shared libraries.

When switching between processes, the OS saves the user state of the old process into a data structure that tracks inactive processes, changes the PTBR to point to the new process’s page table, and invalidates the entries in the translation lookaside buffer. It then loads the new process’s saved state from that data structure into the saved-user-state location, so it is restored when the exception handler returns.

Finally, physical memory pages can be given to systems other than software processes. Various I/O operations can be performed by allocating a page or more of memory to be accessed directly by I/O systems. This is called direct memory access or DMA. This is useful for things like network cards, which can just write data directly to memory instead of having to go through the CPU. (Technically, DMA doesn’t require virtual memory, but it’s often implemented as part of the virtual memory system.)