Ever tipped a mediocre waiter 20% only to discover later that you *tremendously* over-tipped since 18% had already been included? Or installed software that changed your default search engine to Bing before you realized? Or purchased a gift from an online retailer that automatically opted you into its annoying newsletter and monthly catalogue? Now you’re stuck with less cash, suboptimal search results, and a cluttered inbox/mailbox, all from being oblivious to secret defaults. Upgrades inflict the same kind of pain. For example, when you upgrade to the most recent Linux distro on the latest CPUs, you’re
automatically enrolled in 5-level page tables
(up from 4-level). Is that a big deal? Will this automatic opt-in impact you like the aforementioned automatic gratuity or search engine change or mailing list subscription?
To answer these questions, we’ll discuss the purpose of page tables, the motivation for moving from 4 to 5 levels, and the performance implications of that move. Along the way, we’ll run a couple of experiments with some
Active Benchmarking
layered in.
In my personal life, I’m more of a
“Why not?”
guy. Spar a bigger guy at the boxing gym – why not? Go jump out of a plane to conquer my fear of heights – why not? Approach that cute stranger I noticed while out on my lunch break and ask her out – why not? But in my duties as a Performance Engineer, I’m a
“Why?”
guy. There must be a compelling reason for me to adopt something new in Production.
Therefore, in the interest of not burying the lede, I recommend
disabling
5-level page tables if your machine doesn’t contain more than 64TB of RAM. Disabling them is as simple as adding “no5lvl” to your kernel boot parameter list. You can check whether your system has this feature enabled by running:
lscpu | grep -i la57
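If you’d rather script that check across a fleet, here is a minimal sketch of the same test in Python. It assumes a Linux host that exposes /proc/cpuinfo and /proc/cmdline, and the helper names (cpu_flags, has_la57, booted_with_no5lvl) are mine, not part of any existing tool.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def has_la57(cpuinfo_text):
    """Same test as `lscpu | grep -i la57`: is the la57 flag listed?"""
    return "la57" in cpu_flags(cpuinfo_text)


def booted_with_no5lvl(cmdline_text):
    """True if the kernel command line contains the no5lvl parameter."""
    return "no5lvl" in cmdline_text.split()


if __name__ == "__main__":
    cpuinfo, cmdline = Path("/proc/cpuinfo"), Path("/proc/cmdline")
    # Only meaningful on Linux; skip quietly elsewhere.
    if cpuinfo.exists() and cmdline.exists():
        print("la57 flag present:", has_la57(cpuinfo.read_text()))
        print("no5lvl on cmdline:", booted_with_no5lvl(cmdline.read_text()))
```

The parsing functions take plain text, so you can feed them output collected from remote machines as easily as local files.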
And now onto the show!
Purpose of Page Tables
Our applications use virtual addresses for code and data, but these must be translated to physical addresses that the HW works with. Therefore, every memory access requires a virtual-to-physical address translation performed by the per-core Memory Management Unit (MMU). The MMU consults an in-memory data structure to facilitate this function.
Making this data structure a linear array of 1-to-1 virt-to-phys mappings for every process would be wasteful since apps typically use only a fraction of the available virtual address space. So, a sparse multi-level radix tree is used, instead. This tree is known as the “page table”. When an app references memory at virtual address 0x7fdbef789000, the MMU walks this in-memory page table to find its associated physical address.
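To put a number on “wasteful”, here is a back-of-the-envelope sketch. It assumes 4KB pages, 8-byte table entries (the x86-64 entry size), and 48 translated address bits; the function names are illustrative only.

```python
PAGE_SIZE = 4 * 1024   # 4KB pages
ENTRY_SIZE = 8         # bytes per x86-64 page table entry
VA_BITS = 48           # virtual address bits covered by the translation


def flat_table_bytes(va_bits=VA_BITS):
    """Size of a hypothetical linear array with one entry per virtual page."""
    pages = 2 ** va_bits // PAGE_SIZE
    return pages * ENTRY_SIZE


def radix_tables_bytes(levels=4):
    """Minimum table memory needed to map one 4KB page: one 4KB table per level."""
    return levels * PAGE_SIZE


if __name__ == "__main__":
    print("flat array per process:", flat_table_bytes() // 2**30, "GiB")
    print("radix tree, one mapped page:", radix_tables_bytes() // 1024, "KiB")
```

A linear array would cost 512 GiB per process before mapping a single byte, while the sparse tree starts at 16 KiB and grows only as the app touches more of its address space.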
“Ok, so then what’s the story on the 4-level vs. 5-level controversy? Why should I care?”
Well, recall that every memory reference requires the MMU to perform this virt-to-phys translation. So we want that process to be as fast as possible for performance. Let’s dive a little more deeply into this MMU walking process to get a better grasp on this topic.
Page Table Walking
Your app
malloc()s
some memory and gets back a 64-bit pointer to it. While the datatype used to store this pointer is 64 bits in size, only 48 of those bits participate in the MMU translation process on CPUs using 4-level page tables. The remaining upper bits are sign-extended from bit 47. Now, let’s say your app wants to write to that
malloc()-ed
address space – that’s when the MMU jumps into action.
It splits the top 36 of those 48 bits into four (4) groups of 9 bits, each of which indexes into a descending level of page tables before it reaches the physical address of a page in RAM. The final 12 bits of the 48-bit virtual address represent the offset into the aforementioned physical page.
NOTE
: The CR3 register depicted below contains the physical address of the top-level of the page table for the scheduled process.
Figure 1: 4-level Page Table
As illustrated in
Fig. 1
, bits 47 – 39 form an index to an entry in the Page Global Directory (PGD), which holds the physical address of a Page Upper Directory (PUD). Bits 38 – 30 index to an entry of that PUD which, in turn, holds the physical address of a Page Middle Directory (PMD). Then, bits 29 – 21 index to an entry of that PMD, which holds the physical address of a last-level page table (the PTE level). Next, bits 20 – 12 index to a Page Table Entry (PTE) in that table, which holds the physical address of the 4KB page in RAM. Finally, the last 12 bits select the offset within that 4KB page. That is the MMU page table walk process in a nutshell.
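The bit slicing is easier to see in code. This is a minimal sketch of the index extraction only (not of a real page walk, which needs the in-memory tables), using the example address from earlier. The function names are mine.

```python
INDEX_BITS = 9     # each level is indexed by 9 bits (512 entries per table)
OFFSET_BITS = 12   # low 12 bits select the byte within a 4KB page
LEVELS = ("pgd", "pud", "pmd", "pte")   # top level first


def split_va(addr):
    """Split a 48-bit virtual address into its four table indices and page offset."""
    parts = {"offset": addr & ((1 << OFFSET_BITS) - 1)}
    shift = OFFSET_BITS
    for level in reversed(LEVELS):          # pte uses bits 20-12, pgd bits 47-39
        parts[level] = (addr >> shift) & ((1 << INDEX_BITS) - 1)
        shift += INDEX_BITS
    return parts


def is_canonical(addr, va_bits=48):
    """True if bits 63..va_bits are all copies of bit va_bits-1 (sign extension)."""
    upper = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper == 0 or upper == all_ones


if __name__ == "__main__":
    print(split_va(0x7FDBEF789000))
    # {'offset': 0, 'pte': 393, 'pmd': 379, 'pud': 367, 'pgd': 255}
    print(is_canonical(0x7FDBEF789000))
```

Entry 255 of the PGD, entry 367 of the PUD it points to, entry 379 of that PMD, entry 393 of the last-level table, and finally byte 0 of the page.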
It would really suck for performance if every translation required the MMU to go through that entire walk process. After all, we know full well how
expensive RAM access
is. Let’s briefly discuss some of the structures that help speed up the process.
Translation Lookaside Buffer (TLB)
Just like every core on a CPU contains multiple levels of instruction and data caches, it also contains multiple levels of
TLBs
. These structures cache the results of MMU page walks to reduce the need for walking the page table in the near term. Due to its associativity level and limited size, the TLB rarely offers a 100% hit rate. But coding and OS configuration techniques both help increase its caching effectiveness.
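One way to reason about that caching effectiveness is “TLB reach”: the amount of memory the TLB can cover at once. The sketch below is simple arithmetic; the entry count is a hypothetical figure chosen for illustration, so check your own CPU’s documentation for real numbers.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3


def tlb_reach_bytes(entries, page_size):
    """Memory covered when every TLB entry maps a distinct page."""
    return entries * page_size


if __name__ == "__main__":
    entries = 1536  # hypothetical second-level TLB size, for illustration only
    print("4KB pages:", tlb_reach_bytes(entries, 4 * KIB) // MIB, "MiB")
    print("2MB pages:", tlb_reach_bytes(entries, 2 * MIB) // GIB, "GiB")
```

With the same number of entries, huge pages stretch the reach from a few MiB to a few GiB, which is why they show up on every low-latency tuning checklist.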
Paging Structure Caches
Both Intel and AMD utilize special caches for the upper page levels in the hierarchy in the event of a TLB miss. Looking back at
Fig. 1
, these caches would hold info about the PGD, PUD, and PMD levels of the page table. TLBs cache the final PTE level info.
Intel refers to these caches as
Paging Structure Caches
while AMD refers to them as
Page Walk Caches
. While organization specifics differ, each serves the purpose of lightening the MMU load in the event of a TLB miss.
From 4 to 5 levels – Why?
Unfortunately, 4-level page tables and 48-bit virtual addresses are only good for a measly 64TB of RAM. Of what use is that pittance?!? Seriously, though, modern architectures built around memory disaggregation-enabling protocols like
CCIX
,
CXL
, and
Gen-Z
will turn 64TB into the 21st Century version of Bill Gates’ oft-quoted but misattributed 640KB. To break through that RAM limit, the usable virtual address width was extended from 48 to 57 bits. That extra group of 9 bits requires a 5th page table level, which Linux calls the Page Level 4 Directory (P4D), illustrated below in
Fig. 2
.
Figure 2: 5-level Page Table
There you have it! If you’ll need more than 64TB of RAM in the near future, you’re ecstatic about this new default kernel option. But what about the vast majority that comprises the rest of us? Should *we* be happy with this “automatic gratuity”?
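For a sense of scale, here is the same address arithmetic for both modes. It is a sketch under the assumptions used throughout this article (4KB pages, 9 index bits per level); the function names are illustrative.

```python
OFFSET_BITS = 12
INDEX_BITS = 9


def va_bits(levels):
    """Virtual address bits translated by a page table with this many levels."""
    return OFFSET_BITS + INDEX_BITS * levels


def va_space_bytes(levels):
    """Size of the virtual address space those bits can describe."""
    return 1 << va_bits(levels)


def walk_indices(addr, levels):
    """Table indices visited by a full walk, top level first."""
    mask = (1 << INDEX_BITS) - 1
    return [(addr >> (OFFSET_BITS + INDEX_BITS * i)) & mask
            for i in reversed(range(levels))]


if __name__ == "__main__":
    for levels in (4, 5):
        print(levels, "levels:", va_bits(levels), "bits,",
              va_space_bytes(levels) // 2**40, "TiB of virtual address space,",
              len(walk_indices(0x7FDBEF789000, levels)), "lookups per full walk")
```

Four levels describe 256 TiB of virtual address space and five describe 128 PiB. The kernel’s own documentation puts the corresponding physical memory limits at 64 TiB and 4 PiB, which is where the 64TB figure comes from. The price is one more table lookup on every full walk.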
Benchmarking 4-level vs. 5-level Page Tables
As I’ve written in past articles, there is no benchmark quite like your own application – the very one that you run in Production, the one upon which your company depends. So that’s exactly what I used. I ran a comparison between both modes in my R&D lab. And the results? Nada. Nothing. Zilch. No measurable impact.
But wait! You gotta remember that my shop is all about low latency trading. This means, for our trading apps which run in software, we employ all the optimal coding and configuration tips
alluded to in prior articles
. Optimal L1d cache usage, cache warming, thread pinning, small working memory footprint, huge pages, pre-faulting, TLB Shootdown prevention,
fine-tuned DDIO
, kernel bypass, etc. Our TLB miss rate is infinitesimal to the point of being a non-factor. And whatever miss rate we do incur, I’m certain the
Intel Paging Structure Caches
pick up enough of the slack to drastically reduce MMU page walks given our memory access pattern.
Even still, I disable 5-level page tables anyway. I mean, really, why allow an extra level of MMU page walk indirection just for the heck of it? If you’re giving me the option, I’m gonna take the shorter route!
But still, I was curious to measure its impact in more typical application workloads (low latency trading is an outlier). So I leaned on some industry benchmark packages for experimentation.
How Do We Select Pertinent Benchmarks?
I don’t have access to a machine with gobs of RAM. But what I *do* have is insight into other uses which page tables serve. For example, one is page faulting – the kernel must traverse the page tables to accomplish this task. I’d say page faulting is a pretty common and significant task, wouldn’t you?
With that goal in mind, I used LmBench’s
lat_pagefault
for single-threaded minor fault measurements, and
ebizzy
for multi-threaded experimentation. The commands took this form:
numactl --membind=1 --cpunodebind=1 ~/lmbench/bin/x86_64-linux-gnu/lat_pagefault -N 100 /tmp/bigfile
numactl --membind=1 --cpunodebind=1 ~/utils/benchmark/ebizzy-0.3/ebizzy -m -l -p -t 7 -n 7 -S 10
All tests were run on an Ice Lake CPU system running Rocky Linux 8.5, using a different form of CPU isolation for each test: isolcpus for
lat_pagefault
and the systemd CPUAffinity setting in /etc/systemd/system.conf for
ebizzy
. A fresh reboot precedes every test. Each test executes 30 times with the mean and 95% Confidence Interval (CI) calculated from the runtimes. I use arithmetic mean for
lat_pagefault
since it reports duration (μs). And I use harmonic mean for
ebizzy
since it reports throughput (records per second).
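For readers who want to reproduce the summary statistics without R, here is a minimal Python sketch. It is not the script I used: it relies on a normal approximation for the 95% CI (a t-distribution gives a slightly wider interval at 30 runs), and the sample values are made up for illustration.

```python
from statistics import NormalDist, harmonic_mean, mean, stdev


def mean_with_ci(samples, confidence=0.95):
    """Arithmetic mean and CI half-width, using a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # ~1.96 for 95%
    half_width = z * stdev(samples) / len(samples) ** 0.5
    return mean(samples), half_width


if __name__ == "__main__":
    # Hypothetical numbers, NOT measurements from this article.
    latencies_us = [0.24, 0.25, 0.23, 0.26, 0.24]
    throughputs = [41000.0, 39500.0, 40200.0, 40800.0]

    avg, half = mean_with_ci(latencies_us)
    print(f"latency: {avg:.3f} us +/- {half:.3f}")
    print(f"throughput (harmonic mean): {harmonic_mean(throughputs):.0f} records/s")
```

The harmonic mean is always at or below the arithmetic mean, so a single slow run drags a throughput summary down more than a single fast run lifts it.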
Single-threaded Benchmarking
Using the same 12MB file, I measured the time it took to fault in pages to a process’s address space using
lat_pagefault
and processed the results with R.
Multi-threaded Benchmarking
I’ve chosen the widely used
ebizzy
benchmark for our multithreaded page fault experiment. Here’s how the benchmark is explained in its source code header:
This program is designed to replicate a common web search app workload. A lot of search applications have the basic pattern: Get a request to find a certain record, index into the chunk of memory that contains it, copy it into another chunk, then look it up via binary search. The interesting parts of this workload are: large working set, data alloc/copy/free cycle, and unpredictable data access patterns.
Notice the increasing margin in throughput between the two as the number of threads increases. But how can we be certain that this is due to page fault overhead? I ran a separate set of tests with the repetitive
munmap()
calls taken out of the main thread loop. Note the identical throughput in this scenario:
First things first: do you notice a function call in one of the runs that doesn’t appear in the other? Yep! Kernel function
p4d_offset.part.18
takes up cycles only in 5-level page table mode. Do you remember from earlier sections why that might be? From
Fig. 2
above, we see that the P4D page is added for 5-level page tables, so this makes perfect sense.
The other difference we note from “perf diff” is the extra time spent in spinlocks,
_raw_spin_lock
, for 5-level page tables. This, too, is due to the extra upper-level directory. Updates at each level of the page table are protected by spinlocks. At the PMD and PTE levels, these locks work at page granularity. Spinlocks for updates at the uppermost levels, on the other hand, lock the
entire
page table. It’s understandable, then, how an extra page table level at such lock coarseness would interfere with page fault scalability.
SIDE NOTE
: Notice the overhead imposed by the *completely* unnecessary
cgroup_throttle_swaprate
function for both 4-level and 5-level page tables? Every page fault incurs that overhead even if you include “cgroup_disable=memory” in your kernel boot parameter list! Another in a long list of reasons to avoid runtime page faults in your latency-sensitive applications. But on the bright side, it looks like there’s a
patch in the works
to address this issue.
Shall we round out our benchmarking session with a direct measurement of page fault latency with our friend,
eBPF
? Let’s start with a snapshot from the 4-level page table
ebizzy
test using four (4) threads:
Notice the mode difference between them. Page fault latency primarily clusters within the 256 -> 511ns bucket for 4-level, but within the 512 -> 1023ns bucket for 5-level. Our experimentation demonstrates that the extra P4D level, coupled with its associated spinlock overhead, contributes to this latency shift.
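If the bucket labels look odd, they come from the power-of-two histograms that eBPF latency tools commonly print. Here is a minimal sketch of how a latency lands in a bucket and how the mode is found. The function names and sample values are mine, for illustration only.

```python
from collections import Counter


def log2_bucket(value_ns):
    """Return the (low, high) power-of-two bucket that contains value_ns."""
    if value_ns < 1:
        return (0, 0)
    low = 1 << (value_ns.bit_length() - 1)
    return (low, 2 * low - 1)


def modal_bucket(samples_ns):
    """Bucket holding the most samples, i.e. the histogram's mode."""
    counts = Counter(log2_bucket(s) for s in samples_ns)
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical latencies in nanoseconds, NOT data from the runs above.
    print(modal_bucket([300, 410, 480, 700]))
    print(modal_bucket([300, 640, 700, 900]))
```

Because each bucket is twice as wide as the one before it, a mode that moves up one bucket means typical latency has grown by up to 2x, not by a few nanoseconds.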
Of course, we could perform MUCH more testing for a fuller grasp of 5-level page table impact. Page faulting, after all, isn’t the only instance during which the kernel walks the page table. Then, of course, there’s the MMU page walkers themselves. We could track differences in the
dtlb_(load|store)_misses.walk_active
PMU metrics between both page table modes for that. But as a wise person once said,
“Ain’t nobody got time for that!”
. So, I’ll leave the rest as an exercise for the reader.
NOTE
: For an in-depth discussion on using
perf
for profiling and reading hardware PMU counters for application performance analysis, check out our book
Performance Analysis and Tuning on Modern CPUs
(paid affiliate link).
Always Read the Fine Print
I could’ve avoided double-tipping if only I’d read the check before paying the bill. I could’ve clicked
Do Not Accept
on the search engine switch during that software installation if I’d read the whole agreement. And I could’ve clicked
Do Not Subscribe
after that online retailer purchase if only I’d read the entire Order page. Doing so would’ve spared me the hit to my pocket, my search result quality, and my INBOX.
Likewise, keeping abreast of upcoming changes with your chosen Linux distribution can spare you performance regression headaches at each upgrade cycle (remember the “numa_balancing” headache after upgrading from RHEL 6 to RHEL 7?). Based on the results above, you just might wanna click the
Do Not Enable
dialog box on 5-level page tables. Especially if you run multithreaded applications in performance-sensitive environments. And, for cryin’ out loud, avoid runtime page faults in latency-sensitive environments as much as humanly possible!