Developer Blog

Shorten Linux boot time with deferred memory block initialization

Written by

Sudarshan Rajagopalan

Jun 2, 2024

If you’re a Linux kernel developer looking for ways to reduce boot time, take a look at deferred memory block initialization. Our Linux Kernel team is always exploring ways to boot devices faster, so we’ve made some progress in this area.

In this post, I’ll explain the background of initializing some system memory blocks in deferred fashion and walk you through options for implementing it. (This is a summary of my presentation “Deferred Memblocks Init for Boot Time Reduction” at Embedded Open Source Summit 2024. See below for details and links.)

System memory initialization

Here’s a visual representation of system memory initialization:

(Note: In the diagram above and throughout this article, I’ll use 12GB as a typical amount of RAM, unless otherwise indicated.)

After start_kernel comes setup_arch, which contains memblock_init, paging_init and bootmem_init. mm_init runs afterward, outside setup_arch.

paging_init is like a foundation of memory management, invoked when the kernel boots up. It sets up the page tables for all double data rate (DDR) RAM.
bootmem_init initializes all the memory blocks by creating the memmap metadata, also known as struct page. Every page in memory is described by a struct page, so bootmem_init is responsible for memblock_init. It is also responsible for the sparsemem initialization of all sections in RAM.
mm_init initializes all the kernel memory allocators through calls like stack_depot_early_init(), kmem_cache_init(), kmemleak_init(), vmalloc_init() and mm_cache_init(). The call of most interest here is mem_init(), because its execution time depends on the amount of RAM on the device. mem_init() hands all pages that are still free after early boot to the system allocator as part of the free list.

The problem is that the execution time of those functions, including memblock_init, depends on the amount of DDR RAM you have. The larger the DDR, the more initialization work and time are required. All four _init functions work through all installed RAM in a single-threaded fashion on the boot CPU, even on a symmetric multiprocessing (SMP) system. Therefore, on devices with hundreds of gigabytes (or terabytes) of memory, it takes much longer until the boot sequence is complete and userspace is up.
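To get a feel for why this work scales with RAM size, the sketch below counts the struct page entries that have to be set up. The 4 KiB page size and the 64-byte struct page are assumptions (common on 64-bit builds, but check your own configuration), the function name is made up for illustration, and GB is treated as GiB.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def memmap_cost(ram_bytes, page_size=4096, struct_page_size=64):
    """Return (number of pages, bytes of memmap metadata) for a RAM size.

    page_size and struct_page_size are assumed values, not read from a kernel.
    """
    pages = ram_bytes // page_size
    return pages, pages * struct_page_size

for gib in (2, 12):
    pages, memmap_bytes = memmap_cost(gib * GIB)
    print(f"{gib} GiB: {pages} pages, {memmap_bytes // MIB} MiB of memmap")
```

Under those assumptions, 12 GiB means about 3.1 million struct page entries (192 MiB of metadata), against roughly half a million (32 MiB) for a 2 GiB subset.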

The idea, then, is to defer some memory block initialization until later, thereby reducing boot time and getting to userspace sooner.

Deferred memory block initialization

It’s possible to reduce boot time by bringing up the system with just a subset of memmap. It works because you don’t need all available memory to boot the kernel; the subset requires only enough memory to initialize the kernel and userspace. Later – say, once the kernel is initialized and any secondary CPUs are up – the rest of the memory blocks can be initialized in parallel fashion.

This approach is based on DEFERRED_STRUCT_PAGE_INIT. It has been in the five most recent generations of the Linux kernel, mainly for servers with their terabytes of system RAM. The remaining memory is initialized in a deferred fashion using kthreads and a deferred probe mechanism once all the drivers are loaded.

Since this takes place at the end of kernel initialization, it may affect other functions in that window of time, with the potential downside of an impact on performance. As I describe below, the effect is minimal.

This is a visual representation of deferred memory initialization, depicting DDR RAM, time and steps:

Assuming 12 GB of DDR RAM, the first 2 to 3 GB suffice to initialize the kernel and userspace. Using bootmem_init, you tell the kernel to initialize only that much memory.

To initialize the remaining 9 or 10 GB, you initialize in deferred fashion, either by giving kernel kthreads responsibility, or from userspace. As described below, the userspace option gives you the flexibility to add memory to different nodes and memory zones in the system. You can also add x amount of memory to one zone and y amount of memory to a different zone.

Boot time is a function of the power of the CPU, as the diagram below illustrates:

On the left side, the vertical black bar depicts the normal scenario of booting with, say, 12 GB of RAM. Everything is done with the relatively low horsepower of boot CPU0, so it takes the amount of time shown by the dotted line at the top.

From the middle toward the right, you see how deferred memory initialization works. We use CPU0 to initialize the kernel in only a subset of the memory. Once the kernel and SMP are initialized, we have access to the more-powerful CPU1 and CPU2, shown as wider bars. We use them to schedule initialization of different memory zones in a parallel fashion. Finally, once CPU3 – a much fatter or bigger CPU – is available, we use it to initialize the remaining memory.

It takes less time to initialize the same amount of memory because we’ve offloaded the work from the boot CPU to other, more powerful CPUs. The sooner you can initialize them, the more efficiently you can initialize the remaining memblocks and the sooner you can get to userspace.

Adding memory to required zones

Above, I mentioned that the userspace option of deferred memory initialization gives you the flexibility to add memory to different nodes and zones in the system. Here’s how.

The Linux kernel divides your system RAM into different memory zones. They include ZONE_NORMAL (for regular, non-movable kernel allocations), ZONE_DMA (for DMA by hardware with limited addressing, such as devices that support only 32-bit addressing) and ZONE_MOVABLE (for user-space applications). You could configure a system with 12 GB of RAM to have 2 GB of DMA zone, 5 GB of normal zone and 5 GB of movable zone, for a single non-uniform memory access (NUMA) node.

But you can’t use the kernel (kthreads) to add memory, because the kernel will have no knowledge of these configurations, like how much memory to allocate for each zone. Instead of using the kernel to add memory, it’s better to use userspace, which will have knowledge of the configurations and will use the memory hotplug framework. For example:

echo addr > /sys/devices/system/memory/probe

From userspace, you write the physical start address (addr) of the memory to add to the probe file. You don’t specify a size: each write adds one memory block, and the memory block size is fixed, usually 128 megabytes for Arm64 systems. To add more memory, you probe further block addresses.

Once you’ve added that memory, you can initialize it as required.

echo online_movable > /sys/devices/system/memory/memoryXXX/state

online_movable means you initialize the just-added memory block into the Movable zone.

echo online_kernel > /sys/devices/system/memory/memoryXXX/state

online_kernel means you initialize the just-added memory block into the Normal zone.
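The probe and online steps above can be strung together for a whole address range. The following is a minimal sketch that only builds the command strings; it does not touch sysfs. It assumes a 128 MiB block size, a block-aligned range, and that the XXX in memoryXXX equals the start address divided by the block size. The function name is hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
SYSFS_MEMORY = "/sys/devices/system/memory"

def hotplug_commands(start, size, state, block_size=128 * MIB):
    """Build the probe/online commands for every memory block in a range.

    Assumes start and size are multiples of block_size, and that the block
    index in memoryXXX is start address // block size.
    """
    if start % block_size or size % block_size:
        raise ValueError("range must be aligned to the memory block size")
    commands = []
    for addr in range(start, start + size, block_size):
        block_id = addr // block_size
        commands.append(f"echo {addr:#x} > {SYSFS_MEMORY}/probe")
        commands.append(f"echo {state} > {SYSFS_MEMORY}/memory{block_id}/state")
    return commands

# Example: 1 GiB starting at the 2 GiB boundary, onlined into the movable zone.
for line in hotplug_commands(2 * GIB, 1 * GIB, "online_movable")[:4]:
    print(line)
```

On a real system, the block size is better read from /sys/devices/system/memory/block_size_bytes than hard-coded.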

How to limit boot memory and specify remaining memory

To tell the kernel to initialize only a certain amount of memory during boot, like 2 GB, we can use the mem= command-line parameter. Then, once you’ve specified that, you have to tell the kernel (or userspace) how much RAM remains – in this example, 10 GB. That requires a quick calculation.

The devicetree (DT) contains a memory node that resembles this:

/ {
    #address-cells = <2>;
    #size-cells = <2>;
    memory {
        device_type = "memory";
        reg = <0x0 0x0 0x3 0x00000000>; /* base 0x0, size 0x3_0000_0000 (12 GB) */
    };
};

It’s usually initialized by the bootloader according to the size of attached DDR. It populates the reg property, which is the RAM partition table, with the size and start address of the partition. For example, the single entry in reg could specify that there is 12 GB of memory at address zero.

In the kernel driver, you scan that memory node for the reg property; that gives you the size of attached RAM. Then, the end address of boot memory (the memblock end of DRAM) indicates where boot-time initialization ended; here, 2 GB. Since you know that bootmem_dram_end is 12 GB, you know that 12 - 2 = 10 GB of RAM remains. That’s the amount to be added in a deferred fashion, whether you use the kernel or userspace.
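The calculation can be sketched in a few lines. This is an illustration in Python rather than driver code: it decodes the reg cells from the example node and subtracts the mem= limit. It assumes a single contiguous RAM bank and treats GB as GiB; both function names are hypothetical.

```python
GIB = 1024 ** 3

def parse_reg(cells, address_cells=2, size_cells=2):
    """Decode a devicetree reg property (32-bit cells) into (base, size) pairs."""
    def join(words):
        value = 0
        for word in words:
            value = (value << 32) | word
        return value

    stride = address_cells + size_cells
    if len(cells) % stride:
        raise ValueError("reg length does not match #address-cells/#size-cells")
    entries = []
    for i in range(0, len(cells), stride):
        base = join(cells[i:i + address_cells])
        size = join(cells[i + address_cells:i + stride])
        entries.append((base, size))
    return entries

def deferred_bytes(cells, boot_limit):
    """RAM left above the mem= limit, assuming one bank (only entry 0 is used)."""
    base, size = parse_reg(cells)[0]
    dram_end = base + size
    return max(dram_end - (base + boot_limit), 0)

reg = [0x0, 0x0, 0x3, 0x00000000]           # the reg value from the example node
print(deferred_bytes(reg, 2 * GIB) // GIB)  # mem=2G leaves 10
```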

Kernel and userspace options for deferred memory initialization

First, I’ll explain the kernel method, which uses the mem= parameter and a memory hotplug function called add_memory_driver_managed(), and why it offers less flexibility.

The memory hotplug has the following command-line parameters:

memhp_default_state=online means that any memory you add is brought online automatically, so no separate step is needed to bring those pages online.
movable_node (or kernel node) specifies the memory zone to be added by default.

The limitation of using a kernel driver is that you can add memory to only one memory zone, and your choice cannot be changed at runtime. In the example above, of 10 GB of deferred memory initialization, you must add all 10 GB as either movable zone or normal zone; you may not split it into more than one zone.

Fortunately, the userspace method overcomes that limitation. It still uses the kernel, which reads through the DT nodes, captures the mem= value and determines how much memory remains. But it then exposes that data to userspace through the sysfs nodes in /sys/kernel/deferred_mem_init:

/sys/kernel/deferred_mem_init/memblock_end_of_dram is the 2 GB in our example.
/sys/kernel/deferred_mem_init/bootmem_end_of_dram is the end of the 12 GB.
/sys/kernel/deferred_mem_init/deferred_kernelcore is the amount to add to the kernel zone for the kernel core.
/sys/kernel/deferred_mem_init/deferred_movablecore is the amount to add to the movable zone for the movable core.

The userspace method offers you the flexibility to add memory to the required zones that the kernel method does not offer. The only downside is that you must wait until userspace is initialized.

Here is an illustration of DDR RAM once you’ve added memory to required zones:

bootmem_size = 2 GB, because the mem= param is set to 2 GB.
dram_size = 12 GB, the amount of RAM attached to the system.
movablecore-size = 5 GB, specified in a DT property.
That leaves 12 - (2 + 5) = 5 GB of remaining memory, which goes to the normal zone. It will be added to the kernel core (kernelcore-size) by userspace through the memory hotplug.
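The arithmetic for the normal zone is simple enough to check in code. This is a minimal sketch with a hypothetical function name; sizes are in GB, and the remainder is what is left once boot memory and the movable core are subtracted from installed RAM.

```python
def kernelcore_size(dram_size, bootmem_size, movablecore_size):
    """Deferred memory left for the normal zone after boot memory and movablecore."""
    remaining = dram_size - (bootmem_size + movablecore_size)
    if remaining < 0:
        raise ValueError("bootmem and movablecore exceed installed RAM")
    return remaining

# Values from the example: 12 GB installed, mem=2G, 5 GB movablecore.
print(kernelcore_size(dram_size=12, bootmem_size=2, movablecore_size=5))  # 5
```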

Result: reduced boot time

We have implemented this feature as part of a deferred mechanism in the Qualcomm SM8550 with 12 GB of DDR RAM. By limiting to 2 GB during boot, as described in the examples above, we’ve measured 160-200 milliseconds in savings from adding 8 GB in a deferred fashion. That's roughly 20 to 30 milliseconds of boot time reduction per gigabyte of RAM. The savings are linear, so on systems with more RAM, you save even more time.

We’ve measured the results of this feature by profiling paging_init and bootmem_init, and found reductions in the time to load the kernel and kernel modules. Our scripts capture the overall boot performance of every phase: from bootloader initialization through kernel initialization, kernel module loading and the native daemons, and then the userspace initialization phase that includes Android services and UI apps.

Limitations and observations

We’ve obtained these results with no performance impact, because there's no process or task that needs to use the entire 12 GB during boot. Still, consider these limitations and observations:

The measurements show that the times for paging_init, bootmem_init and mm_init are reduced when you initialize with less memory. Note, though, that those functions are executed before time_init, so you can’t use kernel time to measure the performance indicators. Our measurements rely on other hardware time stamps and clocks.
As mentioned, this feature requires memory hotplug support, which is present in Arm-compatible and x86 architectures.
Although you can initialize the remaining DDR RAM in a deferred fashion, you won’t necessarily achieve full parallelism. For example, if you have 5 GB of memory to be added to the normal zone, you cannot split it into 5 x 1 GB segments, give them to different CPUs in different threads and expect them to run in parallel. Adding memory takes the zone lock, and other threads trying to add memory to the same zone must wait until that lock has been released. On the other hand, if you're adding memory to a different zone (or a different node), then you can achieve parallelism. An example would be one CPU adding memory to the movable zone and another CPU adding memory to the normal zone.
Our examples assume that 2 GB suffices to boot up the kernel or userspace on a system with 12 GB RAM. We have seen that 2 GB does not always suffice. More mature Android versions and OEM versions usually require 4-5 GB to boot up. The figure varies for each generation of chipset or system, so be prepared to experiment.
The feature demands robust code and error handling. When you boot the system with limited memory and something goes wrong with your deferred memory code, there's no way to initialize the remaining (e.g., 10 GB) memory. Your users will be limited to 2 GB of DDR RAM forever. Since the kernel has most of the responsibility here, it should check whether the rest of the memory is being added to the system or not. Has userspace added it correctly? If not, then the kernel should take over and add the remaining memory using whichever method is available.
There is some chance that this feature could interfere with tasks that run early during the boot sequence. In our measurements, we saw no performance impact on any running tasks once userspace is up.
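The error-handling point above (has userspace really added everything?) comes down to comparing the expected deferred range with what is actually online. The following is a sketch of that check as a Python model, not kernel code. The function name, the 128 MiB block size and the set of already-added block addresses are all assumptions for illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def missing_blocks(boot_end, dram_end, added_blocks, block_size=128 * MIB):
    """Return start addresses of deferred memory blocks that were never added.

    added_blocks is the set of block start addresses that are already online.
    """
    expected = range(boot_end, dram_end, block_size)
    return [addr for addr in expected if addr not in added_blocks]

# Pretend userspace added everything except the last two blocks.
added = set(range(2 * GIB, 12 * GIB - 2 * 128 * MIB, 128 * MIB))
todo = missing_blocks(2 * GIB, 12 * GIB, added)
print([hex(addr) for addr in todo])
```

A fallback path would then add whatever this list contains, using whichever method is available.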

Our upstream path – watch this space

We have not yet upstreamed this feature. We’ve had long discussions about the exact savings in boot time and the degree of parallelism achieved, and we’ve concluded that the benefits are there. We plan to upstream the kernel method, because if developers or OEMs implement their own userspace method, then it's not an end-to-end solution, and we would have to rely on how each of them implements the userspace side.

Our plan is to upstream the kernel method after correcting the limitation I mentioned, so that the kernel can add memory to a particular zone with the same flexibility that userspace has. That entails its own limitations, because our API must ensure that calls that would create discontinuous zones (e.g., normal, then movable, then normal) are not honored. We’ll look for comments from the upstream community on our approach.
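One way to picture the discontinuous-zone rule is a check over the ordered list of zone requests. This is a hypothetical sketch in Python, not the planned API: it rejects any sequence in which a zone reappears after a different zone has been requested.

```python
def zones_are_contiguous(requests):
    """Return True if no zone reappears after a different zone was requested.

    requests is an ordered list of zone names, lowest address first.
    """
    finished = set()
    previous = None
    for zone in requests:
        if zone != previous:
            if zone in finished:
                return False
            if previous is not None:
                finished.add(previous)
            previous = zone
    return True

print(zones_are_contiguous(["normal", "normal", "movable"]))  # True
print(zones_are_contiguous(["normal", "movable", "normal"]))  # False
```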

Finally, we plan to implement this as a separate kernel driver that combines the three command-line parameters:

mem= to limit the memory during boot
deferred_mem.kernelcore=nn% to set the percentage to add to the kernel zone
deferred_mem.movablecore=nn% to set the percentage to add to the movable zone

Instead of relying on the userspace implementation, the kernel driver can be enabled or disabled using a config or a command-line parameter.

And, for more details, see my presentations from Embedded Open Source Summit 2024, with links to slide decks and video:

Deferred Memblocks Init for Boot Time Reduction
