The memory allocator for RDM pre 14.2 was designed primarily for flexibility rather than speed. In certain cases, the performance could drop dramatically. One feature that we have dropped from this new memory allocator is the capability to allocate memory using a handle and then later free all that memory associated with the same handle at once. While this feature made it easy to avoid hard memory leaks (memory that is not freed upon termination), instead we would often have soft memory leaks (memory that is not used after a certain condition but eventually will be freed).
In RDM 14.2, we designed a new memory allocator with the goal of solving the problems we had with the old memory allocator. It is a completely new design optimized for speed at the cost of having to rewrite a lot of our code that needed dynamic allocations. For RDM 14.2, we have also redone many subsystems in such a way that we no longer do memory pooling, making sure to use memory from the system stack, or use APIs that have the object statically allocated within other objects.
The main features of the memory allocator are, from a top level view:
- Allocations and frees are done using handles
- Memory allocated must be explicitly freed
- Most memory allocations and frees are of order of execution O(1)
- There are several alternative implementations
- A master allocator is handling allocations of 4K pages
- Sub allocators are handling allocations smaller that 2K
- A stack allocator
- System stack allocator
Our memory allocators have a number of alternative implementations that can be selected at compile time. Some of these implementations differ in performance characteristics, while others are used for debugging and QA purposes.
For instance, we have an implementation that sits on top of malloc and free that can be used with memory profile tools for finding issues including buffer overruns, memory leaks, and thread issues. Purify (Windows), Memcheck (Unix), Helgrind (Unix), and efence (Unix) all work with this implementation. We also have an implementation that is integrated with our test framework for doing memory failure simulations. Lastly, there is also an implementation that can profile allocation sizes. These three implementations will not be discussed any further in this document.
Use of handles
The new allocator uses handles similar to the old allocator. The big difference is that the new allocator is completely reentrant. The old allocator had a concept similar to our sub allocator, but there were no master allocators. The old allocator used some global state, where how some of the different subsystems were set up and shut down was messy and not easy to review. The new allocator and a redesign of our platform support layer (PSP) have made this much easier.
Because of this new allocator design, two instances of RDM can now be embedded into one process or address space without causing the allocations for one to affect the allocations of the other. Both instances instantiate their own master allocator, and those handles are used directly or indirectly for each subsequent allocation. This is also what we do in our QA test framework. One master allocator is used in the product, and another one is used in the QA test framework. This allows possible memory issues in the QA test framework to not affect the product under testing or vice versa and memory profiling can be done even for tests using the QA test framework.
Two types of allocators can be retrieved from the master allocator: the sub allocator and the stack allocator. These allocators also return handles, and their implementations will get their memory from the master allocator.
The new memory allocators have been designed with threading in mind. The master allocator is the only allocator that can be used from multiple threads. The sub allocators do not have built-in thread protection. Each sub allocator should normally only be used from one thread at a time. That memory allocations and the use thereof is local to one thread is also assumed to help with locality.
One exception is when a sub allocator deals with allocations larger than 2K. In this case, it is safe for it to be allocated and freed from several threads at the same time since those allocations are designed to be handled by the master allocator.
Paging and swapping pages to disk
For systems with virtual memory, it is important to avoid spreading allocations randomly across more memory pages than necessary. This is done by having the master allocator split the memory into pages that are exactly aligned to 4K. The master allocator keeps metadata about the pages it administrates separately. That way, a page that is being freed does not need to be touched. In the case where the page has been paged out, it will stay paged out. A sub allocator will also be using a number of pages exclusively. This approach is assumed to help avoid overuse of virtual memory (albeit just a tiny bit).
The master allocator aligns memory allocations to 4K, and the sub allocator has the capability to align data to 64 bytes (depending on the profile compiled). This ensures that allocations cross as few 64 byte cache lines as possible. Some of our algorithms rely on this so we can guarantee at most one cache miss.
Mitigating memory leaks
In the case there are memory leaks, asserts in the code will trigger when in debug mode. In release mode, any allocations derived from the master allocator will be freed when a sub allocator, stack allocator, or the master allocator is destroyed.
The Master Allocator
The master allocator as we have outlined above is responsible for allocating pages. When a number of pages have been allocated, only the handle to the master allocator and a pointer to the start of the memory is needed for freeing the pages.
The master allocator requests large blocks of memory from the OS. Each of these large blocks has a header. For each 4K page, there is a corresponding piece of memory in the header that is used to administer the page. If the page is the head or the tail of an allocation or a free memory region, there is information about this as well as its size. This information is used as pages are freed, where we can efficiently combine memory regions into larger regions. The invariant for this is that there are never two free memory regions next to each other, and that all the memory outside the header is either free or allocated with information for each head page,and each tail page for each allocation and free region showing such. All other pages between a head and a tail are marked as not being a head or a tail.
The corresponding piece of information in the header also has space for a self-balancing binary search tree (AVL) node that is used to maintain an AVL of all the free memory regions sorted by size. Using this AVL, we can find a fitting size with an order of O(log n) where n is the number of free memory regions. Since we combine free memory into regions, and we always use the smallest free region that fits for an allocation, it is assumed that n is small compared to the number of pages.
Allocating and freeing pages only uses memory in the header, and is therefore assumed to be cache and especially paging efficient. If the system is frequently allocating and freeing pages, the corresponding part of the header will be hot and is not likely to be dropped from the CPU cache or paged out. However, if the system becomes static with respect to page allocations, those pages may be dropped out of the CPU cache or even paged out.
With the above considerations in its design, our page allocator is assumed to be very efficient. However, some use cases could lead to fragmentation. It may be an idea to keep allocations that are more static separate from those that are more dynamic. On a virtual memory system where paging happens regularly, this is less of a concern since it only leads to some pages not being used and they may end up being paged out (and never paged in again).
The Sub Allocator
The sub allocator as outlined above is responsible for allocating and freeing memory for a single thread or task. A sub allocator is not directly associated with any thread or task, but we make sure that separate threads in RDM use their own sub allocator.
There are three main types of allocations handled by the sub allocator:
- Allocations smaller than about 2K
- Allocations that can be handled by the master allocator
- Allocations that need to be passed to the OS
The main focus of this section is on the first case above. But let’s first discuss the two other cases briefly.
Allocations handled by the master allocator
Every sub allocator, when instantiated, will be associated with a master allocator. If an allocation is larger than what the sub allocator can handle itself, it may be passed to the master allocator. The master allocator will return a number of 4K pages that are page aligned. Since these allocations are always aligned to a page, this is how the sub allocator knows that this allocation was handled by the master allocator, and when this is later freed, it can be handed directly to the master allocator.
Allocations passed to the OS
If an allocation is too big for the master allocator (by default about 650K), the allocation will be handled by the OS. In this case, a special header is reserved. When this allocation is later freed, the special header tells the sub allocator to hand it directly to the OS.
The Core part of the Sub Allocator
Now we are ready to go into the details of the sub allocator. There are three main implementations in addition to what we have mentioned before. The actual implementation to use can be selected at compile time. First, we will discuss some characteristics common to all three implementations.
As mentioned above, the sub allocator is instantiated from a master allocator. It will get a number of pages from the master allocator. One page is requested when it is first instantiated. One additional page is requested each time it gets an allocation request that it cannot fit into the pages it already has. The pages allocated from the master allocator are never turned back to the master allocator except when the sub allocator is destroyed.
The core part of the sub allocator only handles particular sizes. There are typically two sizes for each power of two and the requested size will be rounded up to the nearest supported size. Exactly which sizes are supported is determined by the profile used. The profile compiled in by default is optimized for cache performance, meaning no allocation will cross more 64-byte boundaries than needed.
Each page administered by the core part of the sub allocator is reserved for one particular size. This allows for very little overhead as no header is needed for each individual allocation. When memory is freed, the page header has all the information needed to efficiently administer the memory.
Allocation of fixed sizes
Some of our algorithms do small fixed size allocations. We used to use memory pools, but as it turned out, our new allocator is about as efficient as the memory pools we used before. By also doing some code inlining using macros, our new memory allocator outperforms the old implementation with the memory pool. Using the new memory allocator also has the advantage that memory does not need to be allocated for a specific purpose or how much memory needs to be reserved for it.
There are two main disadvantages to this approach.
First, for applications that allocate many chunks of memory of one size, free them, and then allocate many chunks of memory of a different size. It is assumed that this could be the case to a certain extent for RDM for certain use cases but it is believed to not outweigh the benefit of this particular approach. The described approach allows for very efficient algorithms.
Second, having each page only handle allocations of a particular size means that we will waste some memory. In the long run, where only allocations are done of varying sizes on average,2K will be wasted for each size supported. For the default profile, this accumulates to 36K. This might be a problem for applications using very little memory but hardly any issue for applications that use a relatively large amount of memory.
We could allow different sizes on each page to mitigate the above issues. However, that would require more information to be stored in the header, and possibly require a header for each allocation. This would also increase space overhead, and it is believed to be very hard to come up with an algorithm that is as efficient as our current allocator, especially since the space utilization and performance are superb compared to the old allocator.
The Three Algorithms
Three alternative algorithms will be described in the following subsections.
Sub allocator that uses a list and a bitmap
This approach is assumed to be the best approach with respect to cache performance. Let’s dive into how it works.
A list is maintained for each supported size of pages where we might have free space. We say “might” here, because when we allocate memory on a page, we do not check whether we are using the last free space on that page. Instead, we remove the page from the list later when we find that we cannot allocate from this page. This approach is slightly more efficient compared to always doing this check and removing it early.
Each page has a header, which contains a bitmap that tells us which part of the page is free with respect to the actual size this page supports. We have a general algorithm to look up in this bitmap, but we also have implementations specific to platforms that can take advantage of hardware optimizations.
Since this algorithm does not touch the memory that is being freed, it is assumed that this should be most friendly to cache performance. However, it requires code that uses this algorithm to not touch the memory when it is about to free the memory in order to take full advantage of it. Also, some early profiling of our product has shown better performance with the last algorithm presented below.
This algorithm is also assumed to be good for paging since we often allocate from pages that in general have been accessed recently.
Sub allocator with list of lists
This algorithm is similar to the previous algorithm except that we use a list instead of a bitmap within the page. We use the memory location that is being freed to hold a pointer to the next free location within the page. This algorithm is assumed to be suboptimal for CPU cache performance but is probably faster when there is no hardware support for lookup in the bitmap.
This algorithm is assumed to be as good for paging as the first algorithm.
Sub allocator with a single list
This is similar to the previous algorithm except that we, for each supported size, have a single list that reaches all the free locations. This algorithm seems to be the most efficient one according to some early profiling we did of RDM. It is the default algorithm used. It is assumed to not be the best for CPU cache performance and is the worst for paging. This algorithm probably beats the other algorithms due to its simplicity. However, this may not be universally true. We suspect that there might be differences based on the CPU architecture or application. In cases where the application is expected to page a lot due to overuse of virtual memory, this algorithm is expected to perform worse.
RDM has two types of stack allocators.
One of them is mainly designed to use the system stack. Allocations are done without a handle. This allocator should only be used for limited size allocations and a maximal size has to be given in the source code. The exact approach, including whether or not these allocations actually use the system stack, is platform dependent.
The other stack allocator is an implementation that uses the sub allocator in its implementation. This stack allocator uses a handle similar to the sub allocator. The main advantage of this type of stack allocator over the sub allocator is that it avoids administering smaller allocations, and a chunk of memory can be freed by freeing something further down on the stack.