“Copy on write” is a technique used in a wide variety of applications in computing. RDM uses this technique in a couple of areas. Before diving into RDM, let’s take a look at the general case.
The main idea behind copy-on-write is to copy some content and then modify the copy instead of modifying the original when writing to memory, a file, or a block device. That’s where the term copy-on-write came from. Some applications are copy-on-write in functional programming in combination with “reference counting.” In Unix, copy-on-write is used in sharing the virtual memory in the implementation of the fork system call with some assistance from hardware. The Btrfs file system in Linux uses copy-on-write to allow efficient transaction handling and snapshot isolation. This technique is also used by hypervisors and virtual machine implementations for efficient memory management.
In RDM, we use this technique for nested transactions. However, this article has its focus on the database file format implemented in RDM 15.0. First, let’s take a look at the previous versions of RDM.
Previous versions of RDM
RDMe 12.0 and older (“RDMe”) used a transaction log. This meant that data was typically written twice – first to the transaction log and then to the actual database files. RDMe also stored items with fixed length, and in some cases, additional content had to be stored separately elsewhere. This design resulted in wasted space, incurred an extra overhead when the content had to be split up, and was in general not very flexible. Because of the fixed length records, it was also not useful to do any type of compression on the data.
To mitigate these issues, RDM 14.0 implemented variable size records with updates using “copy on write.” The database files were split into what we termed “pack files.” An update of the pack was always done with doing “copy on write.” However, RDM 14.0 did not do a full recursive “copy on write.” Instead, it relied on a key-value store (ID-index) whenever content was moved from one location in the pack to another location.
This design had its own issues. The ID-index kept track of unused space. This tracking and the data structures needed to efficiently utilize the unused space turned out to be quite expensive. It also prevented reuse of space in combination with bulk write. The I/O performance was far from optimum, either, as data was written into pages when only a fraction of the page was updated.
This design also needed a transaction log for the ID-index, which was integrated into the pack for RDM 14.1. Flushing the ID-index was also relatively expensive, which resulted in potentially slow database open and close. These issues made this design unsuitable for applications like the REST-API, which relies on opening and closing the database for every single request.
RDM 14.2 took a slightly different approach from RDM 14.1. In the following section, we will mainly discuss the 15.0 design which shares many characteristics with the 14.2 design, as well as some characteristics with the 14.1 design.
RDM 15.0 is built on two principles.
The first principle we rely on is full recursive “copy on write.” This means that the persistent data structure used in RDM 15.0 is a tree (or a tree of trees depending on the level of abstraction) without use of an ID-index. Two tree structures are used: B-trees and R-trees. The database will have one tree for each key defined in the schema, but there are other trees needed as well. The details of this are beyond the scope of this article. To conclude this discussion, we mention that to tie all of these trees together, there is also a main tree which has references to the root nodes of the other trees.
The second principle we rely on is always appending to the end of the last pack file.
Therefore, pack files are never updated in the middle. This means that expensive updates in the middle where only a small fraction of a page is updated are avoided, and there is also no need to manage free space. Space that is no longer referenced simply ends up not being accessed.
Appending and Writing Changes “Bottom Up”
There are quite a few implications that follow from these two principles. One implication that naturally follows is that all updates must be done by appending them to the end of the pack. RDM commits a transaction by first writing new and updated items that do not have direct or indirect references to other updated items, followed by tree nodes that reference them, their parent tree nodes, their grandparent tree nodes, and so on until the root node of the main B-tree has been written.
This design means there will be additional nodes that must be written even though there were otherwise no changes to them. This is typically not a concern for large transactions, where multiple updates are more likely to occur on the same B-tree node. On the other hand, the extra overhead may be substantial for small transactions with just a few changes. The best performance can be expected for use cases that involve time series or circular tables.
With recursive “copy on write,” snapshots can be implemented very efficiently. When a snapshot is requested, the current root node of the main B-tree can be used by the client requesting the snapshot and its view of the data is derived from what can be reached from this node. The only difference between a snapshot and a read using read locks is that a client requesting a read lock will block updates to the tables on which the read locks are requested, while a snapshot will not. No other additional expensive internal data structures are needed to handle the snapshot. The client requesting the snapshot can simply handle requests for content from the Transactional File Server (TFS) exactly how it would handle a normal read with locks. It gets a main B-tree root node and based on the references therein, it requests other nodes from the TFS.
As a result, the pack file format utilizing copy-on-write is well suited for snapshots. When a snapshot is requested, old data can be referenced in the pack. From the pack file implementation standpoint, we just need to ensure that pack files that have data visible to a snapshot are not deleted.
Since data is only appended to the end of the last pack file, whose file format is carefully designed, durability can be achieved in most cases simply by syncing the last pack file. This has some performance benefits compared to other databases that need to sync several files, including the older versions of RDM.
It turns out that most syncs can be omitted if durability is not required. A system crash may cause some data loss in this case. However, since we are only appending to a file, if a main B-tree root node is found, all the data referenced from this root node should be valid. This assumes a file system implementation that does not lose blocks in the middle of a file when only appending occurs to that file. There is one precaution RDM must take for this to work: Any time a new pack file is requested, the previous pack file has to be synced (and on Linux and Unix, we also have to sync the directory). This to make sure that any reference from one pack file to a previous pack file should not be lost.
We have also implemented an unsafe mode, where we do not sync pack files at all. This mode should only be used if losing data in the database is not a concern or it can be guaranteed that the operating system does not shut down or crash without syncing files. It is OK for the database to crash since the content written will be in the file system cache. Note that this may not be the case with some embedded systems, where the file system is implemented in user space.
For recovery, the pack file format has been crafted to make it possible to read the pack file both forward and backward. For every 64KiB, there is special recovery information that makes it possible to scan the file forward and backward without having to read the whole file from the beginning. This property is important for efficient and reliable recovery. Recovery does not write anything to the pack but instead finds the last main B-tree root node and verifies the end of the pack. If a database is found to be in a certain incomplete state, a new pack file will be created. This simply means that crashes may leave contents in the pack files that are not directly or indirectly referenced by the main B-tree root node.
The pack file format allows multiple update transactions to be active in parallel, but when they commit, appending to the pack file will alternate among those that are currently committing. A small transaction may be able to append all its data in one chunk while larger transactions will need to take turns writing their data. The main B-tree needs to be written exclusively and each time that is done, changes from other update transactions or the vacuumer (a garbage collector, we cover later) have to be integrated with the changes for the transaction at hand.
The old design with the ID-index as well as implementations based on fixed size items required data for a given table that was in the runtime cache to be purged any time we had another client making updates to the same table. With the RDM 15.0 design, we don’t need to purge any data. The runtime cache always looks up data using the pack address and is therefore guaranteed to find the right data if it is cached. This means that data may be able to stay in the runtime cache for a longer time, potentially increasing performance.
Parallelism and Direct Read and Writes
The old design with the ID-index did not allow operations to run in parallel. The RDM 15.0 design that utilizes copy-on-write allows most operations from different clients to be run in parallel. The file format even allows most file operations to be done in parallel from different processes on the same machine reading and writing to the pack files directly without communicating to the TFS. This approach yields greater performance.
While the RDM 15.0 design has many advantages as described above, it has one major drawback. Since we only append to the end of the pack, it will grow indefinitely as a result. To address this issue, we execute a garbage collection process termed “vacuuming.” Vacuuming will be discussed in a separate article.
We have seen that the file format for RDM 15.0 is flexible and well suited for snapshot isolation, parallelism, low memory footprint, and fast recovery. We suggest you continue reading about snapshot isolation and vacuuming. Enjoy!
by Sverre Hvammen Johansen and Daigoro Toyama