A new era of mass storage devices
One of the main reasons for database vacuuming is storage: not only to manage capacity but also to extend the lifetime of the devices. Protip: If you already know the difference between HDDs and SSDs, jump to the last part to see how we succeeded with our vacuuming technology.
Hard disk drives (HDDs) have ruled the market for many years. More recently, a new technology has emerged with solid-state drives (SSDs). Both technologies have seen tremendous development, with improvements in speed, reliability, capacity, and price. As SSDs become cheaper, they have been replacing HDDs in many applications.
How HDDs work on a conceptual level is well understood by most developers but for SSDs there are a lot of misconceptions. We will therefore give a brief introduction to both technologies to set the scene for the following discussion.
Hard Disk Drives (HDDs)
HDDs store data on rotating magnetic platters using read/write heads mounted on actuator arms. Because of these moving parts, reads and writes on HDDs have high latency compared to integrated circuits (ICs). Depending on the actual HDD model, data may not be written immediately but instead held in the controller to be written later. This approach improves write performance at the cost of durability. To minimize the impact on durability, HDDs are often equipped with a large capacitor or a battery backup, making it possible to write the data, or preserve it in volatile memory, in the event of a power outage.
Because of how HDDs work, data is always read and written in units of pages. Modern HDDs have a page size of 4,096 bytes (or 4 KiB) and the actual layout of the data on the platter is not directly visible to the host.
Solid-State Drives (SSDs)
SSDs store data on special ICs without any moving parts, providing lower read and write latency than HDDs. These ICs are assembled in a particular way for SSDs to be cost effective, and they will be referred to as flash memory from here on. The description here is simplified and not entirely accurate, as the details also depend on the actual manufacturer and model of the SSD.
SSDs use the same page size as modern hard drives, even though the overhead of accessing a page is almost negligible compared to that of an HDD. Reading a small number of bytes from an SSD is much more efficient than from an HDD. The discussion here assumes the page size presented to the host is the same as the page size of the flash memory. As with HDDs, the actual layout of the data in the flash memory is not directly visible to the host.
The flash memory is organized as a number of sectors and each sector has a number of pages. Each page can be written a limited number of times and before the page can be updated, the whole sector where the page resides has to be erased. This scheme has some ramifications.
When data is updated, an SSD writes it to a “fresh” page (copy on write), that is, a page that has not been written to since the last time its sector was erased. This write marks the page “in use.” The page that holds the old data is said to be “stale.”
In addition to normal writes, pages in use are regularly copied from sectors with many stale pages to sectors with fresh pages. A sector with all its pages marked stale is erased, making all its pages fresh again. The net effect is more fresh pages available for normal writes. Erasing a sector can be an expensive operation and is often the bottleneck for write performance. This process is referred to as garbage collection.
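The copy-on-write and garbage-collection scheme described above can be sketched with a toy model. All names, sizes, and the placement policy here are invented for illustration; real flash translation layers are far more sophisticated:

```python
# Toy model of SSD copy-on-write and sector-level garbage collection.
# Sector and page counts are illustrative, not from any real drive.

SECTORS, PAGES_PER_SECTOR = 4, 4

class ToySSD:
    def __init__(self):
        # Each physical page is 'fresh', 'in_use', or 'stale'.
        self.state = [['fresh'] * PAGES_PER_SECTOR for _ in range(SECTORS)]
        self.mapping = {}   # logical page -> (sector, page)
        self.erase_count = 0

    def _find_fresh(self, exclude=None):
        for s in range(SECTORS):
            if s == exclude:
                continue
            for p in range(PAGES_PER_SECTOR):
                if self.state[s][p] == 'fresh':
                    return s, p
        raise RuntimeError("no fresh pages left; garbage collection needed")

    def write(self, logical, exclude=None):
        # Copy on write: mark the old physical page stale and
        # place the new data in a fresh page.
        if logical in self.mapping:
            s, p = self.mapping[logical]
            self.state[s][p] = 'stale'
        s, p = self._find_fresh(exclude)
        self.state[s][p] = 'in_use'
        self.mapping[logical] = (s, p)

    def garbage_collect(self):
        # Relocate live pages out of the sector with the most stale
        # pages, then erase it, turning all of its pages fresh again.
        victim = max(range(SECTORS), key=lambda s: self.state[s].count('stale'))
        for logical, (s, _) in list(self.mapping.items()):
            if s == victim:
                self.write(logical, exclude=victim)
        self.state[victim] = ['fresh'] * PAGES_PER_SECTOR
        self.erase_count += 1
```

Updating logical pages that were written together makes a whole sector stale at once, which is exactly the cheap case for garbage collection: the erase requires no relocation of live pages.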
More efficient garbage collection is possible when more sectors have a lot of stale pages. There are in principle three types of data changes: data creation, data updates, and data deletion. Traditional hard drives do not deal with data deletion, as deleted pages are simply ignored by the file system. A solid-state drive can perform garbage collection more efficiently if it knows about these ignored pages. This is implemented by the ATA command TRIM or the SCSI command UNMAP.
There are many other aspects of SSDs that need not be mentioned for the following discussion. Please see Wikipedia for further details.
Traditional databases use data structures that minimize the number of pages read. This approach is crucial when HDDs are used, as the time it takes to retrieve the data is proportional to the number of pages read. It is also true for SSDs to some extent, since the I/O channels are optimized for HDDs, but retrieval can be orders of magnitude faster than on an HDD.
Databases normally implement durability (one of the properties of ACID) using a transaction log, which may or may not include the actual user data. The pages written for the transaction log and the pages written for the updated database image are likely to end up on the same sector, since they are written around the same time. This is especially true for small transactions. The transaction-log pages will normally become stale before the recently written database-image pages. Those sectors may therefore end up with some stale pages that take up space, or they may soon reach the threshold for garbage collection. Performance may decrease in either case.
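The effect can be illustrated with a toy layout model. Page placement is simplified to sequential fill, and the transaction pattern is invented for the example; real controllers and logs are more complex:

```python
# Toy layout model showing why small transactions that interleave
# transaction-log pages with database-image pages leave sectors only
# partially stale once the log is truncated.

PAGES_PER_SECTOR = 4

writes = []
for txn in range(4):               # four small transactions
    writes.append(('log', txn))    # transaction-log page
    writes.append(('data', txn))   # database-image page

# Sequential placement: consecutive pages land in the same sector.
sectors = [writes[i:i + PAGES_PER_SECTOR]
           for i in range(0, len(writes), PAGES_PER_SECTOR)]

# After the log is truncated, every 'log' page is stale. Each sector
# then holds a mix of stale and live pages and cannot be erased
# without first relocating the live 'data' pages.
mix = [(sum(k == 'log' for k, _ in s), sum(k == 'data' for k, _ in s))
       for s in sectors]
```

Every sector ends up half stale and half live, which is the expensive case for the SSD's garbage collector.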
We took a different approach with RDM 14.1. On the file system level, the database consists of two parts: a key-value store, called the ID-index, and a pack that holds user data, indexes, and metadata needed for recovery and vacuuming.
This article will not provide in-depth discussion about the ID-index except to point out some of its characteristics that can be observed by the file system. Essentially, the ID-index is an in-memory data structure that is only occasionally written to disk. The ID-index writes neighboring pages together; therefore it is assumed that a lot of those pages end up on the same sector in the case of an SSD. The next time the ID-index writes the same pages, the old pages will be made stale, which is likely to trigger garbage collection since several pages were made stale at the same time. For the best I/O performance of the ID-index, it is important that it is not flushed often.
The File System Perspective of the Pack
This section describes some characteristics of the pack as observed by the file system and how they affect the I/O performance of the SSD.
A pack consists of a number of pack files, each with a maximum size of 2 GiB. Pack files are written sequentially by appending to the end. The total size of the database files (ID-index plus pack) is held below a threshold. When pack files are deleted, they are always deleted in the order they were created.
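This pack-file discipline can be sketched as follows. The class name, file names, and size numbers are invented stand-ins (100 bytes for 2 GiB, and a made-up total threshold), and deletion here is simplified to a size cap rather than RDM's vacuuming-driven process:

```python
# Sketch of the pack-file discipline: append-only files, a per-file
# size cap, and deletion strictly in creation order.

MAX_PACK_SIZE = 100   # stand-in for 2 GiB
TOTAL_LIMIT = 250     # stand-in for the database size threshold

class PackSet:
    def __init__(self):
        self.packs = []      # oldest first: [name, size]
        self.next_id = 0

    def append(self, nbytes):
        # Writes only ever go to the end of the newest pack file.
        if not self.packs or self.packs[-1][1] + nbytes > MAX_PACK_SIZE:
            self.packs.append([f"pack-{self.next_id:04d}", 0])
            self.next_id += 1
        self.packs[-1][1] += nbytes
        # Keep the total size under the threshold by deleting the
        # oldest pack files first (simplified: real deletion waits
        # until a file's data is no longer in use).
        while sum(size for _, size in self.packs) > TOTAL_LIMIT:
            self.packs.pop(0)
```

Because writes never touch older files, older pack files become read-only and their pages settle into stable sectors on the device.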
Since pack files are written sequentially, the pages of a given pack file are likely to be clustered on the same sectors, especially under a large transaction load, compared to other I/O loads. When such clustering holds, these sectors are not likely to be subject to garbage collection until the pack file is deleted.
If this assumption does not hold, the database load may have to be kept on a separate partition or a separate SSD to achieve the performance needed. These details are beyond our control, but from a database design perspective, this is the best we can do to achieve good I/O performance on an SSD.
Vacuuming in Raima Database Manager (RDM)
The pack implementation will be discussed in this section. We will start with observations of previous versions of RDM and present the RDM 14.1 design. If you want to dive into the technical stuff, you can learn more in our documentation.
Prior versions of RDM
RDM 12.0 and older (“RDMe”) used a transaction log. This meant that data was typically written twice: first to the transaction log and then to the actual database files. RDMe also stored items with fixed length, and in some cases additional content had to be stored separately elsewhere. This design resulted in wasted space and extra overhead when content had to be split up. It also made any kind of data compression impossible.
RDM 14.0 was the first version of RDM to use the ID-index and pack format for data management, as opposed to the traditional transaction log and fixed-length data slots. However, it reused space in the pack files. An update of the pack was always done using copy on write, from the original page to one that had available space. The ID-index kept track of this unused space. This tracking, and the data structures needed to utilize the unused space efficiently, turned out to be quite expensive. It also prevented reuse of space in combination with bulk writes. I/O performance was far from optimal, too, as whole pages were written when only a fraction of a page was updated.
RDM 14.1 addresses all of the above issues by taking a completely different approach. It still does copy on write but never reuses the unused space in pack files. Instead, it always writes to the end of a pack file, thereby avoiding writes to pages with only partial space available.
The ID-index in RDM 14.1 does not keep track of unused space. Instead, it keeps track of the total amount of used space, one number for each pack file. This overhead is negligible compared to that of RDM 14.0. The information is used to identify pack files with little data in use; the pack file with the lowest “in use” is subject to a process called “vacuuming.” Vacuuming is similar to the garbage collection discussed earlier, except that it operates on pack files instead of SSD sectors. Done right, vacuuming improves the SSD's garbage collection considerably.
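Victim selection can be sketched in a few lines. The function name and the data layout (one file size and one bytes-in-use counter per pack file) are invented for this example:

```python
# Sketch of choosing the vacuuming victim: one "bytes in use" counter
# is kept per pack file, and the file with the lowest in-use ratio is
# the cheapest to vacuum (least live data to copy forward).

def pick_vacuum_victim(packs):
    """packs: dict of name -> (file_size, bytes_in_use)."""
    return min(packs, key=lambda name: packs[name][1] / packs[name][0])

packs = {
    "pack-0001": (2048, 1900),  # mostly live: expensive to vacuum
    "pack-0002": (2048, 300),   # mostly dead: cheap to vacuum
    "pack-0003": (2048, 1200),
}
```

Vacuuming the low-ratio file reclaims the most space per byte copied, which is the same intuition an SSD's garbage collector applies to sectors.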
Vacuuming has two main objectives that are weighed against each other: decreasing the total amount of disk space a database occupies and decreasing the number of pages the database resides on. Optimizing for one works against the other. RDM 14.1 vacuums the pack file with the lowest relative “in use,” which favors the second objective. Only when the oldest pack file is down to no space in use is it actually deleted; pack files are never deleted otherwise. This also has the benefit of not having to copy the metadata for deleted data as part of vacuuming, further reducing the cost of vacuuming.
Vacuuming of the oldest pack file is explicitly triggered to keep the total database size below a threshold. Setting this threshold too low may hurt the second objective thereby degrading disk I/O performance. The threshold should be set according to the data expected.
The net result of this approach is that data may be written several times, as in RDM 12.0 and older, but the difference is that writes in RDM 14.1 are clustered and, with a proper vacuuming threshold, yield excellent I/O performance and make garbage collection in the SSD very efficient, as discussed previously.
Pack File Format
The pack file format is important for efficient vacuuming. First, a couple of observations. In principle, vacuuming can be done in two ways: clustered by the slot ID in the ID-index, or clustered by the old position in the pack file.
For the purpose of vacuuming, it should be fine to sequentially read the pack file subject to vacuuming as long as it is done only once, for the following reasons. First, writes are more expensive than reads, and any data has to be written to the file in the first place. Sweeping the file once later should only slightly affect the efficiency of our algorithm, and we believe the additional cost is less significant than that of alternative methods that avoid sweeping. Second, we expect to copy a fraction of the data, and that part has to be read in any case. To optimize for the case where only a low percentage of the data needs to be vacuumed, the metadata and the user data are clustered separately. This will minimize the number of pages read for vacuuming.
Furthermore, the pack file format has been crafted so that a pack file can be read both forward and backward. Every 64 KiB, there is special recovery information that makes it possible to scan the file in either direction without reading the whole file from the beginning. This property is important for vacuuming, for a complete rebuild of the ID-index, and for efficient and reliable recovery. This article will not discuss recovery except to mention that the pack file contains everything needed for recovery after a crash, and on database open in case the persistent ID-index was not flushed when the database was last closed.
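The bidirectional-scan idea can be illustrated with a toy framing scheme in which every record carries its length at both ends. This shows only the principle; RDM's actual format and its 64 KiB recovery information are more involved:

```python
# Toy record framing that supports scanning in either direction:
# each record is <length><payload><length>, so a reader positioned at
# either end of a record can find the other end.
import struct

def pack_record(payload: bytes) -> bytes:
    n = struct.pack("<I", len(payload))
    return n + payload + n          # length, data, length

def scan_backward(buf: bytes):
    # Walk from the end of the buffer to the start, using the
    # trailing length field of each record to step over it.
    records, pos = [], len(buf)
    while pos > 0:
        (n,) = struct.unpack_from("<I", buf, pos - 4)
        records.append(buf[pos - 4 - n: pos - 4])
        pos -= n + 8                # payload + two 4-byte lengths
    return records
```

A forward scan would use the leading length field in the same way; neither direction requires reading the file from the other end first.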
The pack file format utilizing copy on write is also well suited for snapshots. When a snapshot is requested, old data can be referenced in the pack. From the pack file implementation standpoint we just need to ensure that pack files that have data visible from a snapshot are not deleted.
Since data is only appended to the end of a pack file, durability can in most cases be achieved simply by syncing one file. This has performance benefits compared to other databases that need to sync several files, including older versions of RDM. The file format also allows most syncs to be omitted if durability is not required. In that case, a system crash may lose some transactions that were not properly written to disk, but recovery will still bring the database back to a consistent state.
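The append-and-sync pattern can be sketched as below. The function name and record format are invented; this is only the general single-file durability idea, not RDM's implementation:

```python
# Sketch of single-file durability: a transaction is durable once its
# record is appended and the one file is fsynced. Skipping the fsync
# trades durability of recent transactions for throughput.
import os
import tempfile

def append_durable(path, payload: bytes, sync: bool = True):
    # O_APPEND ensures the write lands at the current end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)
        if sync:          # omit for weaker durability, same consistency
            os.fsync(fd)
    finally:
        os.close(fd)
```

With `sync=False`, a crash may lose the most recent appends, but since earlier data is never overwritten, the surviving prefix of the file remains consistent.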
We have seen that the file format for RDM 14.1 is especially suited to minimizing disk I/O and has properties well suited for persistent storage, including SSDs and flash memory. Minimizing disk I/O also extends the lifetime of the devices. Many details have been left out to simplify the discussion. We hope you enjoyed it.