A new era of mass storage devices
One of the main reasons for database vacuuming is storage, not only because of the capacity but also to extend the lifetime of the devices. Protip: If you already know the difference between HDDs and database technologies, jump to the last part to see how we succeeded with our vacuuming technology.
Hard disk drives (HDDs) have ruled the market for many years. Recently a new technology has emerged with solid-state drives (SSDs). Both of these technologies have seen a tremendous development with improvements in speed, reliability, capacity and price. As SSDs become cheaper, they have been replacing HDDs in many applications.
How HDDs work on a conceptual level is well understood by most developers but for SSDs there are a lot of misconceptions. We will therefore give a brief introduction to both technologies to set the scene for the following discussion.
Hard Disk Drives (HDDs)
HDDs store data on rotating magnetic platters using read/write heads mounted on robotic arms. Due to the use of moving parts, reads and writes on HDDs yield high latency compared to integrated circuits (ICs). Depending on the actual HDD model, the data may not be written immediately but instead stored in the controller to be written later. This approach will improve write performance at the cost of durability. To minimize the impact on durability, HDDs are often equipped with a large capacitor or a battery backup making it possible to write the data or store the data in volatile memory in the event of power outage
Because of how HDDs work, data is always read and written in units of pages. Modern HDDs have a page size of 4,096 bytes (or 4 KiB) and the actual layout of the data on the platter is not directly visible to the host.
Solid-State Drives (SSDs)
SSDs store data on special ICs without any moving parts, providing low latency on their read and write operations. compared to HDDs. These ICs are assembled in a particular way for SSDs to be cost effective. These ICs will be referred to as flash memory from here on. The description here is simplified and not entirely accurate as the details also depend on the actual manufacturer and model of the SSD.
SSDs use the same page size as modern hard drives even though the overhead of accessing a page is almost negligible compared to that of a HDD. Reading a small number of bytes from an SSD is much more efficient than from a HDD. The discussion here assumes the page size presented to the host to be the same as the page size of the flash memory. As with the HDD, the actual layout of the data in the flash memory is not directly visible to the host.
The flash memory is organized as a number of sectors and each sector has a number of pages. Each page can be written a limited number of times and before the page can be updated, the whole sector where the page resides has to be erased. This scheme has some ramifications.
When data is updated, SSDs write it to a “fresh” page (copy-on-write), or a page that has not yet been written to since the last time the sector was erased. This write makes the page “in use.” The page that holds the old data is said to be “stale.”
In addition to normal writes, pages in use are regularly copied from sectors with a lot of stale pages to sectors with fresh pages. A sector that has all its pages marked as stale is erased, making all the pages on that sector fresh. This has the net effect of more fresh pages available for normal writes. Erasing a sector can be an expensive operation and often is the bottleneck for the write performance. This process is referred to as garbage collection.
More efficient garbage collection is possible with more sectors with a lot of stale pages. There are in principle three types of data changes; data creations, data updates, and data deletes. Traditional hard drives do not deal with data deletion as pages are instead ignored by the file system. For a solid-state drive, garbage collection can be performed more efficiently if it knows about these ignored pages. This is implemented by the ATA command TRIM or the SCSI command UNMAP.
There are a lot of other aspects about SSDs that need not be mentioned here for the following discussion. Please see Wikipedia for further details.
Traditional databases use data structures that minimize the number of pages read. This approach is crucial when traditional HDDs are used as the time it takes to retrieve the data is proportional to the number of pages read. This is also true for SSDs to some extent as the I/O channels are optimized for HDDs, but the data can be retrieved an order of magnitudes faster from an SSD than from an HDD.
Databases normally implement durability (one of the properties of ACID) using a transaction log, which may or may not include the actual user data. The pages written for the transaction log and the pages written for the updated database image are likely to end up on the same sector since they are written around the same time. This is especially true for small transactions. The pages for the transaction log will normally be made stale before the recently written pages of the database image. This means that those sectors may soon end up with some stale pages that will take up space, or the sectors may soon reach the threshold for garbage collection. Performance may decrease in both cases.
We took a different approach with RDM 15.0. On the file system level, the database consists of a number of pack files that hold user data, indexes, and metadata needed for recovery and vacuuming. These pack files are collectively referred to as a “pack.”
The File System Perspective of the Pack
This section describes some characteristics of the pack as observed by the file system and how they affect the I/O performance of the SSD.
A pack consists of a number of pack files, each with a maximum size of 2 GiB. Pack files are written sequentially by appending to the end. The total size of pack files are held below a threshold. When pack files are deleted, they are always deleted in the order they were created.
Since pack files are written sequentially, a sequence of pages for a given pack are likely to be clustered together, especially on a large transaction load, compared to other I/O loads. When such clustering is assumed, these sectors are not likely to be subject to garbage collection until the pack file is deleted.
If the above assumption is not the case, the database load may have to be kept on a separate partition or a separate SSD to achieve the performance needed. These details are beyond our control, but from a database design perspective, this is the best we can do to achieve I/O performance on an SSD.
Vacuuming in RaimaDB
The pack implementation will be discussed in this section. We will start with observations of previous versions of RaimaDB and present the RDM 15.0 design. If you wish to dive into the technical details, you can learn more in our documentation.
Prior versions of RaimaDB
RDMe 12.0 and older (“RDMe”) used a transaction log. This meant that data was typically written twice – first to the transaction log and then to the actual database files. RDMe also stored items with fixed length, and in some cases, additional content had to be stored separately elsewhere. This design resulted in wasted space and an extra overhead when the content had to be split up. It was not possible to do any type of compression on the data, either.
RDM 14.0 is the first version of RDM that has utilized an in-memory key-value store (ID-index) and pack format for data management as opposed to the traditional transaction log and fixed-length data slots. However, it reused space in the pack files. An update to the pack was always done using a copy-on-write from the original page to the one that had available space. The ID-index kept track of this unused space. This tracking and the data structures needed to efficiently utilize the unused space turned out to be quite expensive. It also prevented reuse of space in combination with bulk write. The I/O performance was far from optimum, either, as data was written into pages when only a fraction of the page was updated.
RDM 14.1 and RDM 14.2 took a slightly different approach. In the following sections, we will mainly discuss the 15.0 design which is similar to the 14.2 design. The 15.0 design also shares some characteristics with the 14.1 design.
RDM 15.0 addresses all the above issues by taking a completely different approach. It still does copy on write but never reuses the unused space in pack files. Instead, it always writes to the end of the last pack file with data structures that do not utilize an ID-index. This means being able to avoid writing to pages with only partial space available.
Unused space in pack files are cleared with a process called “vacuuming.” Vacuuming is similar to garbage collection discussed earlier, except that vacuuming operates on pack files instead of sectors on an SSD. If done right, vacuuming will improve garbage collection considerably.
Disk Space vs Pages Used
Vacuuming has two main objectives weighed against each other: Decreasing the total amount of disk space for a database and decreasing the number of pages the database actually resides on. Optimizing for one could work against the other. RDM 15.0 always vacuums the oldest pack file,” which will favor the first objective. If the oldest pack file is not referenced and otherwise not needed, it is deleted and only then do we actually delete the pack file. RDM 14.1 favored the second objective by vacuuming the pack file with the lowest relative “in-use.” RDM 15.0 can get away with less metadata in the pack since it only vacuums the oldest pack files. This approach also has the benefit of not having to copy the metadata for deleted data as part of vacuuming, further reducing the cost of vacuuming. And finally, optimizing disk space also improves SSD performance.
Read vs Writes
One aspect that should not be overlooked is read versus writes. As we have pointed out above, aggressive vacuuming will limit the overall size of the database. Often more importantly, aggressive vacuuming will cluster data that are in use together with less wasted space in between. This is significant when it comes to what is actually located on each page. Keeping data clustered can mean that more data that is in use may be able to fit in fewer pages and thus may stay in the working set of the file system cache, potentially reducing or eliminating the costly I/O to the underlying block device.
Trade-off between writes and total sizeThere is also a trade-off between the amount of data written and the total size of the pack. Depending on how the application should be designed, or how the administrator or the user has to configure the system, it may be hard to know exactly how to strike the right balance between the two. That is probably the main challenge with this approach. However, keep in mind that the alternatives to this approach are not very efficient. We have seen how inefficient it can be to reuse space within pack files or use a transaction log.
The net result of this approach is that data may be written several times like RDM 12.0 and older, but the difference here is that writes for RDM 15.0 are clustered and, with a proper threshold for vacuuming, will yield excellent I/O performance and make garbage collection in the SSD very efficient as discussed previously.
Pack File Format
The pack file format is important for efficient vacuuming. First, a couple of observations. In principle, vacuuming can be done in a few ways. RDM 14.1 had the capability to vacuum clustered by the slot-ID in the ID-index or clustered by the old position in the pack file. Both of which relied on an ID-index and certain metadata in the pack files.
The approach RDM 15.0 takes is to traverse the data structures and move anything that currently resides in pack files that are subject to vacuuming by appending it to the last pack file. Data structures that contain references to data that are copied must also be moved. The data structures in RDM 15.0 therefore have information about the oldest pack file that can be reached directly or indirectly. This makes it possible to vacuum content efficiently without having to traverse more data than necessary. This approach also has the benefit that content gets clustered based on their keys, making subsequent lookups more efficient.
Need for Locks
Furthermore, vacuuming can be done without acquiring write locks. This means that readers are not particularly affected by vacuuming. Updates are affected as vacuuming for a given table cannot be done in parallel while an update is ongoing for that table; however, it is expected that the amount of data to be vacuumed can be an order of magnitude less than the amount of data written for an update, though often it may just be a factor of two or three depending on how aggressively we vacuum.
We have seen that the file format for RDM 15.0 is especially well suited to minimize disk I/O and has properties that are very well suited for persistent storage including SSD and flash memory. By minimizing disk, I/O the devices will have extended lifetime. Many details have been left out to simplify the discussion. We hope you enjoyed it.
Get started with RaimaDB today
Try RaimaDB for free today and see how screaming fast data management can get you to market on schedule and under budget.