Some use cases require data to be stored for a limited time. Time series is one such use case where data is keyed using a timestamp. Measurements may be collected from sensors at a high frequency, keyed and stored, retrieved later using the key, and then analyzed. Due to the volume, some of this data may need to be purged at a later time. This is where circular tables and RDM’s optimized primary keys thrive.
Circular tables and the redesigned primary keys for RDM 15.0 are optimized with this use case in mind. When data is stored in circular tables, the assumption is that the data in general are received and inserted in key order¹. When the circular table is “full,” the oldest data is automatically purged from the database. This pattern of operation can be an order of magnitude more efficient compared to inserts that are not ordered². This may be the case both with respect to CPU-time and the amount of I/O. The reason for this increased efficiency is that the internal data structures are very efficient when all the inserts are with increasing primary key values and all the deletes are with the minimum primary key values.
Another design decision in RDM that contributes to the efficiency of circular tables is that RDM does copy-on-write without using a transaction log. This is beneficial for two reasons. First, without a transaction log, the data is written only once.³ Second, with copy-on-write, a snapshot of data can be implemented very efficiently, giving access to data without needing locks and still having a consistent view of data previously written.
Data added to circular tables and committed can be made available for a snapshot shortly thereafter. Applications that do this will observe greater performance compared to applications that use the data at a much later time. This is due to caching. Without an efficient snapshot implementation, it might not have been possible to use the data shortly after insertion due to the performance penalty caused by having to rely on locks.
Lastly, circular tables where the oldest data is purged often do not end up being copied during vacuuming since the data have already been purged when the actual pack file is subject to vacuuming.
Using a Primary Key instead of a Row-ID
Previous versions of RDM always used the Row-ID to locate row data. With RDM 15.0, we changed that. In the case where a primary key is defined, the primary key value is used instead of the Row-ID. This has a couple of additional benefits. First, for foreign references, we no longer need to look up the foreign row to find the corresponding primary key since the primary key is used for the reference. Second, the row can be encoded without including the primary key since the primary key is used for any reference to it.
RDM has the capability of opening a number of databases for read in one view (database union). The optimized primary key, where we use a primary key instead of a Row-ID to locate the row data, makes the union more efficient at locating the date ordered by the primary key. It often did not make sense to retrieve the data in Row-ID order where rows with the same Row-ID from different databases were placed together, since the Row-IDs from different databases may not be related.
¹Or close to key order
²This statement is less true if the amount of data for each primary key value is large or transactions only insert one or a few rows at a time
³Assuming vacuuming does not happen