This article is different from my usual posts. It explains things that may be obvious to many database professionals – not all of them though.
The idea came indirectly from my friend Francesco Allertsen. He has a weekly mailing list he uses to share links to interesting articles he reads on the web. One of them was The hidden components of Web caching. Its purpose is to list all caches that play some role when we interact with a web site. An interesting idea, even if I find it incomplete. So I thought it was a good idea to talk about caches that we hit whenever we interact with a database.
Why should we care?
But first, a note on why we should care:
- Caches increase the speed of IO by orders of magnitude. Latency numbers that everyone should know gives you an idea of that. The first comment suggests to use a solar system image to visualise the scale; curiously I had the same idea, and I used this image for some slides I made in the past.
- When reliability is important, caches can get in the way in a dangerous way. Because caches are volatile, so they won’t survive a crash or other types of hardware/software failure (aka bugs and limitations). So when we write data, for certain use cases (financial transactions, etc) at no time data should be cached but not written to disk. Or it can happen temporarily before the database says “Ok Mr. Application, I got your data, you can resume your job”.
What guarantees this requirement? Well, caches can be write-through, write-around or write-back. Write-through and write-around caches are reliable in this respect, because data is always written to disk before the control is returned to the writer. Write-back are not reliable, because data is made persistent asynchronously, after the control has been returned. But of course they are much faster because they allow to groups more IO operations together (+ throughput), because latency is very low and because there is no stall if the IO capacity is currently saturated.
It’s all about tradeoffs
Depending on the use case, we will have to choice the proper tradeoff between reliability and performance. For example, InnoDB allows to:
- Flush changes to disks at every commit: even if mysqld crashes, no data loss is possible if you don’t hit any software/hardware bug;
- Flush changes to the filesystem cache at every commit: a filesystem crash is the only event that can cause data loss, and it is not likely;
- Flush data to disk once a second, or even longer intervals.
Also, when we make the choice, we should keep into account data redundancy. For example, if we run a Galera Cluster, we have at least 3 copies of the data on different servers. Or we could store data on a RAID array, which also guarantees that we have multiple copies of data. Failover guarantees that our services don’t break if one copy gets damaged, and we can restore it from another copy. In such cases, even if we deal with critical data, we don’t necessarily need InnoDB to store data in the most reliable way – which is the slowest.
Types of caches
Enough blah blah. Here is the list of caches that could get in the way when we try to persist our data, starting from the lowest levels:
Disk caches – Both spinning disks and SSD can have caches. In your laptop, this cache is most probably write-back. This can usually be changed with something like hdparm.
Drive controllers and RAIDs caches – These can be write-through or write-back, and usually they are configurable. Notice that they could also be battery-backed, which means that they will survive a crash (unless the device has no power for a long time). Battery-backed caches can safely be write-back, but for other caches a write-through strategy could be necessary. Battery-backed RAIDs need a periodic learning cycle. A learning cycle slows down all operations sensibly, but it is necessary to be sure that the battery is totally charged. This operation should be scheduled carefully.
Filesystem cache – You can use it in a write-through or write-back fashion. This topic is amazingly complex (just like the previous ones), so I’ll just give you a link to a wonderful article: Files are Hard.
Virtual machines – Virtual machines have a disk interface cache. Its write strategy depends on the cache mode. There are several cache modes, but here we’ll only mention the reliable ones:
none, which means that the VM doesn’t cache data (but the host system can), and
writethrough, whose meaning should now be clear. Virtual machines also have a filesystem cache, of course. Note that having reliable settings on the VM guarantees that data changes will survive if the VM or anything running in it will crash; but if the host doesn’t have reliable settings and it crashes, most recent changes could be lost. Still, in production, typically a hypervisor runs many VMs. If many VMs bypass the cache, hypervisor’s IO capacity can be easily saturated. It could be better to build a database cluster with VMs running on separate physical hosts, to reduce the risk of data loss in case of crashes – in other words, often it’s better to rely on redundancy and failover, rather than reliability of individual systems.
Transaction log buffer – Terminology differ from DBMS to DBMS (WAL, transaction logs…), but the idea is that changes are persistent when they hit these logs. They will also need to be written to data files, but if they are in the logs they are safe. These logs have a buffer, which contains data not yet flushed. This idea can be confusing for some, so I’ll make it clear: this speeds up things, but doesn’t cause any danger. If your flush strategy is a write-through one, the buffer will contain not yet committed changes, which are flushed on commit – and only after the flush, the DBMS will report success. Regardless your flush strategy, some changes are flushed if the buffer gets full.
Binary log buffer – There is not necessarily a binary log separated from the transaction logs. MySQL has it because its architecture requires it – binary log contains all changes to data and it’s handled by the server, transaction logs contain information necessary to replay or rollback transactions and it’s handled by InnoDB (actually even non-transactional storage engines can have logs, but I’m not going to discuss this here). Considerations about the transaction logs apply to the binary log as well, keeping in mind that its purpose is different (incremental backups and replication, not crash recovery). In Postgres you have WAL files, which are used both for incremental backups/replication and for crash recovery.
Buffer pool – Most databases (Postgres is a famous exception) have a buffer pool to cache frequently accessed data and indexes. It can even contain dirty pages: changes that are not yet written to data files. This makes things much faster. And again: changes are persistent when they are written to transaction logs. Even after a crash, data files can be repaired using transaction logs.
Session buffers, work mem – These buffers speed up parts of query execution, like joins and sorting. However they have nothing to do with writes.
Query cache – MySQL older versions, MariaDB, and maybe other DBMS’s (not sure, sorry) have a query cache. This can speed up reads when the very same query is ran often. “Very same” means that hashes of the queries are compared, so any difference is relevant, including whitespaces. Every time a table is written, all queries mentioning the table are invalidated in the cache. This and its well-known scalability problems make it usually a bad idea, at least in MariaDB/MySQL (there are exceptions – for example, if you have small concurrency, a reasonable number of very slow queries and not many writes).
Proxy caches – Proxies, like ProxySQL, can also have a query cache. It can have problems and it is not necessarily useful, but at least it is supposed to be built with scalability in mind (because proxies are about scalability).
Redis & friends – This should be obvious: retrieving data from a cache system (like Redis or Memcached) is much faster than retrieving it from MySQL. Usually those data have a TTL (time to live), which determines when they will expire, and they can also be invalidated manually. Keep in mind that this makes response times unpredictable: if data is cached response time is X, if it is expired time is Y – where X and Y could be very different. It is even more unpredictable if this cache is not enough to contain all your hot data. So you should be careful about what to cache, unless your dataset is small. Note that these caches could also use the disks: for example, older Redis versions had Virtual Memory (currently deprecated). But we will not dig into this, as our focus is the persistent database. The point is: these caches can avoid database queries, but not always.
Application – No matter how fast proxy’s query cache, Redis and Memcached are: retrieving data from local RAM is much faster. No network round trip, no other servers response time involved. Of course you shouldn’t cache locally a big amount of data, or your memory will not be enough and your application could suffer. And cache invalidation can be a very complex problem to consider. But still, for hottest small data, local memory is the fastest option. To avoid making the response time unpredictable, it’s better to keep application-level caches updated, instead of running queries when an entry expires. Writes to the database are still necessary, and they can be synchronous or asynchronous, depending on how critical these data are.
Trust no one
A famous fairy said that some lies have short legs and others have long a nose. If hard disks, controllers and even filesystems had noses, some of them would have a long nose.
I will not dig into this complex topic myself, but the take away of this paragraph is: don’t trust them. They sometimes lie about consistency, so benchmarks are more impressive and marketing people are happy. Instead, try diskchecker.pl. It will tell you if something in your system is lying. It will not tell you if it is the hard disk, or the controller, or something in the OS. But it will tell you if data it writes are actually persisted immediately.
If your data are on the cloud, you cannot use this tool – because it involves shutting down the physical server suddenly while a file is being written. I am not aware of any tool or procedure to check if your cloud provider is lying about persistence. If you know one, please write a comment to this post. That would be much appreciated.
Databases don’t lie – at least, I am not aware of any DBMS or object store lying about persistence. But they have bugs, just like any piece of software, so you should check them periodically. Here is a PostgreSQL example.