In our previous post, we debugged complex race conditions in our failure-handling mechanisms. Our log system can now survive node failures and continue making progress through segment transitions. Is this enough to build a cloud database using our log system? There is still one missing feature that would make building a real system much easier: single-producer guarantees.
Our happy-path design assumed we could write to the same three nodes forever. That assumption breaks down the moment any node becomes unavailable, and even then we still want to be able to append log entries.
In the previous post, we introduced failures into our system and added logic to open a new segment when a write fails. This design already looked quite robust, but unfortunately there is still a complex liveness bug. In this post, we will play detective and hunt it down with a visualization tool. As usual, the full code for this post can be found on GitHub.
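As a rough illustration of the mechanism this post revolves around, here is a minimal, hypothetical TLA+ sketch of "open a new segment when a write fails". The module name, constants, and actions are invented for this overview and are not taken from the series' actual specification: when a replica of the active segment is down, writing continues by opening a fresh segment on healthy nodes.

```tla
------------------------- MODULE SegmentTransitionSketch -------------------------
\* Hypothetical toy sketch of the segment-transition idea, not the series' spec.
EXTENDS Naturals, Sequences, FiniteSets

CONSTANTS Nodes,         \* e.g. {"n1", "n2", "n3", "n4"}
          ReplicaCount,  \* e.g. 3
          MaxSegments    \* bound for model checking, e.g. 3

VARIABLES down,      \* set of nodes that have failed
          segments   \* sequence of replica sets, one per segment; the last is active

vars == <<down, segments>>

Active == segments[Len(segments)]

Init ==
    /\ down = {}
    /\ \E r \in SUBSET Nodes :
          /\ Cardinality(r) = ReplicaCount
          /\ segments = <<r>>

\* Any node may fail at any time.
NodeFails(n) ==
    /\ n \notin down
    /\ down' = down \cup {n}
    /\ UNCHANGED segments

\* A write to the active segment fails because one of its replicas is down,
\* so a new segment is opened on a set of healthy nodes.
OpenNewSegment ==
    /\ Active \cap down /= {}
    /\ Len(segments) < MaxSegments
    /\ \E r \in SUBSET (Nodes \ down) :
          /\ Cardinality(r) = ReplicaCount
          /\ segments' = Append(segments, r)
    /\ UNCHANGED down

Next == OpenNewSegment \/ \E n \in Nodes : NodeFails(n)

Spec == Init /\ [][Next]_vars
===================================================================================
```

A sketch like this only shows the transition itself; it says nothing about when the system is guaranteed to keep making progress, which is exactly the kind of property a liveness bug violates.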
In our previous post, we explored why the Taurus approach to distributed logs is compelling. Now it is time to prove that it actually works, starting with the simplest possible scenario in which everything goes right. You can find the full specification on GitHub.
Cloud databases face a fundamental challenge: how do you remain available and durable when nodes fail? Modern cloud databases approach this by separating two concerns that used to be tightly coupled: compute and storage. The database engine becomes stateless, while the write-ahead log is replicated across multiple nodes to guarantee durability. If a database server dies, another one can pick up exactly where it left off by reading from the replicated log.
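To make the compute/storage split a bit more tangible, here is a minimal, hypothetical TLA+ sketch of the idea. The module name, constants, and actions are invented for illustration and are not taken from the specification in this series: a replicated write-ahead log receives appends, and a freshly started, stateless database server rebuilds its state purely by replaying that log.

```tla
--------------------------- MODULE ReplicatedWalSketch ---------------------------
\* Hypothetical toy sketch of a stateless engine over a replicated WAL.
\* Not the series' specification; all names and constants are made up.
EXTENDS Naturals, Sequences

CONSTANTS LogNodes,  \* nodes holding a copy of the write-ahead log
          Entries    \* the log entries the engine may append

VARIABLES wal,      \* wal[n]: the copy of the log stored on node n
          dbState   \* the log prefix the current database server has replayed

vars == <<wal, dbState>>

Init ==
    /\ wal = [n \in LogNodes |-> <<>>]
    /\ dbState = <<>>

\* The engine appends an entry; in this toy model it is written to every
\* replica, glossing over the failure handling discussed in the later posts.
WriteEntry(e) ==
    /\ wal' = [n \in LogNodes |-> Append(wal[n], e)]
    /\ UNCHANGED dbState

\* The database server dies and a fresh, stateless one takes over:
\* it rebuilds its state purely by reading the replicated log.
Recover ==
    /\ \E n \in LogNodes : dbState' = wal[n]
    /\ UNCHANGED wal

Next == Recover \/ \E e \in Entries : WriteEntry(e)

Spec == Init /\ [][Next]_vars
===================================================================================
```

The rest of the series is essentially about what it takes for this simple picture to keep holding once individual log nodes start failing.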