This article is a case study of how we improved stability in our critical application. It’s mostly a technical analysis of what happens in fresh Java based instance, how JIT Compiler toyed with us at application start and how we learned to control it.| blog.allegro.tech
Site performance is very important, first of all, from the perspective of users, who expect a good experience when visiting the site. The user should not wait too long for the page to load. We all know how annoying it can be when we want to press an element and it jumps to another place on the page or when we click on a button and then nothing happens for a very long time. The state of a site’s performance in these aspects is measured by Web Vitals performance metrics and most importantly b...| blog.allegro.tech
This story shows our journey in addressing a platform stability issue related to autoscaling, which, paradoxically, added some additional overhead instead of reducing the load. A pivotal part of this narrative is how we used Couchbase — a distributed NoSQL database. If you find yourself intrigued by another enigmatic story involving Couchbase, don’t miss my blog post on tuning expired doc settings.| blog.allegro.tech
At Allegro, we use Kafka as a backbone for asynchronous communication between microservices. With up to 300k messages published and 1M messages consumed every second, it is a key part of our infrastructure. A few months ago, in our main Kafka cluster, we noticed the following discrepancy: while median response times for produce requests were in single-digit milliseconds, the tail latency was much worse. Namely, the p99 latency was up to 1 second, and the p999 latency was up to 3 seconds. This...| blog.allegro.tech