The Big Question
What is Chronicle Queue?
Chronicle Queue is a persisted journal of messages which supports concurrent writers and readers even across multiple JVMs on the same machine. Every reader sees every message, and a reader can join at any time and still see every message. We assume that you can read/scan through messages fast enough that, even if you are not interested in most messages, it will still be fast enough. These are not consumers as such, and messages do not disappear after reading them.
- a message can be replayed as many times as is needed.
- a day of production messages can be replayed in a testing environment months later.
- reduces the requirement for logging almost entirely, speeding up your application.
- greater transparency leads to optimisations that you might have missed without a detailed record of every event that you process.
One of our clients, first implemented their trading system without Chronicle Queue and achieved an average latency of 35 micro-seconds tick to trade. However, after switching to Chronicle Queue, the latency was reduced to 23 micro-seconds.
What Makes Chronicle Queue Different from Similar Solutions?
Speed
Chronicle Queue is designed to comfortably support hundreds of thousands of messages per second. It can also handle multi-second bursts into the millions of messages per second. One of our clients reported they are handling bursts of 24 million messages per second with a cluster of 6 servers.
Without Flow Control
Chronicle Queue is an unbounded queue without flow control between the producer and the consumer. Replication supports flow control, but this is not used by default. A lack of flow control is known as a “producer-centric” solution.
Consumer-centric solution
To put that in context, most messaging solutions have flow control; this is a “consumer-centric” solution. Consumer-centric solutions make sense in many applications, especially between servers and client GUIs. Any data that you send to a client GUI is for display to a human. Desktop applications can be run on a variety of machines, over a variety of networks, and a human can only see a limited rate of data. Updates of more than about 20 times a second just appear as a blur, and are counter productive.
It makes sense for a desktop application to receive only as much data as it can handle, and to push back on the server when it cannot handle the data rate.
Reactive Streams are an excellent way to model such client centric solutions.
Producer-centric solution
Chronicle Queue deliberately does not use flow control, as flow control is not always possible, or even desirable.
Some examples of systems where Chronicle Queue is often used are:
- Market data gateways. You cannot slow down an exchange because you are not keeping up.
- Compliance systems. You have to feed information to them, but you never want to be slowed down by them.
- Core trading systems. As these feed from Market Data gateways, you want them to be as fast as possible, all the time.
In a Producer-centric solution, the producer is not slowed down by a slow consumer, and the consumer is never more than main memory behind the producer. The consumer might be an overnight batch job, and be a whole day, or week behind.
The producer-centric solution has a number of advantages:
- You can reproduce a bug, even if it only occurs once in a million messages, by replaying all the messages which led to that bug triggering. More importantly, you can have confidence that the bug, rather than just a bug has been fixed.
- You can test every micro-service independently, as there is no flow control interaction between them. If you have flow control, say 20 services, any one of those services could slow down any producer, until your entire system locks up. This means the only real performance test is a complete system test.
- You can test a micro-service, replaying from the same input file repeatedly, without the producer or down stream consumers running.
- You can restart and upgrade a service, by replaying its outputs to ensure it is committed to the decisions/outcomes it has produced in the past. This allows you to change your service and know that, even those your new version might have made different decision, it can honour those already made.
- Flow control is designed to smooth out burst and jitter in the system; but it can also hide such jitter as result.
The main disadvantage is that this assumes disk space is cheap. If you look at retail prices for enterprise grade disks, you can get TBs of disk for a few hundred dollars. Many investment banks do not take this approach, and internal charge rates for managed storage can be orders of magnitude higher. I have worked on systems that have more free memory than free disk space.
What technology does Chronicle Queue Use?
What might be surprising is that Chronicle Queue is written entirely in pure Java
. It can outperform many data storage solutions written in C
. You might be wondering, how is this possible given that well written C
is usually faster than Java.
You need a degree of protection between your application and your data storage to minimise the risk of corruption. As Java
uses a JVM
, it already has an abstraction layer, and a degree of protection.If an application throws an exception, this does not mean the data structure is corrupted. To get a degree of isolation in C
, many data storage solutions use TCP
. The overhead of using TCP
, even over loopback, can exceed the benefit of using C
, and the throughput/latencies of Chronicle Queue can be 10x
greater by being able to use the data in-process. Chronicle Queue supports sharing of the data structure in memory for multiple JVMs, avoiding the need to use TCP to share data.
How Do We Reduce garbage?
For the most latency-sensitive systems, you may want to keep your allocation rate to below 300
KB/s. At this rate you will produce less than 24
GB of garbage a day, and if your Eden space
is larger than this, you can run all day without a minor collection. A GC is something that you can do as an overnight maintenance task. Reduce your garbage-per-day to less than 5
GB, and you might be able to run all week without a GC.
We have a number of strategies to minimise garbage; the key one being that we translate directly between on-heap and native memory without intermediate temporary objects.
We use object pools where appropriate, and we support reading into mutable objects.