How to make Java sockets faster

In this article I will share the experience of making fast faster by stripping off safety nets carefully developed to prevent people from shooting themselves in the foot.

In modern Java development, there are many libraries saving developers from getting their hands dirty coding sockets and network channels – Netty being probably the most popular low-level library out there. But here at Chronicle, we sometimes suffer from NIH syndrome and like to re-implement things… Or, seriously speaking, we constantly strive to go faster and be garbage-free, while Netty, being a general-purpose library, doesn't give us enough flexibility or performance.

We ended up implementing our own network layer based directly on the SocketChannel API, as a by-product of our other products which needed to work with the network – namely, our FIX engine and Chronicle Queue replication. Afterwards we asked ourselves – how can we make it even faster? Not that it was slow – it was already pretty fast, but we are obsessed with speed here, and we love to dig further, so we started to look more closely at what Java does when we call SocketChannel#write() or SocketChannel#read(). We invite readers to check the code of sun.nio.ch.SocketChannelImpl; below is an excerpt from SocketChannelImpl#read() with the comments cut out:

public int read(ByteBuffer buf) throws IOException {
    if (buf == null)
        throw new NullPointerException();
    synchronized (readLock) {
        if (!ensureReadOpen())
            return -1;
        int n = 0;
        try {
            begin();
            synchronized (stateLock) {
                if (!isOpen()) {
                    return 0;
                }
                readerThread = NativeThread.current();
            }
            for (;;) {
                n = IOUtil.read(fd, buf, -1, nd);
                if ((n == IOStatus.INTERRUPTED) && isOpen()) {
                    // The system call was interrupted but the channel
                    // is still open, so retry
                    continue;
                }
                return IOStatus.normalize(n);
            }
        } finally {
            readerCleanup();
            end(n > 0 || (n == IOStatus.UNAVAILABLE));
            synchronized (stateLock) {
                if ((n <= 0) && (!isInputOpen))
                    return IOStatus.EOF;
            }
            assert IOStatus.check(n);
        }
    }
}

So in this (fairly convoluted) code, we can see three synchronized blocks, and quite a bit of code to handle interrupts (in particular, the call to SocketChannelImpl#begin() sets up an interruptor to properly implement AbstractInterruptibleChannel).
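To make the cost of that interrupt handling concrete, here is a minimal sketch (not JDK or Chronicle code; the class and method names are made up) of the begin()/end() protocol that AbstractInterruptibleChannel requires around every blocking operation:

import java.io.IOException;
import java.nio.channels.spi.AbstractInterruptibleChannel;

// A minimal illustration of the interruptible-I/O protocol: begin() installs an
// interrupt hook for the current thread, and end() turns a pending interrupt or an
// asynchronous close into the appropriate exception. SocketChannelImpl#read() pays
// for this bookkeeping on every call.
abstract class InterruptibleReadSketch extends AbstractInterruptibleChannel {

    int readLikeOperation() throws IOException {
        int n = 0;
        try {
            begin();               // register the interruptor for the current thread
            n = doBlockingRead();  // stand-in for the actual blocking syscall
        } finally {
            end(n > 0);            // may throw AsynchronousCloseException or ClosedByInterruptException
        }
        return n;
    }

    // Hypothetical blocking read, not part of the JDK.
    abstract int doBlockingRead() throws IOException;
}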

Next, IOUtil#read() looks like the code that actually does the reading? Well, not quite:

static int read(FileDescriptor fd, ByteBuffer dst, long position, NativeDispatcher nd) throws IOException
{
    if (dst.isReadOnly())
        throw new IllegalArgumentException("Read-only buffer");
    if (dst instanceof DirectBuffer)
        return readIntoNativeBuffer(fd, dst, position, nd);
    // Substitute a native buffer
    ByteBuffer bb = Util.getTemporaryDirectBuffer(dst.remaining());
    try {
        int n = readIntoNativeBuffer(fd, bb, position, nd);
        bb.flip();
        if (n > 0)
            dst.put(bb);
        return n;
    } finally {
        Util.offerFirstTemporaryDirectBuffer(bb);
    }
}

private static int readIntoNativeBuffer(FileDescriptor fd, ByteBuffer bb, long position, NativeDispatcher nd)
    throws IOException
{
    int pos = bb.position();
    int lim = bb.limit();
    assert (pos <= lim);
    int rem = (pos <= lim ? lim - pos : 0);
    if (rem == 0)
        return 0;
    int n = 0;
    if (position != -1) {
        n = nd.pread(fd, ((DirectBuffer)bb).address() + pos,
                     rem, position);
    } else {
        n = nd.read(fd, ((DirectBuffer)bb).address() + pos, rem);
    }
    if (n > 0)
        bb.position(pos + n);
    return n;
}

So the code here works out whether it has been given a heap buffer or a direct buffer, and if it's a heap buffer it substitutes a temporary direct buffer, reads into that, and copies the result back before returning. Then SocketDispatcher#read() just forwards the call to FileDispatcherImpl#read0(), which is a native call (we are not interested in #pread() as we know that we called this code with position == -1):

static native int read0(FileDescriptor fd, long address, int len) throws IOException;
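As an aside, this is why the choice of buffer matters: a heap ByteBuffer takes the temporary-direct-buffer detour shown above, whereas a direct ByteBuffer is handed to the native read by address. A trivial illustration (the class name and buffer size are arbitrary):

import java.nio.ByteBuffer;

public class BufferChoice {
    public static void main(String[] args) {
        // Heap buffer: backed by a byte[], so IOUtil copies via a temporary direct buffer on every read.
        ByteBuffer heap = ByteBuffer.allocate(64 * 1024);
        // Direct buffer: native memory, passed to read0() by address with no intermediate copy.
        ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024);
        System.out.println(heap.isDirect() + " / " + direct.isDirect()); // prints: false / true
    }
}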

Now we can analyse whether we can make this faster. In the Chronicle world, we like our network handlers to be single-threaded (to be precise, we like all our code to be single-threaded, and achieve concurrency by using multiple processes instead). Also, we prefer using direct memory (we have our own layer on top of raw ByteBuffer called Chronicle-Bytes, but under the bonnet it's still a direct ByteBuffer).

So can we avoid the cost of all the locking, synchronization and call dispatching? Surely! With a bit of reflective magic, instead of calling SocketChannel#read() we do:

@Override
public int read(ByteBuffer buf) throws IOException {
    if (buf == null)
        throw new NullPointerException();
    if (isBlocking() || !isOpen() || !(buf instanceof DirectBuffer))
        return super.read(buf);
    return readInternal(buf);
}

int readInternal(ByteBuffer buf) throws IOException {
    int n = OS.read0(fd, ((DirectBuffer) buf).address() + buf.position(), buf.remaining());
    if ((n == IOStatus.INTERRUPTED) && socketChannel.isOpen()) {
        // The system call was interrupted but the channel
        // is still open, so retry
        return 0;
    }
    int ret = IOStatus.normalize(n);
    if (ret > 0)
        buf.position(buf.position() + ret);
    else if (ret < 0)
        open = false;
    return ret;
}

Here we don’t care about blocking sockets (we don’t use them), closed sockets (if it’s closed it’s probably pointless to optimise it?), or heap byte buffers (again, we don’t use those), so we just forward those cases to the normal SocketChannelImpl. But for the main case we care about, we use the MethodHandle API, which is, simply speaking, compile-time reflection, meaning there’s no reflection overhead at runtime (see the java.lang.invoke.MethodHandle javadocs for more details).
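For the curious, below is a rough sketch of how such a handle could be obtained. It is not our production code: the class name RawSocketReads is made up, the exact home and signature of read0() differ between JDK versions, and on Java 9+ the sun.nio.ch package must be opened to your code for setAccessible() to succeed.

import java.io.FileDescriptor;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.reflect.Method;

// Illustrative only: look up the private static native read0() once at start-up,
// so the per-call cost is just an invokeExact() that the JIT can inline.
final class RawSocketReads {
    private static final MethodHandle READ0;

    static {
        try {
            Class<?> dispatcher = Class.forName("sun.nio.ch.FileDispatcherImpl");
            Method m = dispatcher.getDeclaredMethod("read0",
                    FileDescriptor.class, long.class, int.class);
            m.setAccessible(true); // needs --add-opens java.base/sun.nio.ch=ALL-UNNAMED on Java 9+
            READ0 = MethodHandles.lookup().unreflect(m);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Reads up to len bytes from the socket's file descriptor into native memory at address.
    static int read0(FileDescriptor fd, long address, int len) throws Throwable {
        return (int) READ0.invokeExact(fd, address, len);
    }

    private RawSocketReads() {
    }
}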

I will leave it as a homework assignment for the reader to do the same analysis for SocketChannel#write().

No optimization is ever done until you’ve got your numbers, right?

I used an existing benchmark which measures TCP Queue Replication performance. Unfortunately, as Queue Replication is an enterprise feature (and not open sourced), I am unable to provide the source code for the benchmark, but it’s fairly trivial to write your own to find out how big the difference is for a particular usage scenario (we use JLBH for our benchmarks).
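For readers who want to roll their own, here is a rough, self-contained sketch. It is not the Chronicle benchmark and does not use JLBH; the class name, message size and iteration counts are arbitrary. It measures round-trip latency percentiles over a loopback TCP connection using direct buffers and a non-blocking client channel, so you can swap the client SocketChannel for a wrapped "fast" channel and compare the two.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Arrays;

public final class LoopbackLatencySketch {
    static final int WARMUP = 50_000;
    static final int ITERATIONS = 200_000;

    public static void main(String[] args) throws Exception {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));

            Thread echo = new Thread(() -> echoLoop(server), "echo");
            echo.setDaemon(true);
            echo.start();

            try (SocketChannel client = SocketChannel.open(server.getLocalAddress())) {
                client.setOption(StandardSocketOptions.TCP_NODELAY, true);
                client.configureBlocking(false);

                ByteBuffer buf = ByteBuffer.allocateDirect(Long.BYTES);
                long[] samples = new long[ITERATIONS];

                for (int i = -WARMUP; i < ITERATIONS; i++) {
                    long start = System.nanoTime();

                    buf.clear();
                    buf.putLong(start).flip();
                    while (buf.hasRemaining())
                        client.write(buf);           // busy-spin; single-threaded by design

                    buf.clear();
                    while (buf.hasRemaining())
                        client.read(buf);            // wait for the 8-byte echo

                    if (i >= 0)
                        samples[i] = System.nanoTime() - start;
                }

                Arrays.sort(samples);
                System.out.printf("50th: %,d ns  90th: %,d ns  99th: %,d ns%n",
                        samples[ITERATIONS / 2],
                        samples[(int) (ITERATIONS * 0.9)],
                        samples[(int) (ITERATIONS * 0.99)]);
            }
        }
    }

    // Echoes each 8-byte message straight back to the sender until the peer disconnects.
    private static void echoLoop(ServerSocketChannel server) {
        try (SocketChannel peer = server.accept()) {
            peer.setOption(StandardSocketOptions.TCP_NODELAY, true);
            ByteBuffer buf = ByteBuffer.allocateDirect(Long.BYTES);
            while (true) {
                buf.clear();
                while (buf.hasRemaining())
                    if (peer.read(buf) < 0)
                        return;
                buf.flip();
                while (buf.hasRemaining())
                    peer.write(buf);
            }
        } catch (IOException ignored) {
            // benchmark shutting down
        }
    }
}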

In my case, the benchmark showed the 50th percentile going from 6.8 micros to 5.7 micros, and the 90th percentile from 8.2 to 7.1, thus shaving off 1.1 microseconds. Although 1 microsecond might sound like very little, compared to the end-to-end latency of 5.7 micros it is a 19% difference (1.1 / 5.7 ≈ 19%) – that doesn’t sound so little anymore?! The benchmark was run on my desktop, a quad-core i7-6700K CPU @ 4.00GHz running the latest Manjaro Linux.

To find out more, or to speak to our Sales team, contact info@chronicle.software.

Author: Dmitry Pisklov (republished based on content originally posted here https://blog.pisklov.me/blog/2018-08-06/how-to-make-java-sockets-faster-or-how-to-be-naughty-with-your-jdk/)
