Tuesday 4 January 2011

Low latency systems in .NET

Not new, but I'm currently in a low-latency mood ;-)

Back in June 2009, Rapid Addition, one of the leading suppliers of front-office messaging components to the global financial services industry, published a white paper with Microsoft on how to build an ultra-low latency FIX engine with the .NET Framework 3.5...

Their white paper (and their RA Generation Zero framework) mostly explains how to prevent gen2 garbage collections.

I really like their description of a managed low-latency system's lifecycle: a startup phase (where memory resource pools are preallocated), a forced full-GC phase, and then a continuous operation phase for our low-latency systems (where we avoid triggering garbage collections by using and recycling resource pools).
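The continuous-operation recipe above boils down to a simple object pool. Here is a minimal sketch in Java rather than .NET (the mechanics are the same in both managed runtimes); `FixMessage`, its fields and the pool API are my own invented names for illustration, not Rapid Addition's:

```java
import java.util.ArrayDeque;

// Hypothetical message type; the fields are illustrative, not from the whitepaper.
final class FixMessage {
    long sequence;
    final byte[] payload = new byte[512]; // fixed-size buffer, allocated once

    void reset() { sequence = 0; }
}

final class MessagePool {
    private final ArrayDeque<FixMessage> free = new ArrayDeque<>();

    // Startup phase: preallocate everything; the caller can then force a full
    // collection before entering the continuous-operation phase.
    MessagePool(int capacity) {
        for (int i = 0; i < capacity; i++) free.push(new FixMessage());
    }

    FixMessage acquire() {
        FixMessage m = free.poll();
        if (m == null) throw new IllegalStateException("pool exhausted");
        return m;
    }

    // Continuous-operation phase: recycling instead of allocating means the
    // steady state produces no garbage, so no gen2 collections.
    void release(FixMessage m) {
        m.reset();
        free.push(m);
    }
}

public class PoolDemo {
    public static void main(String[] args) {
        MessagePool pool = new MessagePool(1024);
        FixMessage m = pool.acquire();
        m.sequence = 42;
        pool.release(m);
        System.out.println(pool.acquire() == m); // prints: true (same instance reused)
    }
}
```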

In my opinion, this white paper demonstrates that Rapid Addition really knows what it is talking about.

In another press release, they claimed that their FIX engine breaks the 10-microsecond average latency barrier at a throughput of 12,000 messages/second (still with the .NET Framework 3.5 based solution).

Very informative.

Sunday 2 January 2011

A new perspective on ultra-low latency, high-performance systems

I just watched an awesome presentation by Martin Thompson and Michael Barker. They explained how they implemented their ultra-low latency (and high-throughput) systems for the London Multi-Asset eXchange (LMAX), and it's pretty impressive: 100,000 transactions per second at less than 1 millisecond of latency, in Java...

Since I have been working on FX and low-latency systems in general for several years now, I was very intrigued by their teaser (100K...). I have to admit that I was thrilled by their presentation.

For those who don't have an hour to kill watching their video, here is a summary:

---HOW TO DO 100K TPS AT LESS THAN 1ms LATENCY----------------------------

  • You have to know how modern hardware works in order to build ultra-low latency systems
  • The advent of multi-core CPUs brought bigger and smarter caches (do you really know how CPU cache coherency works? the drawbacks of false sharing? etc.)
  • OK, the free lunch is over (for GHz), but it's time to order and use more memory!!! (144 GB servers with 64-bit addressing, for instance)
  • Disk is the new tape! (fast for sequential access); prefer SSDs for random access
  • The network is not slow anymore: 10GigE is now a commodity, and you can get sub-10-microsecond local hops with kernel-bypassing RDMA
  • (Not hardware, but) understand how the GC and the JIT work under the hood
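To make the false-sharing point above concrete, here is a small sketch (the class names and iteration count are mine, not from the talk): two threads each increment their own counter. When the two counters sit on the same 64-byte cache line, every write by one core invalidates the other core's copy of the line; padding the fields apart removes that interference:

```java
// Adjacent fields: the two counters very likely share one cache line,
// so the cores fight over it even though the threads never touch the
// same variable (false sharing).
final class SharedCounters {
    volatile long a;
    volatile long b;
}

// Padded fields: 56 bytes of filler push b onto a different cache line.
final class PaddedCounters {
    volatile long a;
    long p1, p2, p3, p4, p5, p6, p7;
    volatile long b;
}

public class FalseSharingDemo {
    static final long ITERATIONS = 10_000_000L;

    static long timeMillis(Runnable w1, Runnable w2) throws InterruptedException {
        Thread t1 = new Thread(w1), t2 = new Thread(w2);
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        SharedCounters s = new SharedCounters();
        PaddedCounters p = new PaddedCounters();

        long shared = timeMillis(
            () -> { for (long i = 0; i < ITERATIONS; i++) s.a++; },
            () -> { for (long i = 0; i < ITERATIONS; i++) s.b++; });
        long padded = timeMillis(
            () -> { for (long i = 0; i < ITERATIONS; i++) p.a++; },
            () -> { for (long i = 0; i < ITERATIONS; i++) p.b++; });

        // On most multi-core machines the padded version is noticeably faster;
        // exact numbers depend on the hardware.
        System.out.println("adjacent: " + shared + " ms, padded: " + padded + " ms");
    }
}
```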

  • Write performance tests first
  • Run them automatically and nightly to detect when you should start optimizing your code
  • But still, no early or premature performance optimizations
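A "performance test first" can be as blunt as timing the hot path and failing the build below a floor. The workload and the 1M ops/sec threshold below are made-up placeholders; a real suite would use a proper harness (JMH, for instance) and track trends across nightly runs rather than one hard-coded limit:

```java
import java.util.function.LongUnaryOperator;

public class ThroughputTest {
    // Time `iterations` applications of `op` and return the achieved rate.
    static double opsPerSecond(LongUnaryOperator op, long iterations) {
        long acc = 0;
        long start = System.nanoTime();
        for (long i = 0; i < iterations; i++) acc = op.applyAsLong(acc);
        long elapsed = System.nanoTime() - start;
        if (acc == -1) System.out.print(""); // keep the JIT from eliding the loop
        return iterations / (elapsed / 1e9);
    }

    public static void main(String[] args) {
        // Warm up so the JIT compiles the hot path before we measure.
        opsPerSecond(x -> x + 1, 1_000_000);
        double rate = opsPerSecond(x -> x + 1, 10_000_000);
        System.out.printf("%.0f ops/sec%n", rate);
        // Arbitrary regression floor for illustration only.
        if (rate < 1_000_000) throw new AssertionError("performance regression");
    }
}
```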

  • Keep the working set in memory (data and behaviour co-located)
  • Write cache-friendly code, aware of cache-line synchronization (the rebirth of arrays ;-)
  • Choose your data structures wisely
  • Queues are awful for concurrent access; use ring buffers instead (no stress: we just said we bought lots of memory ;-)
  • Use custom cache-friendly collections
  • Write simple, clean & compact code (the JIT always does better with simpler code; shorter methods are easier to inline)
  • Invest in modeling your domain. Also respect the single responsibility principle (one class, one thing; one method, one thing; ...) and the separation of concerns
  • Take the right approach to concurrency. Concurrent programming is about two things, mutual exclusion and visibility of changes, which can be implemented following two main approaches: i) a difficult locking approach (with context switches to the kernel), and ii) a VERY difficult non-blocking approach based on atomic (user-space) instructions (remember how optimistic locking mechanisms are implemented within databases). You should definitely choose the second one
  • Keep the GC under control. Because the GC may pause your application, avoid triggering it by preallocating (circular buffers) and by using a huge amount of (64-bit addressable) memory
  • Run the business logic on a single thread and push the concurrency into the infrastructure. Trying to put concurrency within the business model is far too hard and easy to get wrong. It would also ruin the OO programmer's dream of easy-to-write, testable and readable code. As a consequence, keeping the business logic single-threaded should shorten your time to market
  • Follow the Disruptor pattern, a system pattern that tries to avoid contention wherever possible (even with the business logic running on a single thread)
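Several of these points (preallocated ring buffers, lock-free coordination, a single thread on each side of the business logic) come together in the sketch below. To be clear, this is NOT the LMAX implementation, just a minimal single-producer / single-consumer ring buffer in the same spirit: slots are preallocated once, and two monotonically increasing sequence counters let the two threads coordinate without locks:

```java
import java.util.concurrent.atomic.AtomicLong;

final class SpscRing {
    private final long[] slots;          // preallocated once, recycled forever
    private final int mask;              // capacity must be a power of two
    private final AtomicLong head = new AtomicLong(0); // next slot to write
    private final AtomicLong tail = new AtomicLong(0); // next slot to read

    SpscRing(int capacityPow2) {
        slots = new long[capacityPow2];
        mask = capacityPow2 - 1;
    }

    boolean offer(long value) {          // producer thread only
        long h = head.get();
        if (h - tail.get() == slots.length) return false; // buffer full
        slots[(int) (h & mask)] = value;
        head.lazySet(h + 1);             // publish with a cheap ordered write
        return true;
    }

    Long poll() {                        // consumer thread only
        long t = tail.get();
        if (t == head.get()) return null; // buffer empty
        long v = slots[(int) (t & mask)];
        tail.lazySet(t + 1);
        return v;
    }
}

public class RingDemo {
    public static void main(String[] args) {
        SpscRing ring = new SpscRing(8);
        for (long i = 1; i <= 3; i++) ring.offer(i);
        System.out.println(ring.poll() + " " + ring.poll() + " " + ring.poll());
        // prints: 1 2 3
    }
}
```

Because each counter is written by exactly one thread, no compare-and-swap loop is needed at all; the real Disruptor generalizes this to multiple consumers with dependency graphs between them.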

OK, this presentation leaves us with lots of questions about how they implemented their systems. However, I found this video very refreshing and interesting on several points (the need for a huge amount of RAM with 64-bit addressing, the value of kernel-bypassing network technologies such as RDMA, the fact that queues are bad at handling concurrency properly, that clean and simple code doesn't prevent you from achieving excellent performance, that concurrency should stay out of the business logic, that the separation of concerns allows a very competitive time-to-market, ... and of course that you need a write-performance-tests-first approach).

For the full explanation of the Disruptor pattern and much more, watch the entire video of this presentation here: http://www.infoq.com/presentations/LMAX

Note: don't watch this video in full-screen mode, otherwise you will lose the benefit of the slides displayed below ;-(

Cheers, and happy new year to everyone ;-)