Many Clients, One Context – Share Without Sharing

20/01/2022 Maven

At Maven we need low latency. We need to react quickly to a changing market so that when interesting things happen – and they do – we can make split-second decisions to take a profit or to save ourselves from a loss.

We use C++ to achieve this, among other tools. It gives us what we need: the ability to write code which is high-level and which retains a significant amount of control over the machine.

Writing low-latency code is not easy. It requires careful planning, making trade-offs, and marking entire parts of the language as off-limits for latency-critical code. We have to remain vigilant, because a carefully crafted market handler application can easily descend into unmaintainable, unintelligible spaghetti code.

So how do we keep our code both maintainable and fast? There are a few good guidelines for achieving this which we won’t dwell on here, but the main – and perhaps most important – one is this:

Complexity minimization: if you don’t need it, don’t do it

Is it really that important? Let’s use an example. At Maven we often end up with code which looks like this…

template<typename NextT>
struct FooHandler : private NextT
{
    struct FooContext { /* ... */ };
    SomeHashMap<OrderIdType, FooContext> mContexts;

    void handle(tags::OrderModified tag,
                OrderIdType orderId,
                OrderModifiedDetail const& detail)
    {
        // Lookup the context from the order ID.
        FooContext& ourContext = mContexts[orderId];

        // Perform some action which updates the context.
        doFooTriggers(orderId, ourContext, detail);

        // Call the next handler.
        NextT::handle(tag, orderId, detail);
    }
};

template<typename NextT>
struct BarHandler : private NextT
{
    struct BarContext { /* ... */ };
    OtherHashMap<OrderIdType, BarContext> mContexts;

    void handle(tags::OrderModified tag,
                OrderIdType orderId,
                OrderModifiedDetail const& detail)
    {
        // Lookup the context from the order ID.
        BarContext& ourContext = mContexts[orderId];

        // Perform some action which updates the context.
        doBarTriggers(orderId, ourContext, detail);

        // Call the next handler.
        NextT::handle(tag, orderId, detail);
    }
};

struct TerminalHandler
{
    void handle(auto tag, auto&&... args) { }
};

using ComposedHandler = FooHandler< BarHandler< TerminalHandler > >;

void marketLoop()
{
    ComposedHandler handler;
    while (marketEvents.hasNext())
    {
        auto const& [id, detail] = marketEvents.next();
        handler.handle(tags::OrderModified{}, id, detail);
    }
}

This is a simplification, but the idea is that we keep our code maintainable by separating responsibilities. FooHandler and BarHandler each have their own distinct purpose, logic and data, and each handler should be trivially replaceable. (In real code we might end up with a graph containing hundreds of handlers, so this separation really is important.)

Can you spot the inefficiency? There are no virtual functions here: marketLoop calls FooHandler::handle, which calls BarHandler::handle, which calls TerminalHandler::handle, which returns, and we start again. If we used an object pool or custom allocators with our storage – as we should – then no heap allocations would be necessary either. But notice the innocent-looking mContexts[orderId]. It’s a hash table lookup, and we do it not once but twice.

Two or more hash lookups might not sound bad until you realise that each and every one of them may theoretically involve:

  1. auto hash = std::hash<OrderIdType>{}(orderId);
    calculation of a hash code;
  2. auto& bucket = buckets[hash & bucketMask];
    a mask (where bucketMask is a power-of-two bucket count minus one) or a modulus to convert the hash code into a bucket index;
  3. auto iter = std::find(std::begin(bucket), std::end(bucket), entry);
    if the hash is of poor quality, a linear scan through the entries sharing the bucket, with each step invoking operator==() to check if it has found its quarry (the open-addressing analogue of this scan is “linear probing”);
  4. *iter
    one or more iterator dereferences or memory lookups, each of which could incur a cache miss.
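As a concrete – and intentionally naive – illustration of those four steps, here is a toy chained hash map. ToyMap is our name for it; real implementations such as std::unordered_map differ in many details.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <optional>
#include <vector>

// Toy chained hash map, illustrating the four lookup steps above.
template<typename Key, typename Value>
class ToyMap
{
    struct Entry { Key key; Value value; };
    std::vector<std::list<Entry>> mBuckets{16};  // power-of-two bucket count

public:
    void insert(Key const& key, Value const& value)
    {
        std::size_t const hash = std::hash<Key>{}(key);
        mBuckets[hash & (mBuckets.size() - 1)].push_back({key, value});
    }

    std::optional<Value> find(Key const& key) const
    {
        std::size_t const hash = std::hash<Key>{}(key);               // 1. hash the key
        auto const& bucket = mBuckets[hash & (mBuckets.size() - 1)];  // 2. mask to a bucket
        for (auto const& entry : bucket)                              // 3. scan the bucket...
            if (entry.key == key)                                     //    ...comparing each candidate
                return entry.value;                                   // 4. dereference (possible cache miss)
        return std::nullopt;
    }
};
```

Every call to find pays for all four steps, which is exactly the cost we are about to duplicate.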

 

Yes, a hash table lookup is supposed to be amortized constant time – Big-O notation’s O(1) – but that constant can be quite large. Furthermore, with two lookups there is no guarantee that the two hash maps’ contents lie side by side in memory. They could in fact be megabytes apart.

So can we do better? Let’s introduce an OrderLookupHandler into the chain, just before FooHandler.

template<typename NextT>
struct OrderLookupHandler : private NextT
{
    using FooContext = NextT::FooContext;
    using BarContext = NextT::NextT::BarContext;
    using UnifiedContext = std::tuple<FooContext, BarContext>;

    SomeHashMap<OrderIdType, UnifiedContext> mContexts;

    void handle(tags::OrderModified tag,
                OrderIdType orderId,
                OrderModifiedDetail const& detail)
    {
        // Lookup the context from the order ID.
        UnifiedContext& context = mContexts[orderId];

        // Call the next handler with the additional context.
        NextT::handle(tag, orderId, detail, context);
    }
};

template<typename NextT_>
struct UnifiedFooHandler : private NextT_
{
    using NextT = NextT_;  // Exposes next step.
    struct FooContext { /* no longer private to UnifiedFooHandler */ };

    void handle(tags::OrderModified tag,
                OrderIdType orderId,
                OrderModifiedDetail const& detail,
                auto& unifiedContext)  // <-- it doesn't know the type
    {
        // Lookup our part of the context.
        FooContext& ourContext = std::get<FooContext>(unifiedContext);

        // Perform some action which updates the context.
        doFooTriggers(orderId, ourContext, detail);

        // Call the next handler.
        NextT::handle(tag, orderId, detail, unifiedContext);
    }
};
//...
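To see the mechanics in isolation, here is a minimal sketch of the tuple-based unified context. The context members are stand-ins; note that std::get retrieves by type, which also means the context types within the tuple must be distinct.

```cpp
#include <tuple>

// Stand-in context types; the members are illustrative.
struct FooContext { int fooCount = 0; };
struct BarContext { double barPrice = 0.0; };

// One hash map value holds every handler's context side by side.
using UnifiedContext = std::tuple<FooContext, BarContext>;

// A handler extracts only its own slice, by type.
int bumpFoo(UnifiedContext& unifiedContext)
{
    FooContext& ourContext = std::get<FooContext>(unifiedContext);
    return ++ourContext.fooCount;
}
```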

 

Now each downstream handle method takes the additional argument and – hey presto! – we turn two hash map lookups into one. But there are at least two downsides to this approach:

  1. We have a tightly-coupled OrderLookupHandler which now needs to know about the concrete types of FooContext and BarContext, and
  2. Each downstream handler also needs these concrete types, or must accept an anonymous or constrained auto&.

This works but it isn’t as maintainable as it could be. We could create a concept to constrain our auto& argument, or just be lazy and pass auto&&. But auto&& arguments allow any type to be passed, whether appropriate or not, and inappropriate types generate horrible compilation errors!

Can we do better? Can we make the context type concrete and stop other handlers from looking into our private data? Yes! If each handler, including TerminalHandler, exposes its own ContextType then we can do this…

template<typename NextT>
struct BestOrderLookupHandler : private NextT
{
    using ContextType = NextT::ContextType;

    SomeHashMap<OrderIdType, ContextType> mContexts;

    void handle(tags::OrderModified tag,
                OrderIdType orderId,
                OrderModifiedDetail const& detail)
    {
        // Lookup the context from the order ID.
        ContextType& context = mContexts[orderId];

        // Call the next handler with the additional context.
        NextT::handle(tag, orderId, detail, context);
    }
};

template<typename NextT>
struct BestFooHandler : private NextT
{
    class FooContext : public NextT::ContextType
    {
        friend class BestFooHandler;

        /* private to BestFooHandler */
    };

    void handle(tags::OrderModified tag,
                OrderIdType orderId,
                OrderModifiedDetail const& detail,
                FooContext& ourContext)
    {
        // ...

        // Calls to the next handler implicitly upcast ourContext,
        // as if by static_cast<typename NextT::ContextType&>(ourContext)
        NextT::handle(tag, orderId, detail, ourContext);
    }

    using ContextType = FooContext;
};

template<typename NextT>
struct BestBarHandler : private NextT
{
    class BarContext : public NextT::ContextType
    {
        friend class BestBarHandler;

        /* private to BestBarHandler */
    };

    void handle(tags::OrderModified tag,
                OrderIdType orderId,
                OrderModifiedDetail const& detail,
                BarContext& ourContext)
    {
        // ...

        // Calls to the next handler implicitly upcast ourContext,
        // as if by static_cast<typename NextT::ContextType&>(ourContext)
        NextT::handle(tag, orderId, detail, ourContext);
    }

    using ContextType = BarContext;
};

struct BestTerminalHandler
{
    struct ContextType { /* intentionally empty */ };

    void handle(tags::OrderModified tag,
                OrderIdType orderId,
                OrderModifiedDetail const& detail,
                ContextType& ourContext)
    {
        static_assert(std::is_empty_v<ContextType>,
                      "Our context is present and empty!");
    }
};
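Putting the pieces together, here is a condensed, runnable sketch of the final design. It is a simplification: SomeHashMap is stood in by std::unordered_map, the handler bodies are reduced to simple counters, and the public accessors exist only so the example can be checked.

```cpp
#include <unordered_map>

namespace tags { struct OrderModified {}; }

using OrderIdType = int;
struct OrderModifiedDetail { int delta = 0; };

template<typename K, typename V>
using SomeHashMap = std::unordered_map<K, V>;  // stand-in for the real map

template<typename NextT>
struct BestOrderLookupHandler : private NextT
{
    using ContextType = typename NextT::ContextType;
    SomeHashMap<OrderIdType, ContextType> mContexts;

    void handle(tags::OrderModified tag, OrderIdType orderId,
                OrderModifiedDetail const& detail)
    {
        ContextType& context = mContexts[orderId];  // the single lookup
        NextT::handle(tag, orderId, detail, context);
    }
};

template<typename NextT>
struct BestFooHandler : private NextT
{
    class FooContext : public NextT::ContextType
    {
        friend class BestFooHandler;
        int mFooUpdates = 0;  // private to BestFooHandler

    public:
        int fooUpdates() const { return mFooUpdates; }
    };
    using ContextType = FooContext;

    void handle(tags::OrderModified tag, OrderIdType orderId,
                OrderModifiedDetail const& detail, FooContext& ourContext)
    {
        ++ourContext.mFooUpdates;                        // mutate our slice
        NextT::handle(tag, orderId, detail, ourContext); // implicit upcast
    }
};

template<typename NextT>
struct BestBarHandler : private NextT
{
    class BarContext : public NextT::ContextType
    {
        friend class BestBarHandler;
        int mBarVolume = 0;  // private to BestBarHandler

    public:
        int barVolume() const { return mBarVolume; }
    };
    using ContextType = BarContext;

    void handle(tags::OrderModified tag, OrderIdType orderId,
                OrderModifiedDetail const& detail, BarContext& ourContext)
    {
        ourContext.mBarVolume += detail.delta;
        NextT::handle(tag, orderId, detail, ourContext);
    }
};

struct BestTerminalHandler
{
    struct ContextType { /* intentionally empty */ };

    void handle(tags::OrderModified, OrderIdType,
                OrderModifiedDetail const&, ContextType&) {}
};

using BestComposed =
    BestOrderLookupHandler<BestFooHandler<BestBarHandler<BestTerminalHandler>>>;
```

Each event costs exactly one hash lookup, and each order’s Foo and Bar state lives in a single FooContext object.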

 

Ta-daa! Now each handler has its own private context which it can mutate as it likes, without other handlers messing with it or reading its secret contents. What is more, this way is fast. All the contexts now exploit spatial locality: each order’s contexts sit side by side in a single object, making them very cache-friendly. So the cost of the memory lookup is very likely paid once and only once.

What is more, if we want to optimize the context-owning hash map then we only need to do it in one place. We could replace the SomeHashMap in the OrderLookupHandler with a faster hash map – or even an array – and everyone downstream would immediately benefit.
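As a sketch of that swap – assuming order IDs are small, dense integers, and with FlatArrayMap as an illustrative name – a growable array can stand in behind the same operator[] interface:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical drop-in for SomeHashMap when keys are small, dense
// integers: operator[] is an array index instead of a hash lookup.
template<typename Key, typename Value>
class FlatArrayMap
{
    std::vector<Value> mSlots;

public:
    Value& operator[](Key key)
    {
        auto const index = static_cast<std::size_t>(key);
        if (index >= mSlots.size())
            mSlots.resize(index + 1);  // grow on demand; gaps are value-initialized
        return mSlots[index];
    }
};
```

Because the lookup handler is the only component that names the map type, this is a one-line change for the whole chain.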

As we said earlier, low-latency programming has its own challenges. But these same challenges lead to different approaches and interesting solutions, and can be very rewarding for the programmer.

Thanks for reading.
