Simple Designs

This series of posts first examined the importance of the CME to Optiver’s trading activities and the implications of their iLink architecture upgrade. We next examined our high-level approach: combine a thorough understanding of the problem with a disciplined scientific approach. Today we want to dive deeper and examine how we strive for simplicity in our system design and architecture.

To start, we aim to socialize our designs within our technology group. We spend a lot of time seeking feedback from other engineers within our group. We discuss them in meetings. We spend time at whiteboards with traders, developers, and operations engineers. We conduct design reviews with a diverse group of senior engineers from all parts of our technology group. Our goal in all of this is to beat up a design as much as possible in the hopes of exposing any flaws and finding areas to improve.

In the last post I mentioned our CTO’s challenge to design our FPGA with only one risk limit check. One of our most tenured engineers came up with a very clever way of achieving this goal. However, when we exposed this to a broad group of senior engineers in whiteboard discussions and a design review, we arrived at a different consensus. It would be far simpler for both the clients of the FPGA and the FPGA itself to use two counters rather than one. The group determined that the added complexity in the FPGA design was small, and that the broader reduction in systemic complexity far outweighed it. We arrived at a simple design by starting from an extreme constraint, designing a solution, and making some iterative improvements via a larger group discussion about the benefits and drawbacks of various proposals.

Crucial to these discussions being healthy conversations and not a continual religious war is that we operate from a set of shared technical principles. As we stated in our introductory post, this does not mean we follow these principles dogmatically. But it does mean we start from the same place, and thus speak the same “technical language”. We still have strong, differing opinions on the best solution. But we always strive to learn from each other.

All of this raises the question: what are these technical principles? What follows are three principles which apply to many of our discussions.

No Future Proofing

In our last post we mentioned a couple of self-imposed constraints. Avoiding future proofing is a broad class of constraints we enforce on ourselves. Future proofing often results in a generic system, so we aim to avoid it. Our reason is that we have found that highly specific systems are easier to code, more robust, and generally simpler than generic systems that can solve a variety of problems. Some examples of this in practice:

  • Do not start with a generic protocol, especially if you only have one concrete use case.
  • Do not build in hooks to support multiple order types when you will only support one in your first release.
  • Avoid optional and nullable fields in protocols, objects, and databases.

In the case of the iLink upgrade, our very first FPGA deployment was merely a TCP passthrough. We then only solved for a subset of the trading signals, order types, and strategies that our old system traded. And in building our system we strictly avoided anticipatory abstractions, templates, and generic structures in favor of specific code tailored to the problem at hand.
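To make the guidelines in this section concrete, here is a minimal sketch of what a deliberately specific message might look like. It is an illustration only, not our production protocol: the `LimitOrder` name, its fields, and the wire layout are all hypothetical. The point it shows is one concrete message with a fixed layout, every field required, and no hooks for order types that do not exist yet.

```python
import struct
from dataclasses import dataclass

# Hypothetical fixed wire layout for the one order type in a first release:
# order id (u64), price in ticks (i64), quantity (u32), side (1 byte).
# "<" means little-endian with no padding, so the message is always 21 bytes.
_LIMIT_ORDER = struct.Struct("<QqIc")


@dataclass(frozen=True)
class LimitOrder:
    """One concrete message: every field is required, none is nullable."""

    order_id: int
    price_ticks: int
    quantity: int
    side: bytes  # b"B" for buy or b"S" for sell

    def encode(self) -> bytes:
        return _LIMIT_ORDER.pack(
            self.order_id, self.price_ticks, self.quantity, self.side
        )

    @classmethod
    def decode(cls, payload: bytes) -> "LimitOrder":
        # struct.error is raised if the payload is not exactly 21 bytes,
        # so a malformed message cannot be silently half-parsed.
        return cls(*_LIMIT_ORDER.unpack(payload))
```

Because there is exactly one layout, a payload of the wrong size fails immediately instead of being interpreted as some other variant. A second order type, if it is ever needed, would get its own equally specific message.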

Fail Hard and Loud

We initially code our systems to crash loudly when they encounter unexpected situations. We are even open to embracing single points of failure which will bring down a host of other applications when they fail. In addition to being easier to implement, this gives us deterministic behavior in the midst of unanticipated situations. We can leverage this environmental invariant to bring in “the humans”, who tend to be better than computers at handling vague and ambiguous situations, to intervene. We reinforce this principle by eschewing automatic restarts and keeping simple reset buttons out of the hands of our traders.

Our new FPGA-based system was no different. In fact, failures became even harder and louder as there were more dependencies, single points of failure, and cross-application state transitions than in our previous system. Leaning on a hard-fail mentality quickly surfaced problems. And because we had not future-proofed and had focused on a small, well-contained part of the trading problem, a hard failure in our new system was not as catastrophic to our overall trading activities as it might have been otherwise.
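The fail-hard idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not our actual code: the message type names, the `UnexpectedState` exception, and the `run` loop are all hypothetical. What it shows is the absence of a recovery path: anything outside the expected set is logged at the highest severity and terminates the process, leaving the restart decision to a person.

```python
import logging
import sys
from typing import Iterable

log = logging.getLogger("gateway")

# Hypothetical set of message types this process is designed to handle.
EXPECTED_MESSAGE_TYPES = frozenset({"ack", "fill", "reject"})


class UnexpectedState(Exception):
    """Raised when the process sees something it was not designed for."""


def handle(msg_type: str) -> None:
    if msg_type not in EXPECTED_MESSAGE_TYPES:
        # No fallback branch and no "ignore and continue": fail instead.
        raise UnexpectedState(f"unexpected message type: {msg_type!r}")
    # Normal processing for the expected types would go here.


def run(messages: Iterable[str]) -> int:
    """Process messages until the input ends; exit the process on surprise."""
    handled = 0
    try:
        for msg_type in messages:
            handle(msg_type)
            handled += 1
    except UnexpectedState:
        log.critical("halting: unexpected state", exc_info=True)
        # Non-zero exit with no retry loop; a human decides what happens next.
        sys.exit(1)
    return handled
```

The design choice worth noting is what is missing: there is no retry, no supervisor that restarts the process, and no attempt to skip the offending message. The behavior in an unanticipated situation is always the same, which is what makes it deterministic.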

No Tolerance for Errors

Closely related to the previous point, we have very little tolerance for errors in our system. We have long had an instinct that there is a lot of value to be gained by paying attention when your system logs an error, or more generally does something unexpected. As we have put this philosophy into practice over eight years, we have seen it consistently reveal larger problems than we expected.

  • No Errors in Logs: Diligently examining the errors in our logs has had a massive positive impact on our trading system. We have discovered misconfigurations, uncovered dead code, revealed opportunities for major architectural improvements, and stumbled upon strategic evolutions which improved our trading.
  • No Dropped Network Packets: Our continual fight against dropped packets has borne much fruit over the years. We have found bugs in network drivers, performance problems in desktop trading applications, idiosyncratic networking differences between allegedly similar versions of operating systems, and much more.
  • Deterministic Latency: Paying close attention to latency drifts has shown us everything from one-line bugs to important architectural limitations which only surfaced under specific, unanticipated use patterns.

At first glance this principle appears more operational than architectural. In fact, it is both. To have no tolerance for errors, you must first have a definition for how you expect your system to behave. So we make specific definitions of intent a first-class concept in system design. Which message types do you normally expect to receive from the exchange? What range of values do you expect in each field? What is the maximum size of your incoming data buffer and what does it mean when that buffer approaches being full? What is the expected shape and magnitude of a response latency graph?
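One way to picture a definition of intent is as a small, explicit data structure that the system checks itself against. The sketch below is a hypothetical illustration, not a description of our tooling: the `Intent` class, its fields, and the thresholds are all assumptions chosen to mirror the questions above (expected message types, expected field range, buffer size and what "approaching full" means).

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Intent:
    """Hypothetical statement of how the system is expected to behave."""

    message_types: FrozenSet[str]  # types we normally expect to receive
    min_quantity: int              # inclusive lower bound for the field
    max_quantity: int              # inclusive upper bound for the field
    buffer_capacity: int           # maximum size of the incoming buffer
    buffer_alert_level: int        # usage at which "approaching full" begins


def violations(
    intent: Intent, msg_type: str, quantity: int, buffer_used: int
) -> List[str]:
    """Return every way an observation departs from the stated intent."""
    found = []
    if msg_type not in intent.message_types:
        found.append(f"unexpected message type {msg_type!r}")
    if not intent.min_quantity <= quantity <= intent.max_quantity:
        found.append(f"quantity {quantity} outside expected range")
    if buffer_used >= intent.buffer_alert_level:
        found.append(
            f"buffer at {buffer_used} of {intent.buffer_capacity} bytes"
        )
    return found
```

In a system that follows the previous principle, a non-empty result would not be quietly recorded. It would be treated as an error to investigate, and in the strictest case as a reason to halt.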

When designing a new system, you have the opportunity to start with a clean slate and quickly learn from situations where reality does not match your intent. In our preparations we worked to define a number of expected behaviors in our new system. As we rolled out our system, we leveraged a lack of future proofing to narrow the scope of our problem and build tailored, specific solutions. We knew if there was a problem, our system would crash and we could begin investigating the situation immediately. We could then leverage both these facts to define myriad expected behaviors. Whenever those expectations were violated, we would learn from the violation and evolve our system accordingly. Generic, resilient solutions make this sort of approach nearly impossible. There are simply too many possibilities to enumerate, and the mechanics of monitoring for all those possibilities in a system that runs continuously require an overwhelming degree of sophistication.

When you start from these three simple principles, you are more likely to build a simply designed system. When you show that design to other smart engineers and genuinely seek their thoughts and input, you are more likely to produce a design others think is simple as well.

David Kent, Chief of Staff – Technology

David is a Stanford Computer Science alum and spent several years as a developer at Amazon.com. He joined Optiver as a Software Engineering Lead in 2009 and has led many of Optiver’s software development teams. He is presently Chief of Staff for the Optiver US Technology Group.
