Software Reliability at Optiver: Design
When errors occur in our environment, we aim to learn everything we can from them so that we can improve our processes, system, and culture. However, one common response to errors reveals a misunderstanding of where they come from: “Why wasn’t this caught in testing?”
Pointing to a lack of testing presupposes that the proximate problem is the one to be solved, and that exhaustive testing can guarantee it will not happen again.
Today we examine a different philosophy: architecture and design form the foundation of correctness, not testing.
Design correctness into software rather than test errors out of it
It is better to build quality into the system from the start than to “test errors out” at the end. Test-Driven Development techniques, which exemplify this mindset, require that tests be written before the code they exercise, which helps keep errors from entering the code in the first place. The tests and the code co-evolve at a very fine granularity: testing and thinking about correctness become part of the design and development process itself.
Perhaps the most important feature of this style of software engineering is that it tends to produce lightweight systems with simple designs. Simplicity of design is the most important factor in assuring the correctness of software. The reverse also holds: complex designs are the most error-prone, and they are also the designs that benefit the least from testing.
Architecture underpins correctness
We strive to build logical checks and protections into the software systems themselves. This mindset is best reflected by our automation risk controls, such as perimeter limits* and trade reconciliation**, components which give assurance that the system is working as intended, or at least within tolerable bounds. This way of thinking constitutes an architectural ideal, not a software development methodology. It encourages developers to build tools and systems whose sole purpose is to reduce our vulnerability to errors in other parts of the system.
A principal benefit of this approach is that it facilitates decoupling the parts that need to be correct from those that can tolerate some errors. In the ideal case, the latter parts are precisely those that we need to change frequently (e.g. the strategy-level logic); the former are the more static parts. Being static, they lend themselves to very rigorous reliability assurance techniques, and, being isolated, they allow us to focus our attention on the parts where correctness is paramount.
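As a rough illustration of this decoupling, the sketch below routes every order from frequently changed strategy code through a small, static gateway that enforces price and quantity limits. All names here (Order, Limits, OrderGateway, buggy_strategy) and the limit values are hypothetical and are not taken from Optiver's system; rate limits are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    symbol: str
    price: float
    quantity: int


@dataclass(frozen=True)
class Limits:
    # Hypothetical limits; a real system would configure these per instrument.
    min_price: float
    max_price: float
    max_quantity: int


class LimitBreach(Exception):
    """Raised when an order falls outside the configured limits."""


class OrderGateway:
    """Static, heavily tested part: every order must pass through here."""

    def __init__(self, limits: Limits, send):
        self._limits = limits
        self._send = send  # callable that actually transmits the order

    def submit(self, order: Order) -> None:
        # Checks run just before the order leaves the system.
        if not (self._limits.min_price <= order.price <= self._limits.max_price):
            raise LimitBreach(f"price {order.price} outside limits")
        if not (0 < order.quantity <= self._limits.max_quantity):
            raise LimitBreach(f"quantity {order.quantity} outside limits")
        self._send(order)


def buggy_strategy(gateway: OrderGateway) -> None:
    """Frequently changed part: a bug here produces an absurd quantity."""
    gateway.submit(Order("XYZ", price=101.0, quantity=1_000_000))
```

Under these assumptions, an error in the strategy code surfaces as a LimitBreach rather than as an order on the exchange, so the strategy can change often while the gateway stays small enough to test rigorously.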
* Perimeter Limits: These are a variety of price, quantity, and rate limits we check just before an order is sent from our system. These checks are contained in separate, heavily tested, tightly controlled libraries which are leveraged by our automated trading strategies.
** Trade Reconciliation: We compare trades booked by our automated trading system against trades reported by an independent source to check that our record of the trades we have made matches the exchange’s record.
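The reconciliation idea can be sketched as a comparison of two collections of trades. This is only a minimal illustration: the Trade fields and the reconcile function are assumptions made for the example, and a real comparison would need agreed identifiers and a tolerance-free price representation rather than floats.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Trade(NamedTuple):
    trade_id: str
    symbol: str
    price: float
    quantity: int


def reconcile(
    booked: Iterable[Trade], reported: Iterable[Trade]
) -> Tuple[List[Trade], List[Trade]]:
    """Return (missing_from_books, unknown_to_exchange).

    Counters are used instead of sets so that duplicate records on either
    side are also detected as mismatches.
    """
    booked_counts = Counter(booked)
    reported_counts = Counter(reported)
    # Counter subtraction keeps only positive counts.
    missing_from_books = list((reported_counts - booked_counts).elements())
    unknown_to_exchange = list((booked_counts - reported_counts).elements())
    return missing_from_books, unknown_to_exchange
```

Both returned lists being empty means the two records agree; anything else is a discrepancy for a person or a downstream control to investigate.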
A great example from our system’s evolution can be seen in “trade booking”, the process we use to guarantee that every trade we make is recorded. Ensuring correctness and reliability in trade booking is a key part of our foundation for protecting against automation risk. Numerous examples of catastrophic trading losses stem from trading firms being unaware of trades made by their automated systems.
Our system was originally designed such that a trading application would not be able to execute until the “trade booking” component had connected to it. To use client-server parlance, the trading application was the server accepting a connection from its client, the trade booking component. This design, an artifact of the evolutionary history of the system, made it difficult to guarantee that every trade made by our system was recorded. The trading system needed to have, as a normal, expected state of operation, one in which it was running but unable to trade because its client had not yet connected. This requirement for an additional state was at odds with the purpose of the system, which is to submit orders and make trades. While this design could be tested for correctness, it made the system unnecessarily complex.
We decided to simplify by inverting the client-server relationship. If the trading application (now a client) could not establish an initial connection to the trade booking component (now a server), it would crash. If the trade booking component disconnected, the trading application would also crash. If an error occurred while sending a trade to the trade booking component, again, the trading application would crash. In short, the system was designed to run only if it could book trades.
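A minimal sketch of this fail-fast behaviour follows. It is not Optiver's implementation: TradingApplication, connect_to_booking, and the send method are hypothetical names, and the booking connection is modelled as any object whose send raises ConnectionError on failure.

```python
import sys


class TradingApplication:
    """Runs only while it can book trades; any booking failure is fatal."""

    def __init__(self, connect_to_booking):
        # connect_to_booking is a hypothetical callable that returns an
        # object with a send(trade) method, or raises ConnectionError.
        try:
            self._booking = connect_to_booking()
        except ConnectionError as exc:
            # No "running but unable to trade" state: refuse to start.
            sys.exit(f"cannot reach trade booking at startup: {exc}")

    def on_trade(self, trade) -> None:
        try:
            self._booking.send(trade)
        except ConnectionError as exc:
            # A dropped connection and a failed send are treated alike.
            sys.exit(f"failed to book trade, shutting down: {exc}")
```

Because sys.exit raises SystemExit, which ordinary `except Exception` handlers do not catch, the failure propagates up and terminates the process. In this sketch a disconnect is only noticed on the next send; detecting it while idle (for example with a heartbeat) is left out and would be an additional assumption.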
This change eliminated needless complexity, reduced the number of states to be tested, and provided guarantees about trade booking, thus improving correctness and reliability. Any potential damage due to a system error is now contained, deterministic, and made known to the operations team immediately. In the case of a system error, at most we will drop a limited number of trades before the system shuts down, and a human-driven process is then used to recover.
A final note: this design decision is only effective in practice because we ensure it is universally applied throughout Optiver’s systems. The key to that universal application is the subject of our next post: people.
David Kent, Chief of Staff – Technology
David is a Stanford Computer Science alum and spent several years as a developer at Amazon.com. He joined Optiver as a Software Engineering Lead in 2009 and has led many of Optiver’s software development teams. He is presently Chief of Staff for the Optiver US Technology Group.