Software Reliability at Optiver: Strategy
The first two software reliability concepts we examine concern our overall strategy for pursuing reliability. The foundation of that strategy is prudence and wisdom over brainless rule-following. No single methodology, set of best practices, or regulatory rule can guarantee correctness, and none is appropriate in all situations.
Not all errors are created equal
The extent to which errors in different parts of the software ecosystem are liable to expose the firm to automation risk varies significantly. For example, errors in the order management logic of an automated trading system are far more likely to be dangerous than errors in a desktop tool displaying market turnover figures to the trader. We therefore explicitly admit a varied appetite for errors in our systems. This means we need not adopt the same strategy for assuring software reliability across all software components.
To be clear, admitting a tolerance for some errors does not open the door to a laissez-faire attitude or to sloppy software development. Rather, it allows Technology leadership to focus time and resources on assuring the reliability of the most critical software components. Nor does it mean a complete absence of reliability concerns in the remaining software systems: it merely lets us adopt different practices there that better balance rapid innovation against correctness.
Testing is just part of a software reliability strategy
Especially in light of new rules and guidelines emerging from regulatory bodies, software testing is receiving a good deal of attention. We acknowledge the importance of testing, but we take the firm view that it is just one of a number of angles of attack on the software reliability problem. Building reliable and robust software has always been a challenge in the industry, and to date no known method provides firm guarantees about correctness. In particular, given the many failed software projects, some of them very prominent, software testing methodologies cannot claim to be uniformly successful. In short, software reliability is not a solved problem: both industry and academia are still trying to tackle it.
Nevertheless, we acknowledge that modern best practices are available, and we seek always to use them to guide our own thinking. We stress the continuity of that thinking: best practices continue to evolve, and our approaches to quality assurance in general, and to testing specifically, will change accordingly.
A great example of these principles comes from our Automated Trading Systems (ATS) engineering teams. A few years ago we were in the midst of releasing some new trading applications. We made a few successive releases with small bugs in our risk-checking code.
Our risk-checking code is one of the most critical components in our system. Bugs in this part of the system are taken very seriously, and this spate of incidents was quite concerning.
To deal with this problem we took a very aggressive approach:
- For every trading strategy the ATS engineers created a list of all functionality related to risk checks. This list was then vetted by our Technical Operations team to ensure it covered the areas they expected.
- For every release, no matter how small or insignificant, functionality in this list would be tested by ATS engineers.
- The testing needed to be “end-to-end” with a production-ready version of the binary. Unit tests or tests which “mocked” functionality of the application with stub code were not sufficient.
- Each release would include an attestation that this testing was performed.
- The Technical Operations team checked the attestation to ensure the testing was completed prior to each release.
- In addition to pre-release testing, our Technical Operations team “fire drills” risk-related functionality in production to see it working in the real world. They do this on a regular basis, and especially when major functional changes are made to our systems. This gives us more assurance that the functionality of our risk-checking limits system continues to behave as expected.
- Finally, every year Optiver’s offices perform a “peer assessment”, in which a group of engineers travels to each Optiver location to perform a deep dive analysis of selected aspects of the office’s risk procedures.
As you can see, testing is only one part of a broad, company-wide reliability strategy around risk limits.
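As an illustration only, the pre-release part of this process (a vetted list of risk-check functionality, end-to-end testing, and an attestation checked before release) could be encoded as a simple gate. The Python sketch below is not Optiver's actual tooling: the `Attestation` record, the function names, and the example risk-check names are all hypothetical, chosen just to show the shape of the check.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Attestation:
    """Hypothetical record stating which risk checks were tested for a release."""
    release_id: str
    tested_items: frozenset[str]
    end_to_end: bool  # True only if run against a production-ready binary, without mocks


def missing_risk_checks(required: set[str], attestation: Attestation) -> set[str]:
    """Return the vetted risk-check items that the attestation does not cover."""
    return set(required) - set(attestation.tested_items)


def release_allowed(release_id: str, required: set[str],
                    attestation: Attestation | None) -> bool:
    """Allow a release only if a matching, end-to-end attestation covers every item."""
    if attestation is None:
        return False  # no attestation at all
    if attestation.release_id != release_id:
        return False  # attestation belongs to a different release
    if not attestation.end_to_end:
        return False  # unit tests or mocked tests are not sufficient
    return not missing_risk_checks(required, attestation)


# Hypothetical vetted list for one trading strategy.
required = {"max_order_size", "max_position", "price_band"}

complete = Attestation("r-101", frozenset(required), end_to_end=True)
partial = Attestation("r-102", frozenset({"max_order_size"}), end_to_end=True)
mocked = Attestation("r-103", frozenset(required), end_to_end=False)
```

In this sketch a release passes only when the attestation names the same release, was produced by end-to-end testing, and covers the whole vetted list, regardless of how small the release is. The production "fire drills" and the yearly peer assessment sit outside such a gate and are not modelled here.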
Our approach to software reliability is quite different, however, in other areas such as our user interfaces. Because there is little, if any, automated trading risk in these systems, our strategy is more accepting of errors. We still adhere to our overall development principles and practice a variety of disciplines such as TDD, code reviews, and beta testing. But the primary risk of error in these systems is loss of productivity, and thus the reliability strategy can be less all-encompassing.
David Kent, Chief of Staff – Technology
David is a Stanford Computer Science alum and spent several years as a developer at Amazon.com. He joined Optiver as a Software Engineering Lead in 2009 and has led many of Optiver’s software development teams. He is presently Chief of Staff for the Optiver US Technology Group.