Software Reliability at Optiver: People
A “perfect” process and pithy principles only go so far. Software reliability must be lived and breathed by the entire organization, especially in a complex endeavor like software engineering in a dynamic environment like financial technology. It cannot be encoded in a strict process or delegated to a single team.
The software development team is responsible for reliability
Assuring the reliability of software is best achieved by injecting reliability concerns directly into the software development process itself, not by establishing software verification as an isolated step at the end of the development cycle. This is because software reliability and system design are inextricably linked; decoupling the two fundamentally undermines the former. As a result, we believe that assuring software reliability is the responsibility of the development team, not that of a distinct person, team or role.
Creating a distinct Quality Assurance or Software Tester role (or team) devalues the input that developers can provide. They are the people who best understand what can go wrong with a system and how errors might arise.
Separating development from testing roles also undermines the importance that developers ought to place on correctness. This runs contrary to one of an engineer’s core obligations, which is to ensure the system behaves correctly. It also opens the door to designs that are both error-prone and less amenable to rigorous testing.
One of the more common reasons for establishing a separate testing team or testing role is that a second and independent person thinking about and checking a system can improve the chances of finding problems. While we acknowledge that the involvement of more than one person can improve software quality, we feel this is best achieved at the design and build stage, and by means of intensive code and design review – from other developers.
Culture matters
We depend critically on a team process and an office culture that understands and facilitates continued improvement on the software reliability front. This applies not just to the development team, but to other stakeholders such as traders, operations engineers, and other users of our software. Those stakeholders need to understand the importance of software reliability concerns, and know those concerns are an integral part of the development process. Indeed, they need to encourage and proactively support software engineers in this area.
In our day-to-day work we try to record every unexpected occurrence in our environment as a “Production Event”. These run the gamut from a bad keyboard to a major outage of our trading system. We dive into each of these incidents and try to understand the root cause. We consciously avoid chalking events up to bad luck, a perfect storm, or user error.
A key element of this process is that every person involved must have high expectations for our system’s performance and the operational excellence of all who use, run, and maintain it.
As an example, one of our operations engineers was recently surprised when a normally benign user error caused more damage than expected. She dug into the full timeline of events in our system and realized this user error was only possible because of a design flaw. She suggested a design change to our software engineers and asked probing questions about its implications. In response, our software engineers researched the answers to her questions and found her proposal was a positive step forward. It would make our system safer with no functional downside, so they went ahead and implemented it.
A culture of correctness must permeate the entire organization for this approach to software reliability to succeed. In this case, the operations engineer thought about how the system should work, realized it was more broken than it appeared, proposed a change, and pushed the software engineers to respond. The software engineers recognized that good system design ideas and questions can come from anywhere within the organization, respected the opinion of their colleague, and changed the system accordingly.
David Kent, Chief of Staff – Technology
David is a Stanford Computer Science alum and spent several years as a developer at Amazon.com. He joined Optiver as a Software Engineering Lead in 2009 and has led many of Optiver’s software development teams. He is presently Chief of Staff for the Optiver US Technology Group.