Engineering the Agentic SDLC

Almost everything has to change. Specs, tools, review, observability, even the way code is structured. The agent is a new kind of contributor, and the system around it must be redesigned with that in mind. We don’t expect every detail of this approach to last, but the underlying constraints have been more durable than the tooling itself.

Rethinking the SDLC for agents

The traditional SDLC assumes that a human is driving it. The natural first step with AI tools is to accelerate that existing human-driven process.

An agentic SDLC starts from a different assumption: agents do most of the work, and humans guide and refine. That changes requirements across the workflow. We found ourselves tightening specifications, exposing more of our tooling programmatically, and pushing failures into places where they could be detected automatically. Reviews became more focused on architecture and design decisions, while observability became part of the agent's feedback loop as much as the engineer's.

The framework below maps this into three layers: SDLC workflows (development, review, testing, rollout, operations), the agent primitives they depend on (context, harness), and the shared substrate underneath (measurement, execution, orchestration, governance).

The pressure test: connecting to a new exchange

We wanted to see whether these ideas held up in engineering work that already had well-understood constraints, so we picked one of the most stubborn but important pieces of work in our trading stack: exchange connectivity.

Every venue needs a session-management component that handles login, heartbeats and communication with the rest of our stack. The work is repetitive, but the details vary from venue to venue, and small protocol differences can be costly to get wrong. We take on that effort every time we connect to a new exchange. Our target was to get most of it done by an agent in a fraction of the time.

75% reduction in time to market
85% reduction in engineering effort
~90% review-ready output

Early attempts were rough. A generic coding agent got us halfway. It couldn’t understand the component, couldn’t reliably drive our test tools, couldn’t see inside parts of the system that mattered, and couldn’t tell when its own code was wrong. So it filled the gaps with workarounds that should never have made it past review.

Most of the failures came from gaps in the surrounding workflow rather than from the model itself.

Four lessons from getting it to work

Across many end-to-end runs, the same four themes kept showing up.

01 Golden paths are engineered

"If you want a working end-to-end agentic pipeline you must reduce variance by building and engineering the golden path. There should be no luck involved."

02 Context drives outcomes

"AGENTS.md files are not the only context. If you want an agent to do deterministic work, you must provide context around the whole problem statement, not just small parts of it."

03 Backpressure reduces variance

"Lack of backpressure lets mistakes, bugs, and bad design go unchecked, and compounds the mistakes agents make as they work independently."

04 Tooling has to be agent-friendly

"Lots of our tooling is built for humans but doesn’t work very well for agents. Design for both humans and agents."

Why golden paths matter

You can’t hope the agent figures it out. Every step has to be either the obvious next move or explicitly documented as the next move. We rewrote parts of the component to remove silent defaults that turned into runtime errors, tightened the spec so the order of operations was unambiguous, and built an orchestration framework that runs the agent through phases with verification gates between each one, injecting only the context relevant to the current phase.
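One way to picture phases with verification gates is the minimal sketch below. It is an illustration under stated assumptions, not the orchestration framework described above: the names `Phase`, `GateFailed`, and `run_pipeline`, the phase names, and the context keys are all hypothetical, and the agent's work is stubbed with plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Phase:
    name: str
    context_keys: List[str]                # the only context this phase is shown
    run: Callable[[Dict[str, str]], Dict]  # stand-in for the agent's work in this phase
    gate: Callable[[Dict], bool]           # deterministic check on the deliverables


class GateFailed(Exception):
    """Raised when a phase's deliverables do not pass verification."""


def run_pipeline(phases: List[Phase], context: Dict[str, str]) -> List[str]:
    """Run phases in order and stop at the first failed gate."""
    completed: List[str] = []
    for phase in phases:
        # Inject only the context relevant to the current phase.
        scoped = {key: context[key] for key in phase.context_keys}
        deliverables = phase.run(scoped)
        if not phase.gate(deliverables):
            raise GateFailed(f"phase {phase.name!r} did not pass its gate")
        completed.append(phase.name)
    return completed


# Hypothetical two-phase run with stubbed agent steps.
context = {"spec": "login then heartbeat", "test_guide": "replay captures"}
phases = [
    Phase("implement", ["spec"],
          run=lambda ctx: {"built": True},
          gate=lambda d: d["built"]),
    Phase("test", ["spec", "test_guide"],
          run=lambda ctx: {"tests_passed": True},
          gate=lambda d: d["tests_passed"]),
]
print(run_pipeline(phases, context))  # prints ['implement', 'test']
```

The design choice worth noting is that the gate is ordinary code that inspects deliverables, so a phase cannot advance on the strength of the agent's own report.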

What context actually means

In practice, the agent needed context in three areas: the system (what the component does, what its constraints are), the domain (how the external environment actually behaves, captured from real traffic), and the approach (how we tackle this kind of problem, how we test it, what good looks like). Leaving out any one of them produces its own distinct failure mode.
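A small sketch of how the three areas could be made explicit, so that a gap is caught before a run starts rather than discovered through its failure mode. The `AgentContext` class and `missing_areas` helper are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class AgentContext:
    system: str = ""    # what the component does and what its constraints are
    domain: str = ""    # how the external environment behaves, from real traffic
    approach: str = ""  # how the problem is tackled and tested, what good looks like


def missing_areas(ctx: AgentContext) -> List[str]:
    """Return the names of context areas that are empty or whitespace-only."""
    return [f.name for f in fields(ctx) if not getattr(ctx, f.name).strip()]


ctx = AgentContext(system="Session manager: login, heartbeats, internal routing.")
print(missing_areas(ctx))  # prints ['domain', 'approach']
```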

Domain context was the most interesting. External documentation tells you what messages exist, not what happens when you send them. Tooling that lets an agent probe the live system and capture real traffic gave us a way to ground the agent in reality rather than in inference.

Backpressure as a design discipline

The agent will tell you it succeeded even when it hasn’t. The fix is to take that judgement away from it. We added application tests built from real captured traffic, made it possible to run end-to-end scenarios against a non-production environment for a clean pass/fail, and built the orchestration framework to verify each phase’s deliverables in code.

The result: the agent is free-form on the creative parts (protocol, implementation, edge cases) and tightly constrained on the deterministic parts (did it build, did tests pass, did the scenario succeed).
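As a rough sketch of that split, the example below replays a captured exchange against a candidate handler and computes the verdict in code, ignoring whatever the agent claims. Everything here is assumed for illustration: the message names are placeholders rather than a real protocol, and `replay`, `phase_passed`, and `candidate_handler` are hypothetical.

```python
from typing import Callable, List, Tuple

# (inbound message, expected outbound reply) pairs derived from captured traffic.
# The message names are hypothetical placeholders, not a real protocol.
Capture = List[Tuple[str, str]]

CAPTURED_SESSION: Capture = [
    ("LOGON_REQUEST", "LOGON_ACK"),
    ("TEST_REQUEST", "HEARTBEAT"),
    ("LOGOUT", "LOGOUT_ACK"),
]


def replay(handler: Callable[[str], str], capture: Capture) -> List[str]:
    """Feed each captured message to the handler and collect mismatches."""
    failures: List[str] = []
    for step, (inbound, expected) in enumerate(capture):
        actual = handler(inbound)
        if actual != expected:
            failures.append(
                f"step {step}: sent {inbound!r}, expected {expected!r}, got {actual!r}"
            )
    return failures


def phase_passed(agent_reported_success: bool, failures: List[str]) -> bool:
    # The agent's own report is accepted but deliberately ignored:
    # only the replay result decides pass or fail.
    return len(failures) == 0


def candidate_handler(message: str) -> str:
    """Stand-in for agent-written code; it forgets to handle LOGOUT."""
    replies = {"LOGON_REQUEST": "LOGON_ACK", "TEST_REQUEST": "HEARTBEAT"}
    return replies.get(message, "REJECT")


failures = replay(candidate_handler, CAPTURED_SESSION)
print(phase_passed(agent_reported_success=True, failures=failures))  # prints False
```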

Designing tools for agents and humans

Many internal tools assumed a human at a terminal. Interactive log viewers, per-checkout config, manual host reservation. None of it really works with an agent. We’ve started rebuilding so agents can drive things with simple, non-interactive commands, and we’re treating “Claude can use this without help” as a requirement.

The upside: tools that are good for agents are usually better for humans too. Standardized configs, fewer prompts, pass/fail scenario commands.
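A pass/fail scenario command might look something like the sketch below: no prompts, one machine-readable line of output, and a return value intended to become the process exit code. The scenario names and the `SCENARIOS` registry are hypothetical stubs, not our actual tooling.

```python
import argparse
import json
from typing import Callable, Dict, List, Optional

# Hypothetical scenario registry; each entry returns True on success.
SCENARIOS: Dict[str, Callable[[], bool]] = {
    "login": lambda: True,
    "heartbeat_timeout": lambda: False,
}


def main(argv: Optional[List[str]] = None) -> int:
    """Run one named scenario without prompting; return 0 on pass, 1 on fail."""
    parser = argparse.ArgumentParser(description="Run one scenario non-interactively.")
    parser.add_argument("scenario", choices=sorted(SCENARIOS))
    args = parser.parse_args(argv)

    passed = SCENARIOS[args.scenario]()
    # One JSON line on stdout so an agent or a script can parse the result.
    print(json.dumps({"scenario": args.scenario, "passed": passed}))
    return 0 if passed else 1


# A real entry point would end with: sys.exit(main())
```

Calling `main(["login"])` returns 0 and `main(["heartbeat_timeout"])` returns 1, which gives both an agent and a human at a shell the same unambiguous signal.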

Where this goes next

Three directions we’re exploring:

Generalize the pattern. The orchestration framework that drove this project was designed to be general, so the next step is to apply it to other long-horizon, structured engineering problems.

Invest in the substrate. Better evals to compare iterations, isolated execution environments for safer experiments, and governance (identity, audit, policy) to grow autonomy without losing oversight.

Keep humans in the right loops. The point isn’t to remove human judgement; it’s to put it in the right places (architectural review, risk calls, novel domains) by design rather than by accident.

The exchange-connectivity project exposed constraints that had little to do with code generation itself. The quality of the outcome depended heavily on context, tooling and verification, and improving those areas had as much impact as improving the model. Because exchange connectivity is work we take on repeatedly, the impact of those changes became obvious quickly. The agent was one part of the system; a lot of the progress came from making the surrounding workflow easier to test and validate.

“Golden paths aren’t found, they’re engineered.”
