What Building a Multi-Agent System Taught Me About Autonomous Software Development

In August 2022, before I was using AI this way, I wrote: “Computers do what we tell them to. We’re just not always clear with what we’re asking for.”

Building with AI has only made that more obvious.

In fact, it made something else clear too: even when we think we are being clear, there is still a lot of room for a system to interpret the request in ways that are technically valid but wrong for the product.

That became obvious in the first iteration of a multi-agent system I built to turn project requests into working software. It took a project definition and moved it through a series of stages: clarification, planning, design, implementation, and validation. The idea was straightforward enough. Instead of asking one agent to do everything, the system separated responsibilities across roles and tried to enforce a more structured path from request to output.
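The staged structure described above can be sketched in a few lines. This is a hedged, minimal illustration of the idea, not the system's actual code; the stage names mirror the article, while `Artifact`, `run_stage`, and `run_pipeline` are hypothetical names I am introducing for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Artifact:
    """What one stage hands to the next: its role plus its output."""
    stage: str
    output: str

STAGES = ["clarification", "planning", "design", "implementation", "validation"]

def run_stage(name: str, request: str, prior: "Artifact | None") -> Artifact:
    # Placeholder: in the real system each stage would be a separate agent
    # with its own role prompt, working from the prior stage's artifact.
    context = prior.output if prior else request
    return Artifact(stage=name, output=f"{name} result based on: {context}")

def run_pipeline(request: str) -> list[Artifact]:
    artifacts: list[Artifact] = []
    prior = None
    for name in STAGES:
        prior = run_stage(name, request, prior)
        artifacts.append(prior)
    return artifacts
```

The point of the structure is that each role only sees a request filtered through the stages before it, which is exactly why the early stages matter so much.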

What I learned from building it was not that multi-agent systems are inherently better. It was that most of the hard problems show up before code generation becomes the issue.

The real challenge is not getting a model to produce code. It is getting a system to interpret a request well, clarify the right things early enough, and prove that each stage has produced something the next stage can actually trust.

Detailed input is not enough

One of the first things I learned is that a detailed project definition does not guarantee a well-shaped product.

The first version of the system started with a fairly in-depth project definition. On paper, that should have reduced ambiguity. In practice, it didn’t. The system asked follow-up questions, but not to the depth needed to surface the real product decisions. It clarified some things, but it did not get to the level of emphasis and intent required to shape the product well.

Given all the preamble, the results were not entirely surprising. The system technically hit the goals, but the product was poorly designed.

The primary elements were not brought to the user’s attention clearly. Instead of focusing the interface around what mattered most, it surfaced too much detailed information. It was as if the model’s priorities were inverted. Rather than organizing the experience around outcomes, it pushed detail to the foreground and left the important parts underemphasized.

That mattered because the failure was not in code generation. The code reflected the system’s interpretation. The problem was that the interpretation was wrong in the ways that mattered most.

What building with AI made much more obvious is how quickly weak definition turns into visible product failure. In a traditional workflow, that pain might take weeks or months to surface. With AI, it shows up within hours, sometimes minutes. The system starts building on your ambiguity immediately. It hardens half-formed priorities into interface choices, workflow decisions, and implementation direction. The issue is not new. The speed is.

A system can have a detailed input and still miss the product.

Product definition is where the system needs to slow down

This is where the PM role became much more important in practice.

A project definition was required from the start. That was not enough. What changed the quality of the system was how much work it did before treating that definition as sufficient.

When the agent moved too quickly past this stage, downstream phases inherited unresolved assumptions about priority, intent, and what the product was actually supposed to optimize for. When it spent more time here asking better questions, surfacing missing context, and making its interpretation visible, the rest of the pipeline got stronger.

That is what the PM role was really doing. Not just collecting inputs, but forcing clarity before architecture and implementation started building on top of incomplete thinking.

If that stage is weak, every later stage pays for it. The architect designs around assumptions that should have been resolved earlier. The coding stage turns those assumptions into artifacts. QA ends up trying to catch problems that were already embedded upstream.

The sooner the system surfaces missing context and tests intent, the less waste it creates.
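One way to make that "slow down" concrete is a gate that refuses to treat the project definition as sufficient while key fields or answers are missing. This is a rough sketch under my own assumptions; the field names (`primary_user`, `primary_outcome`, `priorities`, `open_questions`) are illustrative, not the system's actual schema.

```python
def definition_is_sufficient(definition: dict) -> bool:
    # Hypothetical PM-stage gate: block the handoff to planning until
    # the definition covers intent and no open questions remain.
    required = ("primary_user", "primary_outcome", "priorities")
    if any(not definition.get(key) for key in required):
        return False  # missing context: keep asking, do not plan yet
    return not definition.get("open_questions")  # unresolved intent blocks handoff
```

The check itself is trivial; the value is in where it sits, between the PM stage and everything downstream that would otherwise build on incomplete thinking.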

A single loop is good for interactive development. It is not enough for autonomous development.

One thing building this system clarified for me is that a single conversational loop can work very well for interactive development.

That is how I often work myself. I go back and forth with ChatGPT when I am thinking through product and architecture. I use that space to stay at the right level of abstraction, challenge decisions, and shape the system before implementation starts. Then I generate prompts for Claude or Cursor to do the actual work. During development, those tools can confirm the direction, expose issues, or push back on decisions that do not hold up once they meet the code.

Product and architecture discussions stay focused on intent, structure, and tradeoffs. Implementation happens in a different context, where the job is to build against those decisions rather than redefine them accidentally. That keeps the process more disciplined and makes it easier to see when something has drifted.

The same distinction matters even more if the goal is autonomy.

A single loop can be effective when a human is continuously present to clarify, redirect, and correct. But in an autonomous system, that same structure becomes much weaker. Planning, design, coding, and validation start collapsing into one stream. Product assumptions get mixed with implementation decisions. Validation becomes reactive. The system can keep moving, but it becomes harder to tell whether it is still building the right thing or just building coherently.

To get to autonomy without losing control, the separation still has to exist. The system needs specialized roles that can validate input, push back when something is underdefined or inconsistent, and produce outputs that are actually ready for the next stage. That structure keeps the system focused. It prevents implementation momentum from quietly taking over product and architecture decisions. And it creates a more trustworthy path from request to result.

The value of multiple agents is not that there are more of them. It’s that they preserve the boundaries that interactive development already relies on.

Stages need handoff criteria, not just roles

The first iteration also made something else clear: assigning roles is not enough.

At every transition, the system needs to answer two questions: what counts as done here, and what evidence makes this ready for the next stage?

Without that, the pipeline looks structured but behaves loosely. Work moves forward because something was produced, not because it is actually sufficient.

I saw that when the coding agent wrote code but did not build the system. That should have been a clear failure at the boundary between implementation and QA. Writing code and producing a buildable artifact are not the same outcome. But the next stage let it pass.

That exposed the deeper issue. The problem was not just that QA missed something. The problem was that the contract between stages was weak.

If one stage does not know exactly what it is receiving, what it is responsible for checking, and what “ready” means, the system is not controlled. It is optimistic.

That is why clear roles are not enough. Each stage needs explicit expectations, explicit outputs, and explicit evidence for handoff. Otherwise the structure is mostly cosmetic.
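A handoff contract like that can be expressed directly in code. Here is a hedged sketch of the implementation-to-QA boundary from the build failure above, assuming a generic `build_command`; the `Handoff` type and function names are mine, introduced only for illustration.

```python
import subprocess
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """Explicit output of a stage plus the evidence that it is ready."""
    stage: str
    evidence: dict = field(default_factory=dict)

def implementation_handoff(build_command: list[str]) -> Handoff:
    # The contract: "wrote code" is not done; "the artifact builds" is.
    result = subprocess.run(build_command, capture_output=True, text=True)
    evidence = {"build_exit_code": result.returncode}
    if result.returncode != 0:
        raise ValueError(f"not ready for QA: build failed ({result.returncode})")
    return Handoff(stage="implementation", evidence=evidence)
```

With a gate like this, the failure I hit, code that was written but never built, cannot silently pass the boundary: the handoff either carries evidence of a successful build or does not happen.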

What building this changed for me

Going into this, I expected the first iteration to be a bit of a horror show. I thought it would get things so wrong that it would mostly be useful as a catalog of failure. It was not good, but it was not horrible either.

What surprised me was that I could still see the product through the mistakes. The system made enough poor decisions to prove that the structure needed work, but it was directionally close enough that I could see the path forward.

The product I had in mind, which seemed clear on paper but turned out to require much better clarification in practice, no longer felt out of reach. It felt like something a multi-agent system could produce, if the system got better. That is why I am already building the next version. I’ll let you know how it turns out.
