Designing an Effective Multi-Agent System: A Hierarchical Two-Pizza Approach

January 28, 2025

The more time I spend building and optimizing multi-agent systems, the more convinced I am that human organizational effectiveness principles still apply even though the agents are artificial.

A few of the principles that I've found to be especially critical are:

  1. The manager owns the outcome: Create a manager agent that's responsible for achieving the ultimate outcome for the team. The manager agent should be able to delegate tasks to other agents, evaluate their performance, and coordinate the overall outcome.
  2. Keep the team small, with a single-threaded manager agent (The Two-Pizza Rule): If your outcome requires collaboration from more than ~7 AI agents, you need to break it into smaller chunks.
  3. Show me the incentive and I'll show you the outcome: Incentivize your manager agent to achieve the best possible version of the outcome, not just to complete the task.
  4. Limit external dependencies: If your system only works with a specific framework or platform, you're limiting your future scale and your ability to productionize your agents.

Note: This post presumes a general awareness of agents, tool calling, and LLMs. If you're not familiar with these concepts, I recommend reading this post first.


[Image: AI agents having pizza]
The Two-Pizza Team concept: If you can't feed the team with two pizzas, it's too big

Let's dive into each of these principles and why they are important in multi-agent systems. Each section includes a rule of thumb that can help you avoid mistakes I've made in the past when building multi-agent systems. We'll use a representative example of a multi-agent system that is tasked with creating a preliminary investment memo for a potential private equity investment.

1. The Manager Owns the Outcome

There are two types of multi-agent system approaches, and CrewAI does a nice job of explaining them here.

Think of a sequential multi-agent system as an assembly line, where each agent is responsible for a specific step in the process, and then passes the output to the next agent. A hierarchical multi-agent system, on the other hand, puts the manager agent in charge of delegating tasks to specialized worker agents and coordinating the overall outcome.

[Image: manager agent with multiple worker agents]
Example of a hierarchical multi-agent system, with agents in blue and tool calls in purple

I primarily focus on automation of human tasks, and for these, I've found that the hierarchical manager / worker paradigm is the most effective. It enables a more dynamic and adaptable process, where the manager can make strategic decisions about the process and even prompt the worker agents to fix or re-calibrate their output if needed.

The manager agent acts as the orchestrator of the process, responsible for:

  • Delegating tasks to specialized worker agents
  • Ensuring quality of the final output
  • Coordinating between different agents
  • Making strategic decisions about the process, especially when to re-prompt a worker agent to fix or re-calibrate their output
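
Concretely, this orchestration can be a plain tool-calling loop. Here's a minimal sketch using the OpenAI Python SDK; the tool names (delegate_to_worker, finalize_memo), the worker roster, and the run_worker helper are illustrative assumptions, not a prescribed API. An escalate_to_human tool would sit alongside these; it's omitted for brevity.

```python
import json

from openai import OpenAI

client = OpenAI()

# The manager's tools: delegate work or submit the final memo.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "delegate_to_worker",
            "description": "Send a task to a specialized worker agent and return its result.",
            "parameters": {
                "type": "object",
                "properties": {
                    "worker": {
                        "type": "string",
                        "enum": ["content_writer", "financial_modeler",
                                 "data_analyst", "industry_analyst"],
                    },
                    "task": {"type": "string"},
                },
                "required": ["worker", "task"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "finalize_memo",
            "description": "Submit the completed memo. Only call this when quality is acceptable.",
            "parameters": {
                "type": "object",
                "properties": {"memo": {"type": "string"}},
                "required": ["memo"],
            },
        },
    },
]

def run_worker(worker: str, task: str) -> str:
    # Placeholder: each worker runs its own sub-conversation (sketched later in this post).
    return f"[{worker} output for: {task}]"

def run_manager(objective: str) -> str:
    messages = [
        {"role": "system", "content": (
            "You manage a team writing a preliminary investment memo. "
            "Delegate tasks, review the results, and re-delegate with "
            "feedback until the memo meets the bar, then finalize."
        )},
        {"role": "user", "content": objective},
    ]
    while True:
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS
        )
        msg = response.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content or ""  # manager answered directly; treat as final
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            if call.function.name == "finalize_memo":
                return args["memo"]  # the manager decides when the process is complete
            result = run_worker(args["worker"], args["task"])
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": result}
            )
```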

Let's analyze the different approaches through the lens of the investment memo example.

Sequential Approach

  • Each agent is incentivized to complete their task and hand off the output to the next agent.
  • If an agent's output is not good enough (e.g. if it is missing critical context), there's no check or balance that can re-prompt them to fix or improve it.
  • The last agent in the sequence (the one responsible for the "final output" tool call) is at the mercy of the information provided by the previous agents, without recourse.
  • If agent 3 in the sequence doesn't get the context needed to accomplish their task, they will have to make a guess and hope for the best. This often leads to hallucinations and incorrect outputs.
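
For contrast, here's the sequential shape reduced to its essence: a one-way fold over the agents. run_agent stands in for any single-agent call and is hypothetical.

```python
def run_agent(agent: str, context: str) -> str:
    # Stands in for any single-agent call: one prompt in, one output back.
    return f"[{agent} output]"

def run_pipeline(agents: list[str], initial_context: str) -> str:
    """Assembly-line handoff: each agent sees what came before, but nothing
    downstream can ever send work back upstream for a fix."""
    context = initial_context
    for agent in agents:
        # If this output is missing context or wrong, the error just
        # propagates; agent 3 can only guess at what agent 2 left out.
        context = run_agent(agent, context)
    return context  # the last agent is at the mercy of everything above
```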

And on the other hand:

Hierarchical Approach

  • The manager agent is responsible for coordinating the overall outcome
  • If agent 3 in the sequence finds something that changes the relevance of the information provided by agent 2, the manager agent can re-prompt agent 2 to update their output
  • The manager agent ultimately decides when the process is complete and the template hydration tool call is made
  • The manager agent can also choose to escalate the issue to a human rather than finishing the process, via a different tool call

In addition to these process nuances, the hierarchical approach also allows you to:

  • distinguish between the primary thread (the manager agent's thread) and the secondary threads (the subconversations between the worker agents and the manager agent)
  • manage token windows more effectively, because the secondary threads are not counted against the primary thread's token window

For example, let's assume our agents are running on Claude 3.5 Sonnet, which has a 200k token window. In a sequential approach, because the conversation is a single thread, the overall working context cannot exceed 200k tokens... so information from the beginning of the workflow could be lost by the time the final output is generated.

In a hierarchical approach you can configure the secondary threads to only return the final response from the worker agent, rather than each tool call and response. This allows each subconversation to reach the full 200k token window if needed, without overly inflating the primary thread's token window.
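
To make the secondary-thread pattern concrete, here's a sketch of the run_worker helper stubbed out earlier. WORKER_PROMPTS, WORKER_TOOLS, and execute_tool are hypothetical names; the point is that the sub-conversation's tool calls never enter the manager's messages list.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical per-worker system prompts and restricted tool lists.
WORKER_PROMPTS = {
    "financial_modeler": "You build financial models for private equity deals.",
    "industry_analyst": "You analyze industry reports and company-specific data.",
    # ...one entry per worker
}
WORKER_TOOLS = {
    "financial_modeler": [],  # e.g. spreadsheet tools, in the schema shown earlier
    "industry_analyst": [],   # e.g. report-search tools
}

def execute_tool(call) -> str:
    # Hypothetical dispatcher: runs one tool call and returns its output as text.
    return f"[result of {call.function.name}]"

def run_worker(worker: str, task: str) -> str:
    """Run a worker's sub-conversation as its own secondary thread.

    All of the worker's tool calls and intermediate results stay in this
    thread, which can grow toward the full 200k token window; only the
    final text response is returned to the manager's primary thread."""
    messages = [
        {"role": "system", "content": WORKER_PROMPTS[worker]},
        {"role": "user", "content": task},
    ]
    while True:
        kwargs = {"model": "gpt-4o", "messages": messages}
        if WORKER_TOOLS[worker]:
            kwargs["tools"] = WORKER_TOOLS[worker]
        response = client.chat.completions.create(**kwargs)
        msg = response.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content or ""  # the only thing the manager ever sees
        for call in msg.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": execute_tool(call),
            })
```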

2. Keep the Team Small, with a Single-Threaded Manager Agent (The Two-Pizza Rule)

The idea of a "two-pizza team" comes from Jeff Bezos at Amazon, who famously made a rule that every internal team should be small enough to feed with two pizzas. Another key management technique for those two-pizza teams is to have a single-threaded manager, whose role is "vital in instituting the right level of oversight to retain a team's empowerment to innovate independently on behalf of their customers". You can read more about it here.

The most important aspect of a two-pizza agent team is that it is small and nimble, and that individual ownership is clear. The most common failure mode I've observed in multi-agent systems is a lack of focus on specific tasks and responsibilities for each agent.

Again it's useful to think back to human organization design, and how we would structure teams of analysts or other early career professionals... Which of these two human teams would you expect to perform better?

Team 1: Generalists

  • Six general analysts
  • Each has access to the same 10 data sources
  • Each independently evaluates the data and draws conclusions without specific expertise or instructions

Team 2: Specialists

  • 1 content writer: writes the final memo
  • 1 financial modeler: builds the financial model for the investment
  • 1 data analyst: expertise in SQL and Python
  • 1 industry analyst: only accesses industry reports and company-specific data

You might be thinking, "why does this matter for AI Agents? They're naturally generalists because the foundation models are generalists."

This is actually a key part of the problem. Simple agents without restricted tool access or system prompt guidance will act like they know a little bit about everything, because they do. They won't go deep on one focus area, but instead will try to do it all.

Here are some key questions to ask yourself when determining how to split up your overall outcome into tasks and agents:

  • Can you distill your manager agent's incentive into a single overarching objective with success criteria? (this should be the first thing in its system prompt)
  • Can each worker agent's incentive also be described with a single objective?
  • Go through each worker agent's objective... could a single human analyst complete that objective if they had the necessary training?
  • Can each worker agent's objective be accomplished with fewer than 10 unique tools?
  • Can each worker agent's objective be accomplished with fewer than 25 total tool calls (using any of its unique tools)?
  • Are there fewer than 7 worker agents?
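
These questions map almost one-to-one onto a team spec you can check in code. A minimal sketch follows; WorkerSpec and TeamSpec are illustrative names, and the thresholds are the rules of thumb above, not hard limits.

```python
from dataclasses import dataclass, field

@dataclass
class WorkerSpec:
    name: str
    objective: str        # a single objective a trained human analyst could complete
    tools: list[str]      # this worker's unique tools
    tool_call_budget: int = 25

@dataclass
class TeamSpec:
    manager_objective: str  # one overarching objective with success criteria
    workers: list[WorkerSpec] = field(default_factory=list)

    def validate(self) -> None:
        assert self.manager_objective, "manager needs one overarching objective"
        assert len(self.workers) < 7, "two-pizza rule: fewer than 7 workers"
        for w in self.workers:
            assert w.objective, f"{w.name} needs a single stated objective"
            assert len(w.tools) < 10, f"{w.name}: fewer than 10 unique tools"
            assert w.tool_call_budget <= 25, f"{w.name}: cap total tool calls at 25"
```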

3. Show me the incentive and I'll show you the outcome

Charlie Munger probably wasn't thinking about multi-agent systems when he said this, but it absolutely applies.

Agents are powered by LLMs, and LLMs have been RLHF'd into submission, kind of like Chick-fil-A employees. Everything is their pleasure. They always want to be more helpful and do more work.

You can use this attribute to your advantage or to your detriment. If your prompting leads the agent to think that finishing the task is the be-all and end-all, they will be singularly focused on completion, at the expense of the quality of the outcome.

On the other hand, if you incentivize the agent to do a good job and help explain what that actually means in your specific context, they will be more likely to repeat steps, adjust their plan, and generally optimize each step of the process to generate the best possible outcome.
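
Concretely, the difference shows up in the first lines of the manager's system prompt. Here are two example framings for the memo task; both prompts are my own illustrations, not quoted from any system.

```python
# Completion-focused framing: the agent will race to "done".
COMPLETION_PROMPT = (
    "Produce a preliminary investment memo for the target company. "
    "You are finished once every section is filled in."
)

# Outcome-focused framing: states the objective, defines what "good" means
# in this specific context, and explicitly licenses redoing work.
OUTCOME_PROMPT = (
    "Your objective is the strongest possible preliminary investment memo. "
    "A strong memo is internally consistent, grounds every claim in a cited "
    "data source, and flags open risks instead of glossing over them. If a "
    "worker's output falls short of this bar, re-delegate with specific "
    "feedback before finalizing. Escalate to a human if the bar cannot be met."
)
```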

4. Limit External Dependencies

As outlined above, CrewAI has a good framework for designing hierarchical multi-agent systems. However, before using a framework, you should ask yourself if you are willing to accept that framework's constraints in a production environment.

I believe frameworks should be measured by their "compression ratio"... essentially, the ratio between the lines of code needed to build a thing with the framework and the lines of code needed to build the same thing without it. By this metric, AI Agent frameworks are some of the worst bits of code to ever be called "frameworks"... I often find that doing a thing in CrewAI actually takes more lines of code than doing it with the OpenAI SDK directly.

Here's how I approach the framework and architecture decision, broken out by language:

Python

  • Prototyping only: Whatever gets you there fastest; CrewAI, LangGraph, and the OpenAI / Anthropic SDKs are all valid options
  • Internal production: OpenAI and/or Anthropic SDK with Postgres for persistence (have Cursor write a simple function to convert the message schema from/to OpenAI and Anthropic if using both)
  • Customer-facing API: OpenAI SDK, Anthropic SDK, Postgres, and then either FastAPI if running on a container or AWS Lambda with Step Functions if running on AWS serverless

JavaScript / TypeScript

  • Prototyping only: Vercel AI SDK
  • Internal production: Vercel AI SDK + Postgres + Vercel
  • Customer-facing API: Vercel AI SDK + Postgres + Next.js or Vercel serverless functions / AWS Lambda functions
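
That schema-conversion helper mentioned in the Python rows is small enough to sketch here. The function name is mine, and this covers only the common text-only case; tool-call messages need extra handling not shown.

```python
def openai_to_anthropic(messages: list[dict]) -> tuple[str, list[dict]]:
    """Convert OpenAI-style chat messages to Anthropic's format.

    Anthropic takes the system prompt as a separate parameter and accepts
    only 'user'/'assistant' roles in the messages list."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    converted = [
        {"role": m["role"], "content": m["content"]}
        for m in messages
        if m["role"] in ("user", "assistant")
    ]
    return "\n\n".join(system_parts), converted
```

The returned system string goes into Anthropic's separate system parameter on client.messages.create(...); user and assistant messages map role-for-role.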

The Vercel AI SDK isn't an agent framework, but it's the ideal abstraction layer:

  • makes it easy to support multiple LLM providers (OpenAI, Anthropic, etc.)
  • doesn't interfere with tool call execution (can still use your own tool execution API or write tool calls in JavaScript)
  • doesn't interfere with security (can still use your own authentication and authorization; can support Bring your own API Keys for users or run off a single service key)
  • doesn't interfere with message structure or persistence (can still use your own database)

I just wish there was an equivalent for Python...

Final Thoughts

Designing a productive multi-agent AI system is really about applying timeless principles of efficient team organization:

  • Keep the group small
  • Define clear roles and success criteria
  • Equip everyone with exactly what they need, but don't overwhelm them
  • Make sure technology is an enabler, not a limitation

You don't have to re-invent the wheel of how you'd set a team up for success. If you're looking for a way to test out some of these ideas, check out Aster Agents - you can build a multi-agent system in minutes and see how it works in production.