It started as a late-night experiment. I was frustrated by a recurring problem: I'd ask an LLM a technical question, get a confident answer, ship it, and discover a subtle bug two days later. The model wasn't wrong exactly -- it just had blind spots. Every model does. The question was whether those blind spots overlap.
Spoiler: they don't. And that insight became the foundation for something I call the LLM Council.
The concept is borrowed from ensemble methods in traditional machine learning. A random forest outperforms a single decision tree not because any individual tree is brilliant, but because their errors are uncorrelated. What if the same principle applied to large language models?
The hypothesis: if you send the same query to multiple LLMs, have them review each other's answers anonymously, and then synthesize the results, the output should be more reliable than any single model's response.
So I built it. A system that fans out a query to 5+ models via OpenRouter, collects their responses, runs an anonymous peer review round, and has a "Chairman" model synthesize the final answer. The whole pipeline takes 15-30 seconds depending on the models used, but the quality improvement is dramatic.
The system has three stages. Here's the core orchestration logic:
import asyncio

async def run_council(query: str, models: list[str]) -> CouncilResult:
    # Stage 1: Fan-out -- all models answer in parallel
    responses = await asyncio.gather(*[
        query_model(model, query) for model in models
    ])

    # Stage 2: Anonymous review -- each model reviews all the others.
    # Responses are labeled by letter so reviewers never see model names.
    reviews = []
    for reviewer_idx, reviewer in enumerate(models):
        other_responses = [
            {"id": f"Response {chr(65 + i)}", "content": r.content}
            for i, r in enumerate(responses) if i != reviewer_idx
        ]
        review = await query_model(reviewer, build_review_prompt(
            original_query=query,
            responses=other_responses,
        ))
        reviews.append(review)

    # Stage 3: Chairman synthesizes the final answer from the raw
    # responses plus the cross-reviews
    chairman = "anthropic/claude-sonnet-4"
    final = await query_model(chairman, build_synthesis_prompt(
        query=query,
        responses=responses,
        reviews=reviews,
    ))
    return CouncilResult(final=final, responses=responses, reviews=reviews)
The critical design decision is in Stage 2: anonymous review. Models don't know which model produced which response. They see "Response A," "Response B," etc. This prevents brand bias -- models tend to defer to responses they suspect came from a "stronger" model. Anonymity forces them to evaluate on content alone.
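For the blinding to hold up over many queries, the letter labels shouldn't be a stable function of model order -- otherwise "Response A" is always the same model and a reviewer can learn the mapping over time. Here's a minimal sketch of a shuffled labeling helper (the function name and return shape are my own, not part of the pipeline above):

```python
import random

def anonymize_responses(
    responses: list[str], exclude_idx: int
) -> tuple[list[dict], dict[str, int]]:
    """Label every response except the reviewer's own with a neutral letter."""
    # Shuffle so "Response A" isn't always the same model across rounds,
    # which would let letter positions leak identity over time.
    indices = [i for i in range(len(responses)) if i != exclude_idx]
    random.shuffle(indices)
    labeled = [
        {"id": f"Response {chr(65 + pos)}", "content": responses[i]}
        for pos, i in enumerate(indices)
    ]
    # Private mapping so the orchestrator can de-anonymize scores later.
    key = {f"Response {chr(65 + pos)}": i for pos, i in enumerate(indices)}
    return labeled, key
```

The reviewer only ever sees `labeled`; the `key` stays with the orchestrator so scores can be attributed back to models after the review round.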
Each reviewer scores the other responses on three dimensions: correctness, completeness, and clarity. The scoring prompt is deliberately structured to force differentiation:
REVIEW_PROMPT = """You are reviewing {n} responses to the following query:
"{query}"
For each response, score on a 1-5 scale:
- CORRECTNESS: Are the facts and logic accurate?
- COMPLETENESS: Does it address all aspects of the query?
- CLARITY: Is it well-structured and easy to follow?
You MUST assign different scores. No ties allowed.
Identify specific errors or omissions in each response.
Rank the responses from best to worst with justification."""
The "no ties" rule is important. Without it, models tend to rate everything 4/5 to avoid conflict. Forcing differentiation produces much more useful signal for the synthesis stage.
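The rule is also easy to enforce mechanically. Here's a sketch of a validator, assuming each review has been parsed into per-dimension integer scores (the dict shape is my assumption, and "no ties" is interpreted as distinct totals); a review that fails the check can simply be re-requested:

```python
def validate_review(scores: dict[str, dict[str, int]]) -> bool:
    """Check a parsed review: every score in 1-5 and no tied totals.

    `scores` maps a response label to per-dimension scores, e.g.
    {"Response A": {"correctness": 4, "completeness": 5, "clarity": 3}}.
    """
    totals = []
    for dims in scores.values():
        if any(not 1 <= v <= 5 for v in dims.values()):
            return False
        totals.append(sum(dims.values()))
    # The "no ties" rule: every response needs a distinct total.
    return len(totals) == len(set(totals))
```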
After running hundreds of queries through the council, clear personality patterns emerged across models. These aren't anthropomorphic projections -- they're measurable tendencies in how models approach the same problem.
Claude (Sonnet/Opus) is the cautious analyst. It hedges. It qualifies. It says "it depends" a lot. During review, Claude is the most likely to flag edge cases that others missed. Its weakness: sometimes the hedging obscures a clear answer. When Claude is confident, pay attention -- it usually means the answer is solid.
GPT-4 is the confident generalist. It gives direct, well-structured answers and rarely hedges. During review, it's the harshest critic -- quick to identify logical flaws. Its weakness: overconfidence. When GPT-4 is wrong, it's wrong with full conviction, which can mislead the synthesis stage if not caught by other reviewers.
Gemini is the creative diverger. It often approaches problems from unexpected angles and surfaces considerations others miss entirely. During review, it provides the most detailed critiques. Its weakness: verbosity and occasional tangents. It sometimes brings up valid but irrelevant points that dilute focus.
Llama/Mixtral (open source) are the practical implementers. They tend to jump straight to code or concrete steps. Less philosophical, more hands-on. During review, they catch implementation-level issues that the commercial models gloss over. Their weakness: they occasionally miss higher-level architectural concerns.
Three findings genuinely surprised me:
1. The council catches errors no single model would. In a test with 50 technical questions where I knew the correct answer, the best individual model (Claude Opus) scored 82% accuracy. The council scored 94%. The improvement came almost entirely from the review stage -- models catching each other's mistakes.
2. Minority opinions are often right. In about 15% of cases, one model disagreed with the majority -- and the dissenter was correct. This is why the chairman model is instructed to weigh minority opinions carefully rather than just going with consensus. Majority voting alone would have missed these cases.
3. Model agreement is a confidence signal. When all 5 models independently produce the same answer, it's correct 99% of the time. When they diverge significantly, it usually means the question is genuinely ambiguous or requires domain expertise none of them have. The degree of agreement is itself valuable metadata.
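That agreement signal is cheap to compute. A toy sketch using exact matching on normalized answers -- adequate for short factual outputs, though a real pipeline would swap in embedding similarity or a judge model:

```python
from itertools import combinations

def agreement_score(answers: list[str]) -> float:
    """Fraction of model pairs whose normalized answers match exactly.

    1.0 means full consensus; 0.0 means every model disagreed with
    every other. Exact matching only suits short factual answers.
    """
    normalized = [a.strip().lower() for a in answers]
    pairs = list(combinations(normalized, 2))
    if not pairs:
        return 1.0  # a single answer trivially agrees with itself
    return sum(a == b for a, b in pairs) / len(pairs)
```

Surfacing this number alongside the final answer turns the council into its own confidence estimator: high agreement, trust the synthesis; low agreement, treat the question as ambiguous.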
I ran an A/B test: council with anonymous review vs. council with attributed review (where models knew which model produced each response). The results were stark.
With attribution, models deferred to Claude and GPT-4 responses roughly 40% more often, regardless of quality. Gemini and open-source model responses were rated lower even when they contained the correct answer. The brand bias was measurable and consistent.
Anonymous review eliminated this entirely. Responses were evaluated on their merits. This mirrors a well-known finding in human peer review: double-blind review produces fairer evaluations than single-blind. Turns out LLMs have the same bias.
"The best answer doesn't always come from the biggest model. But you'll never discover that if you let the models see each other's name tags."
The council pattern has applications far beyond question-answering. I'm currently exploring two:
Automated code review. Fan out a code diff to multiple models, have each review independently, synthesize into a comprehensive review. Early results show the council catches 2-3x more issues than any single model, with fewer false positives because the review stage filters out nitpicks that only one model cares about.
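The nitpick-filtering step can be as simple as majority voting over flagged issues. A sketch, assuming each reviewer's findings have been reduced to short issue strings (real findings would need semantic clustering before counting votes):

```python
from collections import Counter

def filter_consensus_issues(
    per_model_issues: list[list[str]], min_votes: int = 2
) -> list[str]:
    """Keep only issues flagged by at least `min_votes` reviewers.

    Issues are matched by normalized text here; a production pipeline
    would cluster semantically similar findings before counting votes.
    """
    counts: Counter[str] = Counter()
    for issues in per_model_issues:
        # Dedupe within one reviewer so a model can't vote twice.
        counts.update({issue.strip().lower() for issue in issues})
    return sorted(issue for issue, n in counts.items() if n >= min_votes)
```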
Content verification. For any AI-generated content that matters -- documentation, technical writing, customer communications -- the council acts as a quality gate. Three models generate, two models review, one synthesizes. The output is more accurate and more balanced than single-model generation.
The LLM Council isn't a product. It's a pattern -- one that acknowledges a fundamental truth about the current state of AI: no single model is reliable enough for high-stakes decisions. But an ensemble of models, reviewing each other anonymously, comes remarkably close.
The cost is 5-6x a single model call. For most chat applications, that's overkill. But for decisions that matter -- code that ships to production, content that represents your brand, analysis that drives business decisions -- the council is the most cost-effective quality improvement I've found. It's cheaper than a human reviewer and available 24/7.
If you're building AI systems where accuracy matters more than speed, stop asking one model and start convening a council.