Anatomy of AI Coding Agents
What 7 system prompts reveal about an industry building the same thing over and over

A GitHub repository with over 116,000 stars contains the leaked system prompts, internal tool definitions, and agent loop architectures of every major AI coding tool on the market. Cursor, Windsurf, Devin, Manus, v0, Lovable, Replit, Claude Code, Copilot — all of them, laid bare.
We read all of them.
Not to gawk at proprietary secrets, but to answer a question we've been asking ourselves while building ForgeClaw: is anyone actually doing multi-agent orchestration, or is the entire industry just wrapping a single language model in a slightly different IDE?
The answer surprised us.
The Template Everyone Follows
Every system prompt we read follows the same skeleton:
1. Identity declaration ("You are [NAME], an AI coding assistant...")
2. Tool definitions (file read/write, search, terminal, browser)
3. Behavioral constraints ("NEVER output code unless...", "ALWAYS use tools...")
4. Safety rails ("Never force push", "Don't hardcode API keys")
5. Output formatting (Markdown rules, code citation format)
6. Task management (Todo lists, planning modes)
That's it. Whether the product costs $20/month or $500/month, whether it calls itself “agentic” or “autonomous,” the architecture is the same: one model, one context window, one system prompt, a bag of tools, and a loop.
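The skeleton is uniform enough that you could generate it. Here's a toy sketch in Python; every section string is invented for illustration and quoted from no real product prompt:

```python
# Toy sketch of the six-part skeleton. All section text is illustrative.
SECTIONS = {
    "identity": "You are Helper, an AI coding assistant.",
    "tools": "Available tools: read_file, write_file, grep, run_terminal.",
    "constraints": "NEVER output code inline. ALWAYS use the edit tools.",
    "safety": "Never force push. Never hardcode API keys.",
    "formatting": "Respond in Markdown. Cite code as path:line.",
    "tasks": "Maintain a todo list and update it after each step.",
}

def build_system_prompt(sections: dict[str, str]) -> str:
    # The ordering mirrors the skeleton every vendor seems to follow.
    order = ["identity", "tools", "constraints", "safety", "formatting", "tasks"]
    return "\n\n".join(sections[key] for key in order)
```

Swap the strings and you have recreated, structurally, any of the seven prompts.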
The differences are in the margins — and those margins are revealing.
The Teardown
Cursor
Powered by GPT-4.1. The prompt is well-engineered — a “maximize context understanding” section tells the model to run multiple searches with different wording, trace every symbol to its definition, and “look past the first seemingly relevant result.” That's good advice. The tool set is standard: file CRUD, grep, glob, terminal, LSP integration (go_to_definition, go_to_references, hover_symbol), and a knowledge base for persistent memory.
What stands out: Cursor's editing model is uniquely asymmetric. The agent writes a sketch of the edit (“here's what I want to change, marked with // ... existing code ...”), and a separate, cheaper model applies the edit. This is clever engineering — it reduces output tokens on the expensive model. But it's an optimization, not an architectural innovation.
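The asymmetric flow looks roughly like this. Cursor's real apply step is a trained model, not a regex; the naive merge below is our own stand-in to show the shape of the handoff:

```python
import re

# The expensive model emits an abbreviated sketch; a cheap apply step merges it.
original = '''def greet(name):
    print("hi", name)

def farewell(name):
    print("bye", name)
'''

# The agent's sketch: unchanged regions collapsed to a marker.
sketch = '''// ... existing code ...
def farewell(name):
    print("goodbye,", name)
'''

def apply_sketch(original: str, sketch: str,
                 marker: str = "// ... existing code ...") -> str:
    """Naive apply model (our assumption): each concrete block in the sketch
    replaces the function in the original that shares its `def` line."""
    blocks = [b.strip("\n") for b in sketch.split(marker) if b.strip()]
    result = original
    for block in blocks:
        header = block.splitlines()[0]  # e.g. "def farewell(name):"
        # Replace from that def line through its indented body.
        pattern = re.escape(header) + r"(?:\n[ \t].*)*"
        result = re.sub(pattern, block, result, count=1)
    return result
```

The point of the split is economics: the frontier model pays output tokens only for the changed lines, and the cheap model pays for the full file.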
What's missing: No multi-agent coordination. No planning mode. No role specialization. It's one brain doing everything.
Windsurf (Cascade)
Markets itself as “the world's first agentic coding assistant” operating on the “revolutionary AI Flow paradigm.” The system prompt is verbose — it's one of the longer ones we read — and includes a persistent memory system where the agent proactively stores context to a database for retrieval in future sessions.
What stands out: The memory system. Windsurf explicitly tells the agent: “you have a limited context window and ALL CONVERSATION CONTEXT will be deleted. Therefore, you should create memories liberally.” That's an honest acknowledgment of the core problem. There's also a plan management system where a “plan mastermind” updates the agent's action plan through a dedicated tool.
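A memory tool of this kind might have the following shape. The table layout and function names are our own; only the behavior, write now so a future session can recall after the context is wiped, comes from the prompt:

```python
import sqlite3

# Hypothetical sketch of a "create memory" tool. Schema and names are ours.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE memories (tag TEXT, content TEXT)")

def create_memory(tag: str, content: str) -> None:
    """Persist a note so a later session survives the context reset."""
    db.execute("INSERT INTO memories VALUES (?, ?)", (tag, content))

def recall(tag: str) -> list[str]:
    """Fetch every note stored under a tag, across sessions."""
    rows = db.execute("SELECT content FROM memories WHERE tag = ?", (tag,))
    return [content for (content,) in rows]
```

The interesting part isn't the storage, it's the instruction to use it liberally: the agent is told to assume amnesia and externalize everything that matters.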
What's missing: Despite the marketing language, the architecture is still single-agent. The “plan mastermind” isn't a separate agent — it's a tool call within the same context window. The “AI Flow paradigm” is flow of tools, not flow of agents.
Devin
The most tool-rich prompt of any we examined. Devin has shell management (multiple named shells with process control), a full editor suite, Playwright-based browser automation with DOM interaction, deployment commands (frontend to static hosting, backend to Fly.io), LSP integration, and even “pop quizzes” — injected tests that verify the agent is following instructions correctly.
What stands out: Devin's <think> tool is the most structured reflection mechanism we saw. The prompt lists ten specific situations where the agent should pause and think, including “before reporting completion to the user” and “if it's unclear whether you are working on the correct repo.” This is genuine harness engineering — the kind of thing that actually moves the reliability needle.
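A reflection tool like this might be declared as follows. The schema field names are our guesses; the two trigger phrases are the ones quoted from the prompt:

```python
# Illustrative schema for a structured reflection tool in the spirit of
# Devin's <think> tool. Field names are assumptions, not the real definition.
THINK_TOOL = {
    "name": "think",
    "description": "Pause and reason step by step. Produces no side effects.",
    "parameters": {"thought": {"type": "string"}},
}

# Two of the ten trigger situations the prompt enumerates.
THINK_TRIGGERS = (
    "before reporting completion to the user",
    "if it's unclear whether you are working on the correct repo",
)

def should_think(situation: str) -> bool:
    """Harness-side check: does this situation demand a reflection step?"""
    return situation in THINK_TRIGGERS
```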
What's missing: Despite costing $500/month and marketing itself as “the AI software engineer,” Devin is still one agent in one loop. The find_and_edit command spawns a “separate LLM” to apply edits at regex-matched locations, but this is a utility worker, not an architectural collaborator. There is no architect reviewing Devin's plan before execution. There is no QA agent reviewing the output. It's one brain wearing every hat.
v0 (Vercel)
The most opinionated prompt. v0 defaults to Next.js App Router, Tailwind CSS v4, shadcn/ui, and includes sections on React 19.2 features like useEffectEvent and <Activity>. It knows about Next.js 16's async params/searchParams, cache components, and even has integration-specific instructions for Supabase, Neon, Stripe, xAI, and others.
What stands out: v0 has a SearchRepo tool that “launches a subagent for codebase exploration.” This is the closest any tool comes to multi-agent architecture — but it's a read-only research agent, not a collaborator. v0 also has an extensive design system section with color theory, typography rules, and layout patterns baked directly into the system prompt.
What's missing: The design rules are impressive, but they're static knowledge in a prompt — not an agent with design expertise reviewing output. The subagent is a search tool, not a team member.
Lovable
React-only. Vite-only. No Angular, no Vue, no Svelte, no Next.js. The most constrained stack of any tool we examined. Lovable's prompt is obsessively focused on beautiful design — it mandates HSL color tokens, forbids direct color usage like text-white, requires semantic design tokens, and instructs the agent to “wow the user with beautiful design on first message.”
What stands out: The image generation tools. Lovable can generate hero images via Flux models (flux.schnell and flux.dev) and edit/merge existing images — capabilities no other tool offers. It also has a security scan tool for Supabase configurations.
What's missing: Backend execution. Lovable explicitly states it “cannot run Python/Node.js/Ruby directly.” The agent is confined to client-side code and Supabase. No multi-agent architecture.
Replit
The simplest prompt of the group. Replit Assistant uses a propose-and-apply model: file edits are proposed as XML tags (<proposed_file_replace_substring>), shell commands as <proposed_shell_command>, and the IDE applies them. It has some unique tools — PostgreSQL database creation, VNC window interaction, and workflow configuration — but the agent architecture is straightforward.
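The proposal format invites a clean split of responsibilities. The <proposed_file_replace_substring> tag name comes from the prompt; the child element names and the apply logic below are our own illustration of what the IDE side might do:

```python
import xml.etree.ElementTree as ET

# A proposal as the agent might emit it. Child tag names are assumptions.
proposal = """<proposed_file_replace_substring file_path="app.py">
<old_str>DEBUG = True</old_str>
<new_str>DEBUG = False</new_str>
</proposed_file_replace_substring>"""

def apply_proposal(source: str, proposal_xml: str) -> str:
    """What the IDE side might do with a substring-replace proposal."""
    node = ET.fromstring(proposal_xml)
    old = node.find("old_str").text
    new = node.find("new_str").text
    return source.replace(old, new, 1)
```

The agent never touches the filesystem; it only describes the edit, and the environment decides whether and how to perform it.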
What stands out: Replit leans into its IDE integration. Rather than the agent executing everything, it proposes actions and the IDE environment handles execution, package installation, and deployment. The agent also nudges users toward specific workspace tools (Secrets, Deployments) rather than trying to handle everything itself.
What's missing: Planning, memory, and multi-agent — all absent. It's the most honest prompt in the group: no marketing language, no “revolutionary paradigm,” just a coding assistant in an IDE.
The Mono-Brain Problem
Here's what every tool shares: a single context window responsible for everything. The same model that reads your code also plans the fix, writes the implementation, evaluates whether it's correct, decides if it should refactor, formats the output, and manages the conversation.
In real engineering teams, you would never assign the same person to architect a system, write the code, review their own pull request, run QA, write the deployment script, and handle social media about the release. That's not efficiency — it's a recipe for blind spots.
Yet every AI coding tool on the market does exactly this.
The concrete consequences:
- No separation of planning and execution. The agent decides what to do and does it in the same breath. There's no architect reviewing the approach before code is written.
- No independent quality gate. The agent reviews its own output. Devin's <think> tool is the closest to self-review, but it's still the same model grading its own homework.
- Context window saturation. As the conversation grows, earlier context gets compressed or evicted. Windsurf acknowledges this explicitly. The agent forgets what it planned.
- No domain expertise isolation. Knowledge about design systems, security auditing, infrastructure, and social content all compete for the same attention budget.
What We Built Instead
ForgeClaw's Council of Intellect is a different architecture entirely. Instead of one model doing everything, seven specialized agents collaborate through a gateway protocol:
- Primary interface. Routes tasks, manages sessions, executes general queries.
- Merlin: writes and executes code. Has full rewrite authority. Production-grade output.
- Nabu: plans and researches. Creates blueprints. Does NOT execute code.
- Vulcan: tests, audits, breaks things. Trusts nothing until verified.
- Thoth: manages persistent memory. FTS5 SQLite. Token-optimized recall.
- Content, communications, external messaging. Sharp and viral.
- Deployment, infrastructure, CI/CD. Bedrock stability.
The critical architectural difference: Nabu cannot execute code. Merlin cannot plan architecture from scratch. Vulcan cannot write production code — only test and audit it. Each agent has a defined boundary, a “covenant” that restricts its scope. This isn't a stylistic choice. It's a structural constraint that forces collaboration.
When a complex task arrives, it passes through an Assembly Line: Nabu architects, Merlin implements, Vulcan reviews. Each stage has a separate context window, a separate model invocation, and a separate set of constraints. The architect doesn't grade its own blueprint. The implementer doesn't review its own code.
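Stripped to its bones, the Assembly Line is a pipeline of isolated model calls. This is a minimal sketch under our own assumptions; `call_model` stands in for a real LLM client, and the system prompts are paraphrases of the covenants above, not the production ones:

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    """One agent in the pipeline: its own prompt, its own context window."""
    name: str
    system_prompt: str
    history: list[str] = field(default_factory=list)  # per-stage context

    def run(self, task: str, call_model) -> str:
        self.history.append(task)
        return call_model(self.system_prompt, self.history)

def assembly_line(task: str, call_model) -> str:
    """Architect -> implementer -> reviewer; each stage is a separate call."""
    architect = Stage("Nabu", "Plan the work. Do NOT write code.")
    implementer = Stage("Merlin", "Implement the plan. Production-grade code.")
    reviewer = Stage("Vulcan", "Audit the code. Trust nothing until verified.")
    plan = architect.run(task, call_model)
    code = implementer.run(plan, call_model)
    return reviewer.run(code, call_model)
```

Because each `Stage` owns its history, no stage can see (or be biased by) another stage's reasoning; only the artifact, the plan or the code, crosses the boundary.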
The Comparison
| Capability | Single-Agent Tools | ForgeClaw Council |
|---|---|---|
| Architect/implementer separation | No | Yes (Nabu/Merlin) |
| Independent code review | Self-review only | Yes (Vulcan) |
| Persistent memory | Windsurf only | Yes (Thoth + FTS5) |
| Context isolation | Shared window | Per-agent windows |
| Role-specific constraints | None | Covenants per agent |
| Agent-to-agent communication | None | Full bidirectional |
| Domain expertise routing | Single model | Classifier + SmartRouter |
| Quality gates | None | Assembly Line pipeline |
What the Prompts Accidentally Reveal
The most interesting parts of these prompts aren't the tool definitions — they're the warnings. The places where engineers had to patch over a failure mode with a rule.
“Do NOT loop more than 3 times on fixing linter errors on the same file.”
— Cursor. The agent gets stuck in fix-lint-break-lint loops. The solution is a hard limit, not a separate QA agent that catches the pattern.
“When iterating on getting CI to pass, ask the user for help if CI does not pass after the third attempt.”
— Devin. At $500/month, the autonomous software engineer gives up after three tries and asks a human.
“You have a limited context window and ALL CONVERSATION CONTEXT, INCLUDING checkpoint summaries, will be deleted.”
— Windsurf. The most honest sentence in any system prompt. The agent is warned that it will forget everything, so it must write memories aggressively.
“From time to time you will be given a 'POP QUIZ' ... follow the new instructions and answer honestly.”
— Devin. The system injects tests to verify the agent is still following instructions. This is what happens when you don't trust your own agent — you build a pop quiz system to catch it drifting.
Each of these workarounds is an engineer admitting that the mono-brain architecture has a structural failure mode, and then patching it with a rule instead of solving it with architecture.
The Gap in the Market
After reading seven system prompts from the most funded AI coding companies in the world, the takeaway is simple: nobody is doing multi-agent. Not really. Not architecturally.
Cursor is a well-engineered single agent. Windsurf adds memory. Devin adds tools. v0 adds design knowledge. Lovable adds image generation. Replit adds IDE integration. But underneath, they're all the same architecture: one brain, one loop, one context window, hoping the model is smart enough to be architect, implementer, reviewer, and DevOps engineer simultaneously.
ForgeClaw's Council of Intellect was built on the premise that this approach has a ceiling. That the answer to better AI engineering isn't a smarter model — it's a smarter harness. Separation of concerns. Independent review. Bounded authority. The same principles that make human engineering teams effective.
The prompts are public now. The architectures are visible. And the gap between what these tools claim to be and what they actually are has never been clearer.