I’ve been experimenting with something that feels slightly unhinged: using different AI models at different stages of building a feature.

Not because I’m indecisive. Because each model has a different superpower.

GPT-5.2 is great at structured documentation and architectural thinking. Claude Opus 4.6 is terrifyingly good at catching edge cases and writing precise code. So why would I force one model to do everything when I could use them like specialized tools?

This is the story of building a tiny feature called printraw - and how a five-stage, multi-model workflow caught bugs that a single-model approach would have missed entirely.


The Feature: Stop Making Me Fight the Terminal

Here’s the problem: Aye Chat renders AI responses in pretty Rich panels with Markdown formatting and box-drawing characters. Looks great. Feels polished.

But try to copy that text and paste it somewhere else.

You get a mess of line breaks, box characters, and formatting artifacts that make you want to throw your laptop into the sea.

The fix seemed simple: add a printraw command that reprints the last response as plain, copy-friendly text. No panels. No Rich formatting. Just raw text wrapped in delimiters you can select and copy.

The feature itself? Trivial. The workflow I used to build it? That’s what got interesting.


The Five-Stage Pipeline

Here’s what I ended up doing:

Stage   Task                Model
1       Write the plan      GPT-5.2
2       Validate the plan   Claude Opus 4.6
3       Implement           Claude Opus 4.6
4       Write tests         GPT-5.2 + Claude Opus 4.6 (alternating)
5       Fix until green     GPT-5.2 + Claude Opus 4.6 (alternating)

This isn’t “use one model for everything.” It’s a staged pipeline where model selection is intentional - like choosing a screwdriver vs. a hammer based on what you’re actually fastening.


Stage 1: Planning with GPT-5.2

I started by describing the UX problem to GPT-5.2 and asking for a complete implementation plan.

GPT-5.2 produced a thorough document covering:

  • Command syntax and output format
  • Where to capture the last response text
  • Two architecture options (store in REPL vs. store in presenter)
  • Which files to modify
  • Testing approach
  • Edge cases

Why GPT-5.2 for planning? It’s genuinely good at organized technical writing. It thinks through tradeoffs without needing to see every line of code. The output was clean, structured, and gave me something concrete to react to.


Stage 2: Validation with Claude Opus 4.6

Here’s where it gets interesting.

I handed the plan to Claude 4.6 with a simple prompt: “Review and validate this plan. Let me know if you’d recommend any adjustments.”

Claude came back with seven specific recommendations, prioritized by impact:

#   Recommendation                                   Priority
1   Add raw as a short alias                         High - usability
2   Use plain print(), not Rich console.print()      High - correctness
3   Shorten delimiter lines                          Low - taste
4   Clarify: summary-only output, not file changes   Medium
5   Treat whitespace-only summary as empty           Medium - edge case
6   Note that mid-stream printraw is N/A             Low - docs only
7   Add Rich-markup-leak test case                   Medium - correctness

Recommendation #2 was the one that made me sit up.

The Rich markup leak problem: If you use Rich’s console.print() to output “raw” text, and the AI’s response happens to contain tokens like [bold] or [red], Rich interprets them as markup instead of printing them literally. Your “raw” output comes out formatted. The whole point of the feature is defeated.

The fix - using Python’s built-in print() - is trivial. But I would have missed it without a dedicated review pass.
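To make the leak concrete, here’s a minimal sketch - the strings are illustrative, not Aye Chat’s actual output path:

    from rich.console import Console

    console = Console()
    response_text = "Wrap it in [bold]something[/bold] to emphasize it."

    # Rich parses [bold]...[/bold] as markup: the tags vanish and the word renders bold.
    console.print(response_text)

    # The built-in print() writes the string exactly as-is, tags and all.
    print(response_text)

(Rich’s console.print() does accept markup=False, but for copy-paste output the plain built-in is the simpler, safer choice.)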

Why Claude 4.6 for validation? It’s like hiring a polite pedant to review your work. The structured table with priority ratings made it easy to cherry-pick which adjustments to accept. I took recommendations #1 through #5 and skipped the documentation-only items.


Stage 3: Implementation with Claude Opus 4.6

With a validated plan in hand, I asked Claude to implement it.

The first implementation changed the return types of handle_with_command() and handle_blog_command() from Optional[int] to Tuple[Optional[int], Optional[str]] - threading the response text back to the REPL.

I flagged this immediately: “Won’t that introduce regressions and break existing functionality?”

Claude acknowledged the risk and proposed something cleaner: capture the text at the source of truth - inside print_assistant_response() itself, using a module-level variable.

This approach:

  • Required zero signature changes
  • Had zero regression risk
  • Was guaranteed to capture the correct text (whatever was actually printed)
  • Worked automatically for all code paths

Much better.
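A minimal sketch of that capture pattern - the names are my guesses, not the actual contents of presenter/repl_ui.py:

    # presenter/repl_ui.py (sketch)
    from typing import Optional

    _last_assistant_response: Optional[str] = None  # module-level capture slot


    def print_assistant_response(text: str) -> None:
        """Render the response in a Rich panel and remember exactly what was shown."""
        global _last_assistant_response
        _last_assistant_response = text  # capture at the source of truth
        # ... existing Rich panel rendering continues unchanged ...


    def get_last_assistant_response() -> Optional[str]:
        """Return the most recently printed response, or None if nothing has been shown."""
        return _last_assistant_response

Because the capture happens inside the function every code path already calls, nothing upstream needs a new signature.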

The Bug That Almost Shipped

Even with the cleaner approach agreed on, the first test showed the command printing “No assistant response available yet” after a valid response.

Root cause: the initial code tried to capture text using getattr(llm_response, 'answer_summary', None), but the response object’s attribute was actually .summary, not .answer_summary.

The fix was exactly the module-level capture approach - store the text inside print_assistant_response() where the correct string is guaranteed to exist, regardless of what the response object’s attributes are named.

Final implementation touched four files:

  • presenter/repl_ui.py - Module variable + getter + capture logic
  • presenter/raw_output.py - New file: plain print() with delimiters
  • controller/command_handlers.py - New handle_printraw_command() handler
  • controller/repl.py - Added printraw and raw to built-in commands
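Here’s roughly what the new raw-output module and its handler wiring could look like - a sketch built on the getter above, not the shipped code, with the delimiter text and handler signature as assumptions:

    # presenter/raw_output.py (sketch)
    from typing import Optional

    RAW_DELIMITER = "-" * 40  # assumed delimiter; shortened per recommendation #3


    def print_raw(text: Optional[str]) -> None:
        """Print the last response as plain, copy-friendly text between delimiters."""
        if text is None or not text.strip():
            # Covers None, empty, and whitespace-only summaries (recommendation #5).
            print("No assistant response available yet")
            return

        # Built-in print() only: Rich would interpret tokens like [bold] as markup.
        print(RAW_DELIMITER)
        print(text)
        print(RAW_DELIMITER)


    # controller/command_handlers.py (sketch - the real signature may differ)
    def handle_printraw_command() -> None:
        from presenter.repl_ui import get_last_assistant_response
        print_raw(get_last_assistant_response())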

Stages 4 & 5: The Adversarial Testing Loop

Here’s where the multi-model approach got really interesting.

With implementation done, I didn’t just ask one model to write tests and fix them. I ping-ponged between models: one writes, the other critiques and fixes, repeat.

The test coverage needed to include:

  • Normal output with delimiters
  • Rich markup leak prevention (the [bold]something[/bold] case)
  • None input → warning message
  • Whitespace-only input → warning message
  • Empty string → warning message
  • The capture mechanism
  • The handler integration
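To make a couple of those cases concrete, here’s a sketch of what the markup-leak and whitespace tests might look like against the print_raw() sketch above - not the project’s actual test file:

    # tests/test_raw_output.py (sketch)
    from presenter.raw_output import print_raw


    def test_markup_is_printed_literally(capsys):
        # Rich tokens must survive untouched in the raw output.
        print_raw("use [bold]something[/bold] here")
        assert "[bold]something[/bold]" in capsys.readouterr().out


    def test_whitespace_only_summary_warns(capsys):
        # Whitespace-only input is treated the same as no response at all.
        print_raw("   \n\t")
        assert "No assistant response available yet" in capsys.readouterr().out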

Then came the iteration loop - but with a twist.

(ツ» model gpt-5.2
(ツ» write tests for the printraw feature

GPT writes tests.

(ツ» pytest tests/test_raw_output.py -v

Oh God. OH GOD. Red everywhere.

(ツ» model claude-opus-4.6
(ツ» fix the failing tests

Claude fixes things - and often rewrites chunks of GPT’s approach entirely.

(ツ» pytest tests/test_raw_output.py -v

Still red, but fewer failures.

(ツ» model gpt-5.2
(ツ» these tests are still failing, fix them

GPT takes a different angle. Catches something Claude missed.

(ツ» pytest tests/test_raw_output.py -v

Green. Finally green.

Why alternate models? Because each model has different blind spots. GPT might write a test that’s technically correct but uses mocking patterns Claude handles better. Claude might fix the mock but miss an assertion edge case that GPT catches on the next pass.

It’s adversarial collaboration. Each model is essentially reviewing the other’s work, and bugs that survive one model’s scrutiny often get caught by the other.

No context-switching. No copying error messages between terminals. Everything in one session - just swapping which brain is on the case.


Why This Workflow Works

Different models for different cognitive tasks

Planning is a different skill from code review, which in turn is a different skill from implementation. Using one model for everything is like using a hammer for screws - it technically works, but you’re fighting the tool.

The staged approach catches errors early

The validation stage caught the Rich markup leak before any code was written. Without it, that bug would have surfaced (maybe) when users reported garbled output weeks later.

Regression risk is managed explicitly

By questioning the return-type changes, I avoided an entire class of integration issues. The “capture at the source of truth” pattern emerged from that pushback.

Alternating models surfaces hidden bugs

The ping-pong pattern during testing caught issues that a single model iterating with itself would have missed. Each model brings a different failure mode - and different solutions.

The conversation is the development environment

Every stage happened in the same Aye Chat session:

  • Model switching via model command
  • File generation via prompts
  • Test execution via pytest
  • Undo via restore when something went wrong
  • Diff inspection via diff to verify changes

No IDE. No separate terminal. No copy-pasting between tools.


The Takeaways

  1. Plan first, validate second, implement third. Writing a plan document forces clarity before you touch code.

  2. Switch models for validation. The model that wrote the plan won’t catch its own blind spots. A fresh perspective - even from a different AI - brings a different analytical lens.

  3. Capture at the source of truth. When multiple code paths need the same data, find the single point where it’s guaranteed to be correct. Don’t thread it through function signatures.

  4. Question regression risk explicitly. When implementation requires changing existing contracts, ask: “Is there a way to do this without breaking things?” Usually there is.

  5. Alternate models during test/fix loops. One model writes, the other critiques. Bugs that slip past one often get caught by the other. It’s like having two reviewers who never get tired.

  6. Keep tests in the same session. Running pytest, reading failures, and fixing them without leaving the terminal keeps iteration tight and fast.

This whole feature - planned, validated, implemented, tested, debugged, and shipped - happened in a single Aye Chat session across a few hours.

Not because the feature was hard. Because the workflow made it frictionless.


About Aye Chat

Aye Chat is an open-source, AI-powered terminal workspace that brings AI directly into command-line workflows. Edit files, run commands, and chat with your codebase without leaving the terminal - with an optimistic workflow backed by instant local snapshots.

Support Us

  • Star our GitHub repository - it helps new users discover Aye Chat.
  • Spread the word. Share Aye Chat with your team and friends who live in the terminal.