I’ve been experimenting with something that feels slightly unhinged: using different AI models at different stages of building a feature.
Not because I’m indecisive. Because each model has a different superpower.
GPT-5.2 is great at structured documentation and architectural thinking. Claude Opus 4.6 is terrifyingly good at catching edge cases and writing precise code. So why would I force one model to do everything when I could use them like specialized tools?
This is the story of building a tiny feature called printraw - and how a five-stage, multi-model workflow caught bugs that a single-model approach would have missed entirely.
The Feature: Stop Making Me Fight the Terminal
Here’s the problem: Aye Chat renders AI responses in pretty Rich panels with Markdown formatting and box-drawing characters. Looks great. Feels polished.
But try to copy that text and paste it somewhere else.
You get a mess of line breaks, box characters, and formatting artifacts that make you want to throw your laptop into the sea.
The fix seemed simple: add a printraw command that reprints the last response as plain, copy-friendly text. No panels. No Rich formatting. Just raw text wrapped in delimiters you can select and copy.
The feature itself? Trivial. The workflow I used to build it? That’s what got interesting.
The Five-Stage Pipeline
Here’s what I ended up doing:
| Stage | Task | Model |
|---|---|---|
| 1 | Write the plan | GPT-5.2 |
| 2 | Validate the plan | Claude Opus 4.6 |
| 3 | Implement | Claude Opus 4.6 |
| 4 | Write tests | GPT-5.2 + Claude Opus 4.6 (alternating) |
| 5 | Fix until green | GPT-5.2 + Claude Opus 4.6 (alternating) |
This isn’t “use one model for everything.” It’s a staged pipeline where model selection is intentional - like choosing a screwdriver vs a hammer based on what you’re actually fastening.
Stage 1: Planning with GPT-5.2
I started by describing the UX problem to GPT-5.2 and asking for a complete implementation plan.
GPT-5.2 produced a thorough document covering:
- Command syntax and output format
- Where to capture the last response text
- Two architecture options (store in REPL vs. store in presenter)
- Which files to modify
- Testing approach
- Edge cases
Why GPT-5.2 for planning? It’s genuinely good at organized technical writing. It thinks through tradeoffs without needing to see every line of code. The output was clean, structured, and gave me something concrete to react to.
Stage 2: Validation with Claude Opus 4.6
Here’s where it gets interesting.
I handed the plan to Claude 4.6 with a simple prompt: “Review and validate this plan. Let me know if you’d recommend any adjustments.”
Claude came back with seven specific recommendations, prioritized by impact:
| # | Recommendation | Priority |
|---|---|---|
| 1 | Add `raw` as a short alias | High - usability |
| 2 | Use plain `print()`, not Rich `console.print()` | High - correctness |
| 3 | Shorten delimiter lines | Low - taste |
| 4 | Clarify: summary-only output, not file changes | Medium |
| 5 | Treat whitespace-only summary as empty | Medium - edge case |
| 6 | Note that mid-stream `printraw` is N/A | Low - docs only |
| 7 | Add Rich-markup-leak test case | Medium - correctness |
Recommendation #2 was the one that made me sit up.
The Rich markup leak problem: If you use Rich’s console.print() to output “raw” text, and the AI’s response happens to contain tokens like [bold] or [red], Rich interprets them as markup instead of printing them literally. Your “raw” output comes out formatted. The whole point of the feature is defeated.
The fix - using Python’s built-in print() - is trivial. But I would have missed it without a dedicated review pass.
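To make the failure mode concrete, here’s a minimal sketch (illustrative, not Aye Chat’s actual code) of how Rich swallows bracketed tokens while the built-in print() leaves them alone:

```python
from rich.console import Console

console = Console()
response_text = "Use [bold] sparingly; [red] is for errors."

# Rich treats [bold] and [red] as markup: the tags are consumed and styling
# is applied, so the "raw" output no longer matches the original text.
console.print(response_text)

# The built-in print() writes the string exactly as-is, brackets and all,
# which is what a copy-friendly raw dump needs.
print(response_text)
```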
Why Claude 4.6 for validation? It’s like hiring a polite pedant to review your work. The structured table with priority ratings made it easy to cherry-pick which adjustments to accept. I took recommendations #1 through #5 and skipped the documentation-only items.
Stage 3: Implementation with Claude Opus 4.6
With a validated plan in hand, I asked Claude to implement it.
The first implementation changed the return types of handle_with_command() and handle_blog_command() from Optional[int] to Tuple[Optional[int], Optional[str]] - threading the response text back to the REPL.
I flagged this immediately: “Won’t that introduce regressions and break existing functionality?”
Claude acknowledged the risk and proposed something cleaner: capture the text at the source of truth - inside print_assistant_response() itself, using a module-level variable.
This approach:
- Required zero signature changes
- Had zero regression risk
- Was guaranteed to capture the correct text (whatever was actually printed)
- Worked automatically for all code paths
Much better.
The Bug That Almost Shipped
Even after the refactor, the first test showed the command printing “No assistant response available yet” after a valid response.
Root cause: the initial code tried to capture text using getattr(llm_response, 'answer_summary', None), but the response object’s attribute was actually .summary, not .answer_summary.
The fix was exactly the module-level capture approach - store the text inside print_assistant_response() where the correct string is guaranteed to exist, regardless of what the response object’s attributes are named.
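Here’s a rough sketch of that capture-at-the-source pattern. The names and structure are illustrative - the real print_assistant_response() also does the Rich panel rendering - but it shows the shape of the idea:

```python
# presenter/repl_ui.py (sketch - illustrative names, not the actual file contents)
from typing import Optional

# Module-level holder for whatever text was last shown to the user.
_last_response_text: Optional[str] = None


def print_assistant_response(summary: str) -> None:
    """Render the assistant's response (Rich panel rendering omitted here)."""
    global _last_response_text
    # Capture at the source of truth: whatever string actually gets printed
    # is exactly the string printraw will later reprint.
    _last_response_text = summary
    # ... existing Rich panel rendering of summary happens here ...


def get_last_response_text() -> Optional[str]:
    """Return the most recently printed response, or None if nothing yet."""
    return _last_response_text
```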
Final implementation touched four files:
- `presenter/repl_ui.py` - Module variable + getter + capture logic
- `presenter/raw_output.py` - New file: plain `print()` with delimiters
- `controller/command_handlers.py` - New `handle_printraw_command()` handler
- `controller/repl.py` - Added `printraw` and `raw` to built-in commands
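And a minimal sketch of the output side, again with assumed names: plain print() with delimiters in raw_output.py, plus a handler that falls back to a warning when nothing has been captured yet:

```python
# presenter/raw_output.py (sketch)
from typing import Optional

RAW_DELIMITER = "-" * 8  # short delimiter lines, per recommendation #3


def print_raw(text: Optional[str]) -> None:
    """Reprint text as plain, copy-friendly output using the built-in print()."""
    if text is None or not text.strip():
        print("No assistant response available yet")
        return
    print(RAW_DELIMITER)
    print(text)
    print(RAW_DELIMITER)


# controller/command_handlers.py (sketch)
def handle_printraw_command() -> None:
    # Pull whatever was last rendered and dump it without Rich formatting.
    from presenter.repl_ui import get_last_response_text
    from presenter.raw_output import print_raw

    print_raw(get_last_response_text())
```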
Stages 4 & 5: The Adversarial Testing Loop
Here’s where the multi-model approach got really interesting.
With implementation done, I didn’t just ask one model to write tests and fix them. I ping-ponged between models: one writes, the other critiques and fixes, repeat.
The test coverage needed to include:
- Normal output with delimiters
- Rich markup leak prevention (the `[bold]something[/bold]` case)
- `None` input → warning message
- Whitespace-only input → warning message
- Empty string → warning message
- The capture mechanism
- The handler integration
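For flavor, here’s what a few of these cases might look like with pytest’s capsys fixture, written against the hypothetical print_raw() sketched earlier (the project’s actual tests will differ):

```python
# tests/test_raw_output.py (sketch, using the hypothetical print_raw above)
from presenter.raw_output import print_raw


def test_normal_output_is_wrapped_in_delimiters(capsys):
    print_raw("hello world")
    out = capsys.readouterr().out
    assert "hello world" in out
    assert out.count("-" * 8) == 2  # one delimiter line before, one after


def test_rich_markup_is_printed_literally(capsys):
    print_raw("[bold]something[/bold]")
    out = capsys.readouterr().out
    # Built-in print() must not interpret Rich markup tokens.
    assert "[bold]something[/bold]" in out


def test_whitespace_only_input_warns(capsys):
    print_raw("   \n  ")
    out = capsys.readouterr().out
    assert "No assistant response available yet" in out
```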
Then came the iteration loop - but with a twist.
(ツ» model gpt-5.2
(ツ» write tests for the printraw feature
GPT writes tests.
(ツ» pytest tests/test_raw_output.py -v
Oh God. OH GOD. Red everywhere.
(ツ» model claude-opus-4.6
(ツ» fix the failing tests
Claude fixes things - and often rewrites chunks of GPT’s approach entirely.
(ツ» pytest tests/test_raw_output.py -v
Still red, but fewer failures.
(ツ» model gpt-5.2
(ツ» these tests are still failing, fix them
GPT takes a different angle. Catches something Claude missed.
(ツ» pytest tests/test_raw_output.py -v
Green. Finally green.
Why alternate models? Because each model has different blind spots. GPT might write a test that’s technically correct but uses mocking patterns Claude handles better. Claude might fix the mock but miss an assertion edge case that GPT catches on the next pass.
It’s adversarial collaboration. Each model is essentially reviewing the other’s work, and bugs that survive one model’s scrutiny often get caught by the other.
No context-switching. No copying error messages between terminals. Everything in one session - just swapping which brain is on the case.
Why This Workflow Works
Different models for different cognitive tasks
Planning is a different skill from code review, which is a different skill from implementation. Using one model for everything is like using a hammer for screws - it technically works, but you’re fighting the tool.
The staged approach catches errors early
The validation stage caught the Rich markup leak before any code was written. Without it, that bug would have surfaced (maybe) when users reported garbled output weeks later.
Regression risk is managed explicitly
By questioning the return-type changes, I avoided an entire class of integration issues. The “capture at the source of truth” pattern emerged from that pushback.
Alternating models surfaces hidden bugs
The ping-pong pattern during testing caught issues that a single model iterating with itself would have missed. Each model brings a different failure mode - and different solutions.
The conversation is the development environment
Every stage happened in the same Aye Chat session:
- Model switching via the `model` command
- File generation via prompts
- Test execution via `pytest`
- Undo via `restore` when something went wrong
- Diff inspection via `diff` to verify changes
No IDE. No separate terminal. No copy-pasting between tools.
The Takeaways
- Plan first, validate second, implement third. Writing a plan document forces clarity before you touch code.
- Switch models for validation. The model that wrote the plan won’t catch its own blind spots. A fresh perspective - even from a different AI - brings a different analytical lens.
- Capture at the source of truth. When multiple code paths need the same data, find the single point where it’s guaranteed to be correct. Don’t thread it through function signatures.
- Question regression risk explicitly. When implementation requires changing existing contracts, ask: “Is there a way to do this without breaking things?” Usually there is.
- Alternate models during test/fix loops. One model writes, the other critiques. Bugs that slip past one often get caught by the other. It’s like having two reviewers who never get tired.
- Keep tests in the same session. Running pytest, reading failures, and fixing them without leaving the terminal keeps iteration tight and fast.
This whole feature - planned, validated, implemented, tested, debugged, and shipped - happened in a single Aye Chat session across a few hours.
Not because the feature was hard. Because the workflow made it frictionless.
About Aye Chat
Aye Chat is an open-source, AI-powered terminal workspace that brings AI directly into command-line workflows. Edit files, run commands, and chat with your codebase without leaving the terminal - with an optimistic workflow backed by instant local snapshots.
Support Us
- Star our GitHub repository - it helps new users discover Aye Chat.
- Spread the word. Share Aye Chat with your team and friends who live in the terminal.