
Why I Use SWE-Benchmarked Models and My Process Flow for Maintaining a Well-Architected Software Project
A detailed look at why SWE-Bench scores drive my AI model selection, how I use Claude and GPT-4 across different tasks, and the full lifecycle I follow to keep software projects documented, diagrammed, and maintainable.
workflows-and-processes
There is a particular kind of frustration that comes from watching an AI coding assistant confidently produce something that looks correct but falls apart the moment it touches a real codebase. Wrong imports, hallucinated APIs, subtle logic errors buried under syntactically perfect code. After burning enough hours debugging AI-generated output that looked plausible but wasn't, I changed my approach entirely. I stopped choosing models based on marketing and started choosing them based on SWE-Bench scores — and then I built a process around them that treats documentation, architecture, and planning as first-class citizens rather than afterthoughts.
This post covers why SWE-Bench became my north star for model selection, which models I actually use and how I switch between them, and the full lifecycle I follow to keep projects documented, diagrammed, and maintainable over time.
What SWE-Bench Actually Measures and Why It Matters
SWE-Bench is a benchmark that evaluates language models on their ability to resolve real GitHub issues from real open-source repositories. Not toy problems. Not isolated coding puzzles. Real bug fixes and feature implementations pulled from projects like Django, Flask, scikit-learn, and sympy, where the model has to understand the existing codebase, locate the relevant files, reason about the change needed, and produce a working patch.
This matters because it closely mirrors what I actually need from an AI assistant. I am rarely asking a model to write a function from scratch in isolation. I am asking it to understand context, navigate an existing architecture, and produce code that fits into something that already exists. A model can score well on HumanEval by solving self-contained puzzles and still be useless when dropped into a 50,000-line project with interconnected modules, custom abstractions, and implicit conventions.
The verified subset of SWE-Bench (SWE-Bench Verified) is even more telling because it filters out ambiguous or poorly specified issues, leaving a cleaner signal of genuine problem-solving ability. When I see a model performing well on SWE-Bench Verified, I know it has demonstrated the ability to read, reason, and write code that integrates into complex systems — which is exactly the workflow I care about.
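To make this concrete, here is a sketch of what a single SWE-Bench task instance looks like. The field names follow the published dataset schema, but the values are illustrative placeholders, and the resolution check below is a simplified restatement of the benchmark's grading rule, not its actual harness code:

```python
# A sketch of one SWE-Bench task instance. Field names follow the
# published dataset schema; the values are illustrative placeholders.
task = {
    "instance_id": "django__django-12345",  # hypothetical ID
    "repo": "django/django",
    "base_commit": "abc123",  # the commit the model starts from
    "problem_statement": "Text of the real GitHub issue to resolve.",
    "patch": "...the gold patch that actually resolved the issue...",
    "test_patch": "...tests added alongside the fix...",
    # Tests that must flip from failing to passing after the model's patch:
    "FAIL_TO_PASS": ["tests/test_example.py::test_bug_is_fixed"],
    # Tests that must keep passing (no regressions):
    "PASS_TO_PASS": ["tests/test_example.py::test_existing_behavior"],
}

def is_resolved(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> bool:
    """Simplified grading rule: a patch resolves the instance only if every
    FAIL_TO_PASS test now passes and no PASS_TO_PASS test has broken."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())
```

The demand embedded in that structure — start from a real commit, read a real issue, and produce a patch judged by the project's own tests — is what separates SWE-Bench from puzzle-style benchmarks.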
The Models I Use and How I Choose Between Them
I work primarily with Claude (Opus and Sonnet) and GPT-4-class models from OpenAI, and I actively compare their outputs depending on the task. This is not brand loyalty. It is pragmatic model selection based on what each one does well.
Claude Opus is my default for architectural reasoning, long-context code analysis, and any task where I need the model to hold a large amount of project context in its head simultaneously. When I am feeding it a full project specification, multiple files of existing code, and asking it to reason about how a new feature should integrate, Opus consistently produces the most coherent and architecturally aware responses. Its SWE-Bench performance reflects this — it excels at the kind of multi-file, context-heavy reasoning that real software engineering demands.
Claude Sonnet is my workhorse for faster iteration cycles. When I am drafting implementations, refactoring individual functions, or generating boilerplate that I know I will review and modify, Sonnet gives me a strong balance of quality and speed. It is remarkably good at following existing code conventions when given examples, which matters enormously for maintaining consistency across a project.
GPT-4 and its variants serve as my second opinion. I regularly run the same prompt through both Claude and GPT-4 when I am making architectural decisions or evaluating tradeoffs. The models have different failure modes, different biases in how they structure code, and different tendencies in their reasoning. When both models converge on the same approach, my confidence goes up significantly. When they diverge, that is often the most valuable signal of all — it means the problem has genuine ambiguity worth thinking through more carefully.
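The second-opinion loop is simple enough to script. The sketch below abstracts the actual API calls behind callables you would wire to the Anthropic and OpenAI SDKs yourself, and the convergence check is deliberately crude — a normalized string comparison that routes attention rather than renders a verdict:

```python
from typing import Callable

def second_opinion(
    prompt: str,
    ask_claude: Callable[[str], str],  # wrap your Anthropic SDK call here
    ask_gpt: Callable[[str], str],     # wrap your OpenAI SDK call here
) -> dict:
    """Run the same prompt through both models and flag divergence."""
    claude_answer = ask_claude(prompt)
    gpt_answer = ask_gpt(prompt)

    # Crude convergence check: collapse whitespace and case. In practice
    # I read both answers; this only decides where to look first.
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return {
        "claude": claude_answer,
        "gpt": gpt_answer,
        "converged": normalize(claude_answer) == normalize(gpt_answer),
    }
```

Divergent answers get the most scrutiny, because divergence is where the genuine ambiguity lives.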
I do not use one model exclusively because no single model dominates across every dimension. SWE-Bench scores give me a baseline, but my own experience refines the selection. A model that scores two percentage points higher on the benchmark but consistently misunderstands my project's conventions is less useful to me than one that scores slightly lower but integrates more naturally into my workflow.
The Full Lifecycle: Planning Through Maintenance
The real value of using capable models is not in generating code faster. It is in maintaining a disciplined process where every phase of the project is documented, diagrammed, and deliberate. Here is how I structure that lifecycle.
Phase 1: Requirements and Scope Definition
Every project starts with a written specification, and I mean actually written — not a mental model, not a conversation, not a collection of sticky notes. I write a structured requirements document that covers the problem being solved, the constraints I am operating under, the users or systems that will interact with the output, and the explicit non-goals (things this project will not do).
I use AI models at this stage as a thinking partner. I will describe the problem space to Claude Opus, ask it to identify ambiguities or gaps in my requirements, and iterate on the spec based on what it surfaces. This is not the model writing my requirements for me. It is the model stress-testing what I have written by asking the kinds of questions a thoughtful colleague would ask in a design review.
The output of this phase is a document I can hand to anyone — another developer, a stakeholder, my future self in six months — and they can understand what this project is and is not.
Phase 2: Architecture and System Design
Once requirements are locked, I move into architecture. This is where diagramming becomes essential. I produce diagrams for every project, and I treat them as living artifacts, not decorative documentation.
My standard set of architectural diagrams includes system context diagrams showing how the project fits into its broader environment, container diagrams breaking the system into deployable units, component diagrams showing the internal structure of each container, and sequence diagrams for any non-trivial interactions between components.
I use AI models to generate initial diagram definitions in Mermaid or PlantUML syntax, then refine them manually. The model is good at producing a structurally correct first pass from a natural language description of the architecture, and I am good at catching the places where the model oversimplified a relationship or missed a dependency. This collaboration produces better diagrams faster than either of us would alone.
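The first pass can be as mechanical as turning a structured description of components and dependencies into Mermaid syntax. This helper is a sketch of that idea — my own illustration, not part of any diagramming library:

```python
def to_mermaid(components: dict[str, str], edges: list[tuple[str, str, str]]) -> str:
    """Render a component map as a Mermaid flowchart definition.

    components: node id -> display label
    edges: (source id, target id, edge label) triples
    """
    lines = ["graph TD"]
    for node_id, label in components.items():
        lines.append(f'    {node_id}["{label}"]')
    for src, dst, label in edges:
        lines.append(f"    {src} -->|{label}| {dst}")
    return "\n".join(lines)

# Hypothetical three-node system for illustration:
diagram = to_mermaid(
    {"api": "API Gateway", "svc": "Order Service", "db": "Postgres"},
    [("api", "svc", "REST"), ("svc", "db", "SQL")],
)
```

Whether the first pass comes from a helper like this or straight from the model, the refinement step — checking every relationship against what the system actually does — stays manual.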
The architecture phase also includes explicit decisions about technology choices, data flow patterns, error handling strategies, and integration points. Each decision gets documented with its rationale — not just what we chose, but why we chose it and what we considered and rejected. This decision log becomes invaluable when revisiting the project later and wondering why something was done a particular way.
Phase 3: Detailed Design and Interface Contracts
Before writing implementation code, I define the interfaces between components. This means API contracts, data schemas, function signatures with type annotations, and clear definitions of who owns what state. In practice, this looks like writing detailed type definitions, protobuf schemas, or OpenAPI specs depending on the project.
AI models are excellent at this phase because they can rapidly generate interface definitions from architectural descriptions and then help identify inconsistencies. I will describe a component's responsibilities in plain language, have the model produce a typed interface, review it for correctness, and then use that interface as the contract that implementation must satisfy.
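In Python, that contract often takes the shape of a `Protocol` plus typed data definitions. Here is a sketch for a hypothetical order-storage component — the names and fields are invented for illustration:

```python
from dataclasses import dataclass
from typing import Optional, Protocol

@dataclass(frozen=True)
class Order:
    order_id: str
    total_cents: int  # integer cents avoids float rounding drift

class OrderStore(Protocol):
    """The contract every storage implementation must satisfy."""
    def save(self, order: Order) -> None: ...
    def get(self, order_id: str) -> Optional[Order]: ...

# Any implementation — in-memory for tests, a real database in production —
# is written against this contract, and callers depend only on OrderStore.
class InMemoryOrderStore:
    def __init__(self) -> None:
        self._orders: dict[str, Order] = {}

    def save(self, order: Order) -> None:
        self._orders[order.order_id] = order

    def get(self, order_id: str) -> Optional[Order]:
        return self._orders.get(order_id)
```

An interface that is hard to express this cleanly is usually the early warning that the component boundary itself is wrong.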
This is the phase where I catch the most design problems. An interface that is awkward to define is usually a sign that the component boundaries are wrong. By front-loading this work before any implementation begins, I avoid the much more expensive process of discovering architectural misalignments after code is written.
Phase 4: Implementation with Continuous Documentation
Implementation is where the models earn their SWE-Bench scores. I work in small, well-scoped increments. Each unit of work has a clear objective, clear inputs and outputs, and a clear relationship to the interfaces defined in Phase 3.
I use AI models to generate initial implementations against those interface contracts, and I review every line. The review process is where my knowledge of the project's conventions, edge cases, and operational requirements comes in. The model produces structurally correct code rapidly; I ensure it is semantically correct for this specific context.
Documentation happens continuously during implementation, not after. Every module gets a docstring explaining its purpose and its relationship to the architecture. Complex logic gets inline comments explaining why, not what. Configuration choices get documented where they are defined. I use the AI model to draft documentation from the code and then edit it for accuracy and tone, which is dramatically faster than writing documentation from scratch.
Phase 5: Testing and Validation
Testing follows implementation at every level. Unit tests validate individual components against their interface contracts. Integration tests validate the interactions defined in the sequence diagrams. End-to-end tests validate the user-facing behaviors defined in the requirements document.
I use AI models to generate test scaffolding and edge case identification. The model is particularly good at enumerating edge cases I might not think of — boundary conditions, null inputs, race conditions in concurrent code, malformed data in parsing logic. I review and extend the generated tests, but the model's ability to systematically enumerate failure modes saves significant time.
The key discipline here is that test coverage maps directly back to the requirements and architectural decisions documented in earlier phases. Every test should trace to a requirement. Every requirement should have tests. The documentation trail makes this traceability explicit rather than implicit.
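That traceability can be checked mechanically. The sketch below assumes a convention — mine, not a standard — where each test declares the requirement IDs it covers (via a marker, a naming scheme, or a registry); the check then surfaces both untested requirements and tests citing requirements that no longer exist:

```python
def traceability_gaps(
    requirements: set[str],
    test_requirement_map: dict[str, set[str]],
) -> tuple[set[str], dict[str, set[str]]]:
    """Return (untested requirements, tests citing unknown requirements).

    test_requirement_map: test name -> requirement IDs it claims to cover.
    """
    covered: set[str] = set()
    for ids in test_requirement_map.values():
        covered |= ids
    untested = requirements - covered
    unknown = {
        test: ids - requirements
        for test, ids in test_requirement_map.items()
        if ids - requirements
    }
    return untested, unknown
```

Run as part of CI, a check like this turns "every requirement should have tests" from an aspiration into a failing build.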
Phase 6: Review, Refactoring, and Maintenance
Projects do not end at initial implementation. The documentation and diagrams created in earlier phases serve as living references that evolve with the codebase. When I modify a component, I update its diagram. When I change an interface, I update the contract documentation. When I add a capability, I trace it back through the requirements document.
I use AI models for code review by feeding them both the code and the architectural documentation, then asking them to identify places where the implementation has drifted from the documented design. This is a powerful technique because drift happens gradually and invisibly — a small shortcut here, a quick workaround there — and without systematic checking, the documentation becomes fiction.
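The mechanical part of that review is just assembling the two artifacts into one prompt. A sketch — the instruction wording is my own, not a fixed template:

```python
def drift_review_prompt(architecture_doc: str, code: str, component: str) -> str:
    """Pair the documented design with the current code for a drift review."""
    return "\n".join([
        f"You are reviewing the '{component}' component for drift from its documented design.",
        "",
        "=== Documented architecture ===",
        architecture_doc,
        "",
        "=== Current implementation ===",
        code,
        "",
        "List every place the implementation deviates from the documented design,",
        "citing the relevant section of the document and the relevant code.",
    ])
```

The value is less in any single review than in running it on a schedule, so drift is caught while it is still one shortcut rather than an accumulated fiction.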
Periodic architecture reviews, where I revisit the system diagrams and ask whether the implemented system still matches the documented architecture, are a critical maintenance practice. AI models help by doing the tedious comparison work, flagging discrepancies for my review.
Why This Process Works: The Compound Effect of Documentation
The individual phases are not revolutionary. People have been writing specs, drawing diagrams, and testing code for decades. What makes this process effective is the compound effect of doing all of them consistently, with AI models amplifying each phase.
When requirements are written down, architecture discussions become productive because everyone is working from the same understanding. When architecture is diagrammed, implementation stays on track because the structure is visible, not just imagined. When interfaces are defined before implementation, integration problems surface early when they are cheap to fix. When documentation is maintained continuously, onboarding new contributors (including your future self) takes hours instead of weeks.
The AI models make this process sustainable by handling the high-volume, pattern-heavy work — generating initial drafts, enumerating edge cases, maintaining consistency across documents — while I focus on the judgment-heavy work of deciding what to build, how to structure it, and whether the output is correct for this specific context.
SWE-Bench scores told me which models could handle the code generation part. The process I built around them ensures that code generation happens within a structure that keeps the project maintainable, understandable, and well-documented over its entire lifetime.
Closing Thought
Choosing AI models based on SWE-Bench performance is table stakes. The real leverage comes from building a disciplined process that treats every phase of software development — from requirements through maintenance — as something worth documenting, diagramming, and reviewing. The models are tools. The process is the craft. When both are strong, the output is software that not only works today but remains comprehensible and maintainable long after the initial implementation is done.

Jonathan Hines Dumitru
Software architect focused on translating ambiguous ideas into fully shippable native applications.