AI Writes 90% of My Code.
Here's What That Actually Looks Like.

My test suite was a wall of green. Four weeks into building the feature — every method, every class, every task group passing — and I felt good about it. Then I opened one of those passing tests and actually read it, and the confidence I'd built up over a month quietly came apart.

I'll come back to that test. First, the number, because it's the reason any of this is worth writing down.

"AI writes 90% of our code." You've seen the headline. Some executive says it on a stage, it trends, and every engineer reading it winces — not because it's false, but because it's shallow. It tells you nothing about what those words actually cost. I winced too. Then I spent four weeks living inside that number. In my case it was probably more than 90%. Here's what that looked like — not a flex, a field report.

For a long time my role didn't leave me much time to sit and write code the way a developer does. That's changed lately. I've moved into fractional, consulting-style work, and for the last four weeks I've been head-down building one feature, end to end. First real, sustained coding I've done in a while, and I did almost all of it with AI.

The Throwaway Prototype

It started with a throwaway prototype. I defined the problem, had AI generate a React version top to bottom, and didn't care one bit about code quality. I wanted a working demo so people could react to something real instead of a slide. A few people liked it. For that job — disposable code you'll never maintain — fully AI-generated is exactly the right tool.

Then came the part that's different: putting the feature into the actual product.

I'm Not Going to Call It Spec-Driven Development

Because that term annoys me. But the honest description is that I spent most of my effort before a line of real implementation got written. I drafted a business case, split it into a frontend spec and a backend spec, and ran them as two separate Claude Code projects. Then I moved through a loop of skills I'd set up: plan, brainstorm, scaffold, plan tests, implement, test. The planning skill kept interrogating me — technical questions, business questions, all logged into a plain markdown file — until both of us were convinced we actually knew what we were building. (A tool called gitnexus did a lot of the heavy lifting, helping the model understand how the code connects to itself.)

By the time real implementation started, the plan was genuinely good. The scaffolding was clean. The test plan looked great. Everything looked great.

That was the problem.

Where It Cracked

Here's the first thing that broke my confidence. I had code reading a tenant ID out of a Microsoft Graph response — it expected the value at info.identity.tenantId. The generated test mocked a Graph response that returned exactly that path. Test passed. Green checkmark. Move on.

Except Graph doesn't return that. The real shape is closer to identity.user.tenantId. So the code was wrong — and the test was wrong in precisely the same way, because the same assistant had written both. The mock was built to match the code's assumption, not reality. The two agreed with each other beautifully, and both were wrong.

That's when it clicked: a test that shares the code's assumptions isn't a test. It's an echo.
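To make the echo concrete, here is a minimal Python sketch. Everything in it is illustrative: extract_tenant_id, the tenant value, and both dictionaries are hypothetical, and the "recorded" shape is only an assumption based on the identity.user.tenantId path mentioned above, not a verified Microsoft Graph payload. In practice that fixture should be captured from a real response rather than typed by hand.

```python
from typing import Optional


def extract_tenant_id(response: dict) -> Optional[str]:
    """Hypothetical code under test: assumes info.identity.tenantId."""
    return response.get("info", {}).get("identity", {}).get("tenantId")


# Echo fixture: written from the same assumption as the code above.
echo_response = {"info": {"identity": {"tenantId": "tenant-123"}}}

# Reality-shaped fixture (assumed shape, see the note above the code).
recorded_response = {"identity": {"user": {"tenantId": "tenant-123"}}}

# The echo agrees with the code, so this check can never fail.
print(extract_tenant_id(echo_response))      # tenant-123

# The independent fixture exposes the wrong path immediately.
print(extract_tenant_id(recorded_response))  # None
```

The point of the design is the source of the fixture, not the assertion style: a fixture that comes from somewhere other than the code's author can disagree with the code, and that disagreement is the test.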

And it kept happening, in different costumes. Another test checked that a save method had been called, using assert_called_once(), but never checked what it saved. So the code happily wrote a blank ms_tenant_id to the database and the test reported success, because something had, technically, been called. Mocks of the database hid real constraints that only show up against an actual data store. Methods passed cleanly in isolation while the join between them, the thing the feature actually does, went completely untested.
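A small sketch of the difference between "was it called" and "was it called with the right thing", using Python's unittest.mock. The persist_tenant function, the repo object, and its save(ms_tenant_id=...) signature are hypothetical stand-ins, not the real code from this project.

```python
from unittest.mock import Mock


def persist_tenant(repo, graph_response: dict) -> None:
    """Hypothetical code under test: reads the wrong path, then saves a blank."""
    tenant_id = graph_response.get("info", {}).get("identity", {}).get("tenantId")
    repo.save(ms_tenant_id=tenant_id or "")


def saved_with(repo_mock, expected_tenant_id: str) -> bool:
    """Return True only if save() was called once with the expected value."""
    try:
        repo_mock.save.assert_called_once_with(ms_tenant_id=expected_tenant_id)
        return True
    except AssertionError:
        return False


repo = Mock()
persist_tenant(repo, {"identity": {"user": {"tenantId": "tenant-123"}}})

# Weak check: passes, because something was called.
repo.save.assert_called_once()

# Strong check: fails, because the saved value is blank.
print(saved_with(repo, "tenant-123"))            # False
print(repo.save.call_args.kwargs["ms_tenant_id"] == "")  # True
```

Asserting on the arguments costs one extra line per test and turns the blank write from a silent success into a failure.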

None of this was the AI being malicious. It was something quieter, and worse: when one thing writes both the code and the check on the code, the check inherits the code's blind spots. You don't get verification. You get a mirror.

And here's my confession — I hadn't been reading every line as it went. The plans were good, the tests were green, the momentum felt fantastic. Maybe that was the mistake. Lesson learned. Because once I actually started reading, I saw what had really been written. The basics we put down on autopilot — null checks, mandatory-field validation, datatype checks — were scattered. Present where the model felt like it, absent everywhere else, never enforced unless I said so explicitly.
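One way to stop those basics from being scattered is to enforce them in a single place that every code path has to go through. The sketch below is an assumption about how that could look, not what the project actually does: TenantRecord and display_name are hypothetical names, and only ms_tenant_id comes from the story above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantRecord:
    """Hypothetical record type: validation lives here, not in each caller."""
    ms_tenant_id: str
    display_name: str

    def __post_init__(self) -> None:
        for field_name in ("ms_tenant_id", "display_name"):
            value = getattr(self, field_name)
            # Datatype check (this also rejects None).
            if not isinstance(value, str):
                raise TypeError(f"{field_name} must be a str, got {type(value).__name__}")
            # Mandatory-field check: blank or whitespace-only is not allowed.
            if not value.strip():
                raise ValueError(f"{field_name} must not be blank")


ok = TenantRecord(ms_tenant_id="tenant-123", display_name="Contoso")

try:
    TenantRecord(ms_tenant_id="", display_name="Contoso")
except ValueError as exc:
    print(exc)  # ms_tenant_id must not be blank
```

With the checks attached to the type, a blank ms_tenant_id cannot be constructed at all, so no individual method has to remember to test for it.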

Every time I fixed one bug, two more opened up behind it. I got properly frustrated, more than once. And the sharpest part of the frustration was that it was aimed at myself: I'd been trusting the green checkmark, and the checkmark was green because I'd let the thing being tested write its own test.

After a grinding stretch of reading, instructing, and correcting, it worked. Actually worked this time — against the real Graph shape, the real database, the real round-trip.

The Honest Math

So what did the 90% buy me? Without AI, this feature is probably six weeks of my time. With AI, four. That saving, roughly a third, is real, and I'm not going to pretend otherwise. But the headline hides where the work went. It didn't vanish. It moved. Upstream, into planning and architecture. And sideways, into reading, reviewing, and figuring out which green checkmarks I could actually believe. The typing got cheap. The knowing got expensive.

What I Didn't Expect

The AI could write the code, and it could write the test for that code. What it couldn't do was check either one against reality. It didn't know what Microsoft Graph actually returns. It didn't know that field had to be filled in, not just written to. It didn't know the round-trip between those two methods was the whole point of the feature.
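A rough sketch of what checking against a real data store can look like, using Python's built-in sqlite3 with an in-memory database. The table layout, the constraint, and the save_tenant and load_tenant helpers are all hypothetical, and SQLite is only a stand-in here: the project's actual database is not named in this post, and its constraints may behave differently.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """Create an in-memory store whose schema enforces a non-blank tenant ID."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tenants ("
        " id INTEGER PRIMARY KEY,"
        " ms_tenant_id TEXT NOT NULL CHECK (length(ms_tenant_id) > 0))"
    )
    return conn


def save_tenant(conn: sqlite3.Connection, ms_tenant_id: str) -> int:
    """Insert a tenant row and return its row ID."""
    cur = conn.execute(
        "INSERT INTO tenants (ms_tenant_id) VALUES (?)", (ms_tenant_id,)
    )
    conn.commit()
    return cur.lastrowid


def load_tenant(conn: sqlite3.Connection, row_id: int):
    """Read the tenant ID back, or None if the row does not exist."""
    row = conn.execute(
        "SELECT ms_tenant_id FROM tenants WHERE id = ?", (row_id,)
    ).fetchone()
    return row[0] if row else None


conn = make_store()

# Round-trip: what goes in through one method must come out through the other.
row_id = save_tenant(conn, "tenant-123")
print(load_tenant(conn, row_id))  # tenant-123

# A blank value is rejected by the store itself, which a mock would never do.
try:
    save_tenant(conn, "")
except sqlite3.IntegrityError:
    print("blank ms_tenant_id rejected")
```

The round-trip check exercises the join between the two methods, and the constraint failure comes from the store rather than from anything the test author assumed.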

I knew those things. And here's the part I keep chewing on: I only knew them because I learned them the slow, boring way. Years of squinting at Swagger files that made no sense, API specs written by someone who clearly resented writing them, library docs that were flat-out wrong or three major versions stale.

Hours of suffering in silence — no AI to ask — just me, a debugger, and the stubbornness to stay with a problem until I understood why the thing wanted identity.user.tenantId and not the tidy path I'd assumed.

That's the muscle that caught the echo. Not cleverness, not speed — just the hard-won sense of how these systems behave when you're not looking, the kind you only build by being burned, over and over, with no shortcut. And that's the quiet irony: the AI hands you the output of that suffering without making you do any of it. Wonderful — right up until the output is subtly wrong, and the only thing that can catch it is the exact instinct the AI just spared you from ever building.

So someone has to be the reality the code answers to. For now that someone is still me, and the only reason I can do it is the boring years that came before these four weeks. That was the actual job. Not the 90% that got typed for me.

Will I keep coding this way? Not a chance I'd give it up — the leverage is real. But I clearly need better ways to keep the AI honest, especially about its own tests. Maybe there's a product, or a bit of middleware, hiding in that problem. Maybe that's the next thing I build. We'll see.


That's my four weeks. I'm curious whether it matches yours — has the work moved from writing code to checking it on your teams too? And has anyone found a reliable way to stop the AI from grading its own homework? Because I haven't, not fully.