Using AI for coding

The value of 90% of my skills just dropped to $0. The leverage for the remaining 10% went up 1000x. I need to recalibrate. - Kent Beck

  1. https://www.reddit.com/r/ExperiencedDevs/comments/1l1b6n8/i_introduced_agentic_ai_into_my_codebase_two_and/

I introduced agentic AI into my codebase two and a half weeks ago and today I am scrapping it for parts – sort of. As I mentioned in the title, I introduced agentic AI into my codebase a few weeks ago and I wanted to write down my thoughts. This will likely be a long post, a testimonial of sorts, so I will provide a well-deserved TL;DR for those who are exhausted by all the AI posts. I am a tech lead with 10 YOE, for context.

A few months ago I started working on a social media application (think in the BlueSky space). Not federated (at least not right now), but open source and self-hostable. It was a passion project of mine and everything was written by hand with little-to-no AI help. Development was slow but consistent, the project was open and available, people were chatting with me about it, and I was content. One notable thing though – my available time to dev was extremely hit-or-miss because I have a 5-month-old at home. I was only able to focus after everyone else in the house was asleep. So naturally I was keen to try out some of the new agentic solutions that had been released in the past month.

The stack of the project was simple:

  1. React Native (mobile)
  2. Next.js (web)
  3. Nest.js (backend)
  4. Postgres (data)
  5. S3 (object store)

My only experience before this was either querying ChatGPT or Copilot in VS Code as a Stack Overflow replacement. I had even turned off Copilot’s autocomplete functionality as I found it to be verbose and incorrect half the time. After setting up (well, navigating to) agent mode in VS Code I gave myself a few ground rules:

No metered models. Agents operate by brute-forcing iterations until they assert on the correct output. I do not trust agents with metered models, and frankly, if something needs that much iteration to be correct I can likely do it myself. I did break this rule when I found out that Sonnet 4 was unlimited until June. I figured “why not” and planned to jump back to GPT-4.1 later. More on that in a bit.

Review every line of code. This was not a vibecoding exercise. I wanted to augment my existing engineering workflow to see how I could increase my development velocity. Just like in real life on real projects, there needs to be a metaphorical meat shield for every line of code generated and merged into the codebase. If this is the future, I want to see how that looks.

No half assing. This may seem obvious, but I wanted to make sure that I followed the documentation and best practices of the agentic workflow. I leveraged copilot-instructions.md extensively, and felt that my codebase was already scaffolded in a way that encouraged strong TDD and rational encapsulation with well-defined APIs. I told myself that I needed this to work to get my project out the door. After all, how could I compete with all the devs who are successfully deploying their projects with a few prompts?
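To make “rational encapsulation with well-defined APIs” concrete, here is a minimal hypothetical sketch – the PostService names and the test are invented for illustration, not taken from the project – of the kind of contract-plus-test scaffolding that gives an agent something firm to iterate against:

```typescript
// Hypothetical names throughout; a sketch of the scaffolding style, not the real code.

export interface CreatePostInput {
  authorId: string;
  body: string;
}

export interface Post extends CreatePostInput {
  id: string;
  createdAt: Date;
}

// A deliberately narrow, well-defined API surface for the agent to implement against.
export interface PostService {
  create(input: CreatePostInput): Promise<Post>;
  findByAuthor(authorId: string): Promise<Post[]>;
}

// A trivial in-memory implementation, enough to make the contract testable.
export class InMemoryPostService implements PostService {
  private posts: Post[] = [];

  async create(input: CreatePostInput): Promise<Post> {
    const post: Post = { ...input, id: String(this.posts.length + 1), createdAt: new Date() };
    this.posts.push(post);
    return post;
  }

  async findByAuthor(authorId: string): Promise<Post[]> {
    return this.posts.filter((p) => p.authorId === authorId);
  }
}

// Jest-style spec: generated code has to satisfy the test, not the other way around.
describe("PostService", () => {
  it("returns only the requested author's posts", async () => {
    const service: PostService = new InMemoryPostService();
    await service.create({ authorId: "alice", body: "hello" });
    await service.create({ authorId: "bob", body: "hi" });

    const posts = await service.findByAuthor("alice");
    expect(posts).toHaveLength(1);
    expect(posts[0].body).toBe("hello");
  });
});
```

The point of scaffolding like this is that the agent’s output gets judged against a contract and a failing test rather than against vibes.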

A period of de-disillusionment.

I came into this exercise as probably one of the more cynical people about AI development. I have had multiple friends come to me, say “look what I prompted”, and show me some half-baked UI that has zero functionality and only one intended use-case. I would ask them basic questions about their project. How is it deployed? No answer. What technologies are you using? No answer. Does it have security? No answer. I offered them a warning and wished them good luck, but internally I was seething. Non-technical folks, people that have never worked even adjacently in tech, are now telling me I will lose my job because they can prompt something that doesn’t even qualify as an MVP? These same folks were acting like what I did was wizardry merely a few years ago.

As I had mentioned, I became worried that I was missing out on something. Maybe in the hands of the right individual these tools could “sing”, so to speak. Maybe this technology had advanced tremendously while I sat on the beach with my head buried in the sand. As with most things in this industry, I decided that if I needed to learn it I would just fucking do it and stop complaining about it. I could not ignore the potential of it all.

When I went to introduce “agent mode” to my codebase I was absolutely astonished. It generated entire vertical slices of functionality like it was nothing. It compiled the code, it wrote tests, it asserted the functionality against the tests. I kid you not, I did not sleep that night. I was convinced that my job was going to be replaced by AI any day now. It took a ton of the work that I would consider “busy work” – a.k.a. CRUD on a database – and implemented it in 1/5th of the time. Following my own rules, I reviewed the code. I prompted it with recommendations, did some refactoring, and it handled it all amazingly. At face value, this looked like a 3-day story I would have assigned to a junior dev without thinking twice about it.
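For a rough sense of the “busy work” being handed off, here is a hypothetical sketch of that kind of CRUD vertical slice in the project’s Nest.js style – the comments resource, its fields, and the in-memory store are invented for illustration; a real slice would sit on Postgres and ship with tests:

```typescript
import { Body, Controller, Delete, Get, Injectable, Param, Post } from "@nestjs/common";

// Hypothetical resource; field names invented for illustration.
interface Comment {
  id: number;
  postId: number;
  authorId: string;
  body: string;
}

@Injectable()
export class CommentsService {
  private comments: Comment[] = [];
  private nextId = 1;

  create(data: Omit<Comment, "id">): Comment {
    const comment = { id: this.nextId++, ...data };
    this.comments.push(comment);
    return comment;
  }

  findForPost(postId: number): Comment[] {
    return this.comments.filter((c) => c.postId === postId);
  }

  remove(id: number): void {
    this.comments = this.comments.filter((c) => c.id !== id);
  }
}

@Controller("posts/:postId/comments")
export class CommentsController {
  constructor(private readonly comments: CommentsService) {}

  @Post()
  create(@Param("postId") postId: string, @Body() input: { authorId: string; body: string }) {
    return this.comments.create({ postId: Number(postId), ...input });
  }

  @Get()
  list(@Param("postId") postId: string) {
    return this.comments.findForPost(Number(postId));
  }

  @Delete(":id")
  remove(@Param("id") id: string) {
    this.comments.remove(Number(id));
  }
}
```

Mechanical, well-trodden work of this shape is exactly what the agent was fastest at.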

I was hooked on this thing like crack at this point. I prompted my ass off generating features and performing refactors. I reviewed the code and it looked fine! I was able to generate around 12k lines of code and delete 5k lines of code in about 2 weeks. In comparison, I had spent around 2 months getting to 20k lines of code or so. I know LOC is not a great metric of productivity, I’ll be the first to admit, but I frankly cannot figure out how else to describe the massive increase in velocity I saw in my code output. It matched my style and syntax, would check linting rules, and would pass my CI/CD workflows. Again, I was absolutely convinced my days of being a developer were numbered.

Then came week two…

Disillusioned 2: The Electric Boogaloo

I went into week two willing to snort AI prompts off a… well, you know. I was absolutely hooked. I had made more progress on my app in the past week than in the past month. My ability to convert my thoughts into code felt natural, like an extension of my domain knowledge. The code was functional and clean, needing little feedback or intervention from the AI’s holy despot – me.

But then, weird stuff started happening. Mind you, I was using what M$ calls a “premium” model. For those who don’t know, these are models that convert inordinate amounts of fossil fuels into shitty React apps that can only do one thing poorly. I’m kidding, sort of, but the point I’m trying to make is that these are basically the best coding models out there right now. Sonnet 4 had just been released, and Anthropic’s models have been widely claimed to be the best coding models available for generative AI. I had broken rule #1 in my thirst for slop and needed only the best.

I started working on a feature that was “basically” the same feature every other social media app has, but with a very unique twist (no spoilers). I prompted it with clear instructions. I gave it feedback on where it was going wrong. Every single time, it would either get into an infinite loop or go down the wrong rabbit hole. Even worse, the agent would take fucking forever to admit it failed. My codebase was also about 12k lines larger at this point, and with those additional 12k lines of code came an inordinate increase in the context of the application. No longer was my agent able to grep for keywords and find 1 or 2 results to iterate on. There were 10, 20, even 30 references sometimes to the pattern it was looking for. Even worse, I knew that every failed iteration of this model would, had this been after June 3rd(?), have been on metered billing. I was getting financially cucked by this AI model every time it failed, and it would never even tell me.

I told myself “No, I must be the problem. All these super smart people are telling me they can have autonomous agents finishing features without any developer intervention!” I prompted myself a new asshole, digging deep into the code and cleaning up the front-end. I noticed there had been a lot of sneaky code duplication across the codebase that was hard to notice in isolated reviews. I also noticed that names don’t fucking matter to an AI. It will give something the right name, but there is absolutely no guarantee the functionality actually does that thing. I’ll admit, I probably should never have accepted these changes in the first place. But here’s the thing – these changes looked convincingly good. The AI was confident, had followed my style guide down to the letter, and I was putting in the same amount of mental energy that I put into any junior engineer’s PR.
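A contrived TypeScript illustration of that naming failure (invented, not from the post’s codebase): the function reads as if it enforces the whole visibility rule, but nothing ties the body to the name.

```typescript
interface Post {
  id: string;
  authorId: string;
  visibility: "public" | "followers" | "private";
}

interface Viewer {
  id: string;
  followingIds: string[];
  blockedUserIds: string[];
}

// The name promises a complete answer, and a reviewer skimming the call site
// will assume it delivers one. But "followers" visibility is never checked
// against the follow graph, and block lists are ignored entirely.
function isPostVisibleToUser(post: Post, viewer: Viewer): boolean {
  if (post.visibility === "public") return true;
  return post.authorId === viewer.id;
}
```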

I made some progress, but I started to get this sinking feeling of dread as I took a step back and looked at the forest instead of the trees. This codebase no longer had the attention to detail and care that I had put into it. I was no longer proud of it, even after spending a day sending it on a refactor bender.

Then I had an even worse realization. This code is unmaintainable and I don’t trust it.

Some thoughts

I will say, I am still slightly terrified for the future of our industry. AI has emboldened morons with no business ever touching anything resembling code to think they are now Software Engineers. It degrades the perception of our role and dilutes the talent pool. It makes it very difficult to identify who is “faking it” vs. who is the real deal. Spoiler alert – it’s not leetcode. These people are convincing cosplayers with an admitted talent for marketing. Other than passive-aggressively interrogating my non-technical friends about real SWE principles using their own generated projects, I don’t know how to convince them that they don’t know what they don’t know. (Most of them have started their entire project from scratch 3 or 4 times after getting stuck at this point.)

I am still trying to incorporate AI into my workflow. I have decided to fork my project pre-AI into a new repo and start hand-implementing all the features I generated, from scratch, using the generated code as loose inspiration. I think that’s really where the limit of AI should be – these models should never generate code directly into a functional codebase. They should either analyze existing code or provide examples as documentation. I try to use the inline cmd+i prompt tool in VS Code occasionally, with some success. It’s much easier and more predictable to prompt a 5-line function than an entire vertical feature.
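The kind of prompt that scope suits is a single, boring, fully specified helper – something like this generic slugify function (an invented example, not from the project):

```typescript
// "Turn a post title into a URL-safe slug" is about the right size for an
// inline prompt: one input, one output, no surrounding context to misread.
export function slugify(title: string): string {
  return title
    .toLowerCase()
    .normalize("NFKD")                // split accented characters from their diacritics
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining diacritics
    .replace(/[^a-z0-9]+/g, "-")      // collapse everything else into single dashes
    .replace(/^-+|-+$/g, "");         // trim leading/trailing dashes
}

// slugify("Hello, Wörld!") === "hello-world"
```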

Anyways, I’d love to hear your thoughts. Am I missing something here? Has this been your experience as well? I feel like I have now seen both sides of the coin and really dug deep into learning what LLM development really is. Much like a lot of hand written code, it seems to be shit all the way down.

Thank you for listening to my TED talk.

TL;DR I tried leveraging agentic AI in my development workflow and it Tyler Durdened me into blowing up my own apartment – I mean codebase.


  1. https://noelrappin.com/blog/2025/05/what-do-i-think-i-think-about-llms/

Caring Matters

I’m working on a rule of thumb here, which is that the usefulness of LLMs in generating code is inversely proportional to how much I care about the code.

One-off script in a language I don’t know very well that is not going to be reused? Don’t really care; an LLM can probably help a lot.

Throw-away prototype? I literally don’t care about the code itself. LLM probably helpful.

End-to-end test of a system that I already have unit tested? I care that it does what it says, but the style is pretty constrained. LLM could be helpful.

Writing code in a well-constrained piece of business logic? Depending on the constraints, I’m probably arguing about style and long-term issues, so I’m thinking it’s helpful to start with and then less so.

Business critical function under web-scale load? I’m going to want to look over that carefully.

Another way of looking at it is that as the code becomes more critical, the typing part that is what the LLM most clearly speeds up becomes less and less of the overall task. (You can get LLMs to help with design, but that takes more back and forth, so it’s not as much of a time saver.)


What an ideal time for LLMs to take up the mantle of our great hope: a period of post-modernism and post-truth, a race-to-the-bottom age of charlatans, snake oil salesmen and outright conmen, from the very top on down. A day when your opinions hold as much water as my facts.


My few attempts on different codebases are leading to the same conclusion. It tries to mimic us, but it’s not doing the job. It gives the impression that it’s doing the job. That’s what’s tricky. Doing an impression is hard for a human, but for LLMs it’s their specialty – it’s easy. And an impression, as impressive as it is, doesn’t replace anyone when you actually need a complex job done over time (LLMs can’t comprehend maintainability the way we do).

There is no skill in this tool; there is only writing text faster than we can. But what good is an automation tool you can’t trust? I genuinely get some value out of it in some places, but I still need to do everything myself. I need to use my own thinking to force it to do what I want. Most of the time, it’s just a waste of time for me.


It is most useful for some of the obvious things, like boilerplate code or formatting comments. The dumber the thing you ask it to do, the better it does.

If you’re writing something it hasn’t seen much of before - like deep inside a game engine - its suggestions are often completely wrong. Sometimes it hallucinates APIs and tries to call functions that do not exist. Sometimes it gets it completely wrong. Sometimes it does the exact opposite of what you want.
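A contrived sketch of what that looks like in practice – the tiny engine interface below is invented, which is exactly the point: the hallucinated calls sound plausible but exist nowhere.

```typescript
// A hypothetical, deliberately tiny engine surface: these are the only methods that exist.
interface Entity {
  position: { x: number; y: number; z: number };
}

interface Scene {
  entities(): Entity[];
}

declare const scene: Scene;

// What an assistant tends to produce when it hasn't seen much of this engine.
// Both calls below are plausible-sounding inventions that are not on the interface,
// so they wouldn't even type-check, let alone run:
//
//   const players = scene.getEntitiesByTag("player");
//   players[0].applyImpulse({ x: 0, y: 5, z: 0 });
```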


What AI is really good at is making output that looks great. The problem is that when our brains see output that looks great, we assume it’s great. Our defences go down and we attribute trust - because it looks right. Humans are real suckers for things that look good.

Previously we could easily spot something as being wrong, because it often looked wrong - often so glaringly wrong we could spot it from a mile away. The large commercial LLMs are literally one of mankind’s most impressive creations, running in massive data centres with dedicated power generation - an engineering marvel whose sole purpose and specialty is predicting what token comes next to make the response look like what a correct response should look like.

Don’t forget how these things work. They do not understand causality (proven); they can fake it, but they don’t have a world model to truly understand that if I do XYZ here, then later on this other thing might happen as a consequence.


I’ve always been under the impression that the capabilities of these things are a mile wide and an inch deep. They can get you started. They can provide suggestions and nudge you in the right direction. But they simply don’t have the context or reasoning ability to put something larger and more cohesive together. And they’ll stick you with doing the boring bit: reading what is effectively someone else’s code for hours on end, instead of getting in there and doing it yourself.

I think the scary thing about these agentic AI products is that, yeah - they demo extremely well. Enough to convince management types, enough to convince the casual audience that they’re truly capable of anything. Add on some hyperbole and extrapolation of “just think where this technology will be in 2 years’ time!” and you have the makings of a runaway hype train, from which the mediocre, elated at finally having a tool by which to create a facsimile of competence, scream loudly at the windows for you to jump on board - destination: straight off a cliff!


I think that’s one of the biggest traps of AI. It’ll work great when you try it out at a small scale. And it’s so tempting to assume that “ok, if it works for this case, it’ll keep working as the project grows. Maybe a bit slower, maybe I’ll have to pay a bit more, whatever, but it’ll keep working”.

But for LLMs, a larger context just means hallucinations. The model stops being able to keep track and starts making shit up and getting things wrong, without really letting you know that it has hit this scaling limit.


LLMs are just imitation intelligence… The ultimate absolute best version of cargo culting.

Impressive on the surface but unable to actually learn and only imitating understanding.


It’s insane how uncannily this whole thing mirrors my recent experience. I took on a side project to make a web app with agentic assistance, with the same skepticism; had the same insane high where I was staying up night after night to pump out progress; then broke my own vows and fell into desperate infinite-loop traps; and have now ended up completely disillusioned with a technically working but incredibly uncanny codebase.

This may sound crazy, but things like the silent duplication and the seemingly “correct but not actually” stuff are genuinely creepy to me. They arouse some sort of deep, existential fear - that this appears manmade on the surface but is really constructed from a “logic” that isn’t humanlike at all. It’s hard to explain - I’ve seen plenty of terribly written monkey code, but there’s something inherently different between the way a human fucks up and the way an LLM’s output fails - the latter is almost deceptive, the yarn spun around itself like a fucking Möbius strip. Even when it does work, it’s still an oddly convoluted logical pattern that just doesn’t make intuitive sense.
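A contrived example of that silent duplication (invented, not from either codebase): the same rule re-implemented under two names in two slices, with a subtle drift between them, invisible to any review that only looks at one file at a time.

```typescript
// feed/visibility.ts – generated alongside the feed feature.
export function canViewerSeePost(viewerId: string, authorId: string, blockedIds: string[]): boolean {
  // Blocked viewers are excluded, but the author can always see their own post.
  return viewerId === authorId || !blockedIds.includes(viewerId);
}

// notifications/filters.ts – generated weeks later alongside notifications.
// Same rule, different name, and the author exception has quietly disappeared.
export function shouldNotifyUser(userId: string, authorId: string, blockedIds: string[]): boolean {
  return !blockedIds.includes(userId);
}
```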

So yeah, I’ve come to the same conclusion that I want to start fresh and actually implement everything myself. I think the LLMs are best used to assist with isolated logical blocks, strictly confined in scope, and at best, for consulting-esque discussion of broader design ideas.

I just have to get started… having said all this, it’s hard to bring back the same drive I had those first few crack-fueled weeks.


In my opinion AI is very good at small, clearly defined tasks. But you always have to check what it produces, because the hallucinations will appear sooner or later, and if it has generated a dozen thousand lines of code unchecked, it’s close to impossible to fix on your own - or at least slower than writing the code yourself.