The Software Engineer is Dead, Long Live the Software Engineer!

[Image composed with DALL·E]

An undeniable killer use case for large language models is code generation. It’s already helping software developers write code up to twice as fast and with more delight.

As LLMs improve with each release, they can write better software and handle more complex architectures at a fraction of the cost of older-generation models. Take this trajectory to its most freakish conclusion and you’ll have a future where AI writes all of the world’s code. In that instance, we won’t need software engineers, right?

This future is far more distant than sensationalists would have you believe. The genAI hype obscures the reality of what’s likely to happen over the next few years. I know this because even though I’m not a professional software developer, I write code with AI assistance every week, and my day job is to find and invest in talented tech entrepreneurs. I’ve seen first-hand and heard from other developers what makes LLMs brilliant and the areas where AI falls short.

[An example of code autocomplete in Python]

For example, try building a full-stack app with dozens of files and multiple API services. You’ll find that LLMs often get lost in the sauce when you have thousands of lines of code. No doubt, longer context windows (how much you can stuff into a model’s prompt) will solve some of this, but what about code maintenance?

If an AI writes most (or all) of the code, you still need someone who can interrogate and navigate the codebase. Why? Well, if the features of some arcane library get deprecated, for example, and the AI lacks sufficient training on what’s new, you’ll need a human to step in and help.

Or consider the seemingly simple task of understanding what users want or need. Software engineers don’t just write code. Knowing the syntax of a computer language is the easiest part of the job. You must also talk to people, understand their needs, distil those requirements into something more precise and achievable, and then balance the costs and benefits of various tools and approaches. AI will get good at this someday, but the world is too complex for LLMs to eliminate all software engineering jobs in the near term.

Some software engineers are already being replaced, though. People who use AI to compose better software are starting to replace those who don’t. And these AI-augmented engineers will have more work than ever. This is because things that were previously too expensive to automate will now be cheaper to codify. Just think about all the weird manual processes you tolerate because there’s no good software for it. AI-augmented software developers and hobbyists will be able to tackle these areas with greater speed and ease.

In my own life, I’ve composed code with AI to complete various mundane tasks. For example, I wrote a Python script to export my Spotify playlists to another streaming platform instead of paying for some obscure software. I hacked together a full-stack app to help me find meeting venues. I built an app to summarise YouTube videos instead of waiting for Google to make the feature. I’ve also experimented with AI agents that conduct market research, among other things.

The software engineer of the pre-LLM era is clearly dead. But long live the software engineer who augments their craft with AI. Just as high-level programming languages eliminated the need to write tedious low-level machine code, AI will free up software engineers to focus more on creativity and innovation, two things the world needs now more than ever.

Day 100 of 100 Days of AI 🎉

AI agents have been getting a lot of attention, but they can be expensive to run. This is because they use a lot of tokens and make multiple calls to LLMs. Some AI developers have beaten performance benchmarks by using exceptionally expensive agentic runs. However, this approach is unrealistic because in real-world business applications, costs matter.

This brings me to a new AI paper from researchers at Princeton University. In a nutshell, the researchers argue that cost should be considered alongside accuracy when running evaluations of AI agent systems. The paper also highlights techniques that can produce high accuracy at a massively lower cost, as the chart below on accuracy versus cost shows.

A sample of the techniques used to generate high accuracy at lower cost is listed below. They are also labelled on the dots in the chart above, and a rough code sketch of all three follows the list.

Retrying Strategy: This is where you set the temperature of a model to zero and call it a fixed number of times if it keeps failing a specified test. You can do this with cheaper models like GPT-3.5, retrying just a handful of times until you get a desired answer. Note that even when temperature is set to zero, these models are still stochastic, so repeated calls can yield different outputs.

Warming Strategy: Here you do the same as the retry strategy, but you incrementally increase the temperature with each try, going from 0 up to 0.5. Turning up the temperature (aka randomness) increases the variety of answers and could lead to a successful result more quickly and cheaply.

Escalation Strategy: You start with a cheaper model and escalate to more intelligent, expensive models only if the first prompt attempts fail. This conserves resources, spending on the pricier models only when the cheaper ones fall short.
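To make this concrete, here’s a minimal Python sketch of how the three strategies might fit together. It assumes the OpenAI SDK; `passes_test` is a hypothetical placeholder for whatever check your application uses, and the schedules (five tries, temperatures from 0 to 0.5) simply mirror the descriptions above rather than the paper’s actual implementation.

```python
# A rough sketch of the retrying, warming, and escalation strategies.
from openai import OpenAI

client = OpenAI()

def passes_test(answer: str) -> bool:
    # Hypothetical placeholder for your own check,
    # e.g. "does the output parse as valid JSON?"
    return answer.strip() != ""

def call(model: str, prompt: str, temperature: float) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

def retry_with_warming(prompt: str, model: str = "gpt-3.5-turbo",
                       max_tries: int = 5) -> str | None:
    # Retrying: call a cheap model a fixed number of times.
    # Warming: nudge the temperature up on each attempt,
    # from 0 towards 0.5, to get more varied answers.
    for attempt in range(max_tries):
        temperature = 0.5 * attempt / max(max_tries - 1, 1)
        answer = call(model, prompt, temperature)
        if passes_test(answer):
            return answer
    return None

def escalate(prompt: str, models=("gpt-3.5-turbo", "gpt-4o")) -> str | None:
    # Escalation: move to the smarter, pricier model only
    # when the cheaper one keeps failing the test.
    for model in models:
        answer = retry_with_warming(prompt, model=model)
        if answer is not None:
            return answer
    return None
```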

The paper has more on the topic and it’s worth reading in full if you’re building real-world agentic applications.

Day 99 of 100 Days of AI

I switched one of my AI agent projects from GPT-4o to the new Claude 3.5 Sonnet model. The agents now run at roughly 85% lower cost, and the final results arrive about 15% faster.

To upgrade the project, all I had to do was write the code below. This is one of the great advantages of writing apps where models are interchangeable: every few months we get better, faster, and cheaper upgrades, and developers can benefit from the new intelligence with minimal code changes.
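In rough terms, the swap looks like this. This is a sketch, assuming a CrewAI agent backed by LangChain’s model wrappers; the model name string is the one Anthropic published for Claude 3.5 Sonnet.

```python
# Before: agents ran on GPT-4o through the OpenAI wrapper.
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-4o")

# After: a near one-line swap to Claude 3.5 Sonnet.
from langchain_anthropic import ChatAnthropic
from crewai import Agent

llm = ChatAnthropic(model="claude-3-5-sonnet-20240620")

researcher = Agent(
    role="Researcher",
    goal="Summarise sources for the project",
    backstory="A diligent analyst.",
    llm=llm,  # swapping this object is essentially the whole upgrade
)
```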

Day 98 of 100 Days of AI

On the way home today I was listening to this podcast, “Lessons from a Year of Building with LLMs”. The biggest takeaway? While everyone has been building lots of experimental projects with LLMs, often shipping scrappy builds of seemingly impressive demos straight to production (I’m guilty of this!), few people are running effective evaluations (“evals”) of their LLM apps.

To build LLM apps that can be taken seriously and used reliably, it’s important to spend time understanding what performance metrics matter and how to improve them. To do that, you have to plan and implement evals early on in the process. You then need to continuously monitor performance.

One simple example the podcast guests gave: if you are building an LLM app that should respond with 3 bullet points each time, you can build tests or a metric that tracks what percentage of responses adhere to the 3-bullet-point format. (Note: LLMs are probabilistic, so they might not hit 100%.) This enables you to catch errors and iteratively improve your app.
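As a rough sketch of what that could look like in code (the bullet-matching regex is just an assumption about how your app formats its output):

```python
import re

def has_three_bullets(response: str) -> bool:
    # Count lines that look like bullet points ("-", "*", or "•").
    bullets = [line for line in response.splitlines()
               if re.match(r"\s*[-*•]", line)]
    return len(bullets) == 3

def bullet_adherence_rate(responses: list[str]) -> float:
    # The metric to monitor over time: what share of responses
    # follow the three-bullet format?
    return sum(has_three_bullets(r) for r in responses) / len(responses)

sample = [
    "- one\n- two\n- three",
    "Here are some points:\n- only\n- two",
]
print(bullet_adherence_rate(sample))  # 0.5 for this toy batch
```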

Day 97 of 100 Days of AI

AI agents are a mixed bag. This morning I tried to get a group of CrewAI agents to complete some YouTube summaries, and they kept failing or going off the rails. My hard-coded, deterministic version works significantly better than leaving things to a group of autonomous agents.

The issue, it seems, is that I delegated too much to the agents in this instance. I asked them to Google around to find YouTube videos instead of using the YouTube API, for example. And I had also left it up to them to decide what kind of search query to run for a specific video. This is a small example of a case where agents aren’t yet smart enough to figure things out on their own. That said, it’s quite easy to build guardrails and guidance to improve results. I just expected a bit too much, too early.
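For contrast, here’s a sketch of the boring-but-dependable deterministic approach: search YouTube directly through the official Data API instead of asking agents to Google around. The API key and query are placeholders.

```python
# The deterministic pipeline: query the YouTube Data API directly
# instead of letting agents search the web. Assumes the
# google-api-python-client package and a valid API key.
from googleapiclient.discovery import build

YOUTUBE_API_KEY = "YOUR_API_KEY"  # placeholder

def find_videos(query: str, max_results: int = 5) -> list[dict]:
    youtube = build("youtube", "v3", developerKey=YOUTUBE_API_KEY)
    response = youtube.search().list(
        q=query,
        part="snippet",
        type="video",
        maxResults=max_results,
    ).execute()
    return [
        {"title": item["snippet"]["title"],
         "video_id": item["id"]["videoId"]}
        for item in response["items"]
    ]

# The summarisation step then receives a fixed, predictable list of
# videos rather than whatever the agents happened to stumble on.
for video in find_videos("large language models explained"):
    print(video["title"], video["video_id"])
```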

Day 96 of 100 Days of AI

There’s an article in the Wall Street Journal I came across today with the headline below.

The article ends with a quote from a Google Cloud Chief Evangelist, Richard Seroter: “If you don’t have your data house in order, AI is going to be less valuable than it would be if it was…You can’t just buy six units of AI and then magically change your business.”

Gen AI applications have strong ROI in areas like code, text, and media generation. But when it comes to making sense of data or using AI agents to drive critical business decisions, we should remain sceptical, since these generative models still lack strong reasoning capabilities.

Day 95 of 100 Days of AI

Claude AI is really good at coding. In one shot, I asked it to create a Streamlit app that I could use to chat with my PDFs via a local, offline large language model. The first attempt produced a roughly working version, and with just two more follow-on prompts of refinement, I had a fully working RAG application on my machine that doesn’t need the internet. It’s slow, but it works.

It’s insane how rapidly you can prototype software application ideas with the help of LLMs, especially if you’re not a professional software developer.

Below is the prompt I gave to Claude.

And here’s the working Streamlit app.
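The core of an app like this is surprisingly small. Below is a stripped-down sketch of the idea, assuming Ollama is serving locally with an embedding model and a chat model already pulled; the model names are illustrative, and a real app would want proper chunking, caching, and a vector store.

```python
# A stripped-down, offline "chat with your PDF" sketch.
# Run with: streamlit run app.py
import math

import ollama
import streamlit as st
from pypdf import PdfReader

EMBED_MODEL = "nomic-embed-text"  # illustrative model names
CHAT_MODEL = "llama3"

def embed(text: str) -> list[float]:
    return ollama.embeddings(model=EMBED_MODEL, prompt=text)["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

st.title("Chat with your PDF (offline)")
uploaded = st.file_uploader("Upload a PDF", type="pdf")

if uploaded:
    # Naive "chunking": one chunk per page. A real app would cache
    # these embeddings instead of recomputing them on every rerun.
    pages = [page.extract_text() or "" for page in PdfReader(uploaded).pages]
    vectors = [embed(page) for page in pages]

    question = st.chat_input("Ask something about the document")
    if question:
        # Retrieve the most similar page and stuff it into the prompt.
        q_vec = embed(question)
        scores = [cosine(q_vec, v) for v in vectors]
        context = pages[scores.index(max(scores))]
        reply = ollama.chat(
            model=CHAT_MODEL,
            messages=[{
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            }],
        )
        st.write(reply["message"]["content"])
```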

Day 94 of 100 Days of AI

Just 18 months ago, OpenAI released GPT-3.5 Turbo, which had double the input token context window of its predecessor, GPT-3. We went from 2,048 tokens to 4,096 tokens, and that felt like a significant leap. But today, we are enjoying context windows of 128,000 tokens with GPT-4o.

How much further can we go? Today, I was perusing this Google paper, and it turns out that their research team can achieve a 10-million-token context window with Gemini 1.5! Not just that, but as you can see in the charts below from the June 2024 update of the paper, the model achieves almost perfect recall across the equivalent of 7 million words, or up to 107 hours of audio or 10 hours of video. These are incredibly impressive results!
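“Recall” in this context is usually measured with needle-in-a-haystack tests: plant a unique fact deep inside a huge context and ask the model to retrieve it. Here’s a toy sketch of the idea; the model call is left hypothetical, and real evaluations plant needles at many depths and context lengths.

```python
import random

def build_haystack(needle: str, filler_sentences: int) -> str:
    # Bury one distinctive sentence at a random depth in a sea of filler.
    filler = "The sky was grey and nothing much happened that day. "
    sentences = [filler] * filler_sentences
    sentences.insert(random.randrange(filler_sentences), needle + " ")
    return "".join(sentences)

needle = "The secret launch code is 7-alpha-nine."
context = build_haystack(needle, filler_sentences=50_000)
prompt = context + "\n\nWhat is the secret launch code?"

# response = long_context_model(prompt)  # hypothetical model call
# recall = "7-alpha-nine" in response    # score by simple string match
```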

Day 93 of 100 Days of AI

A couple of experiments today: I completed a simple tutorial that helps you build a RAG app using local models. I also installed and tested https://openwebui.com/. This is a really cool open source project that gives you a ChatGPT-style interface to use with large language models that run locally and offline on your machine. You can see an example of it below.

Running LLMs locally won’t match the performance of the big cloud-based models. However, if you’re travelling and don’t have internet access, it’s a great hack!

From Open WebUI, GitHub

Day 92 of 100 Days of AI

I’ve just finished listening to this podcast with the founders of https://www.factory.ai/. Their startup automates away the drudge work in software engineering. That includes writing tests, debugging, documentation, and migrations.

I found the founders’ views compelling. They don’t think AI will replace software engineers en masse and all at once. Instead, we are likely to see gradual automation, task by task. Factory.ai, for example, has bots that can automatically review code changes; a bot that can rewrite code and improve it; and bots that can plan larger projects.

What they aren’t building, and what isn’t yet achievable, is a fully autonomous AI software engineer. We are still some way away from that, and companies attempting to build such systems will probably struggle to create real business value. (Hint: those systems aren’t yet reliable!)

I found the podcast insightful and it’s worth a watch if you’re interested in the automation of work or are building a company in the space.