Open Problems (& Opportunities) for AI: Summary Notes from the Conference

The AE Summit Poster

Last week I attended the Algorithmic Innovation & Entrepreneurship Summit on Open Problems for AI. That’s a bit of a mouthful, but in essence it was a 2-day conference covering the challenges and opportunities for AI over the next 5 years.

Day 1 was focussed on algorithms and AI techniques, while Day 2 focussed on applications.

In the post below I’ll share some of the most interesting takeaways that stuck with me. You can also find summaries of all the talks here.

I found most of the content accessible (thanks to my 100 days of AI learning). However, there were some ideas, such as Markov blankets and probabilistic programming, that were so foreign I had to come back to them after the event.

Still, you’ll find that the highlights below don’t need a technical foundation to appreciate.

Fun Facts

💧 A 100-word email generated by an AI chatbot like ChatGPT requires about 500ml of water to cool data centers. If 10% of working Americans (roughly 17 million people) used this tool weekly for a year, we would need around 435 million liters of water to cool the machines responsible.

🐴 The best way to train to be a prompt engineer might be to meditate more. Turns out, having an introspective awareness of your own thought processes could be key to unlocking better LLM outputs.

🤖 That said, you don’t actually need technical skills to use AI well. Three-quarters of non-tech companies have used ChatGPT at least once, even though the technology is less than two years old.

Conference Highlights

The holy grail of artificial general intelligence (an AI that can do anything a human can) will require open-ended learning. This means we need to find a way to have AIs set their own objectives in the real world, and learn openly and endlessly, just as humans do.

Training AI is currently very inefficient. For example, LLMs are being trained on trillions of tokens (“the whole internet”, as one speaker put it), which is wasteful since much of that data is of mixed quality or redundant. Smarter techniques are emerging where you get better results with 50-90% less data.

Using GenAI is also expensive in environmental terms. One speaker shared the example that just one AI-generated email consisting of 100 words requires 500ml of water to cool data centers. Scaling this to a tenth of the working population of the USA using such a tool weekly would require roughly 435 million liters of water annually.

We must be cognizant of the many key weaknesses of LLMs. Issues such as bias, privacy and data leaks, as well as limited or no explainability in how AI makes decisions, are worth highlighting.

There are also limits to techniques like reinforcement learning. For instance, it assumes perfect feedback mechanisms, complete access to all parts of a system, and perfect environments—none of which are reflective of the real world. We need to explore how we can train AI systems in more realistic and rough environments.

The ability of LLMs to reason is a promising path. For example, OpenAI’s recent o1 models have been trained to reason, rather than just generate text. OpenAI expects these models to help advance scientific research.

Despite the AI hype, adoption and productivity improvements are yet to take off. Lots of people and organizations have experimented with ChatGPT, but transformational changes are yet to be seen. I think software engineers are seeing the most benefits so far with coding co-pilots. Text summarisation and structuring unstructured data are also useful, but radical features such as AI agents doing human work are yet to create meaningful value.

AI and education haven’t yet seen radical adoption either. The space has the potential to offer adaptive and personalised learning, which could transform education outcomes, whether through text-to-speech systems that help students with dyslexia or personalised learning pathways tailored to each student’s ability. There’s lots of potential here, but adoption has been slow.

AI in medicine is also facing adoption challenges. The technology is there but it’s tough to deploy it. Obstacles include implementation hurdles (e.g. in the NHS), incentive structures where treatment rather than prevention is prioritised, and challenging customer behaviours (e.g. patients not adhering to monitoring device use).

Future AI directions will require moving beyond just massive data and computing power. The field is shifting toward the smarter use of data through prioritization, noise filtering, and synthetic data generation. The conference also covered sample-efficient algorithms, open-ended systems that learn on their own, multimodal AI that integrates language, vision, and actions in the real world, as well as multi-agent systems. These developments will hopefully be guided by evaluation frameworks that focus on what the world actually needs.

The Software Engineer is Dead, Long Live the Software Engineer!

[Image composed with DALL·E]

An undeniable killer use case for large language models is code generation. It’s already helping software developers write code up to twice as fast and with more delight.

As LLMs improve with each release, they can write better software and handle more complex architectures at a fraction of the cost of older-generation models. Take this trajectory to its most freakish conclusion and you’ll have a future where AI writes all of the world’s code. In that instance, we won’t need software engineers, right?

This future is far more distant than sensationalists would have you believe. The genAI hype obscures the reality of what’s likely to happen over the next few years. I know this because even though I’m not a professional software developer, I write code with AI assistance every week, and my day job is to find and invest in talented tech entrepreneurs. I’ve seen first-hand and heard from other developers what makes LLMs brilliant and the areas where AI falls short.

[An example of code autocomplete in Python]

For example, try building a full-stack app with dozens of files and multiple API services. You’ll find that LLMs often get lost in the sauce when you have thousands of lines of code. No doubt, longer context windows (how much you can stuff into a model’s prompt) will solve some of this, but how about code maintenance?

If an AI writes most (or all) of the code, you still need someone who can interrogate and navigate the codebase. Why? Well, if the features of some arcane library get deprecated, for example, and the AI lacks sufficient training on what’s new, you’ll need a human to step in and help.

Or consider the seemingly simple task of understanding what users want or need. Software engineers don’t just write code. Knowing the syntax of a computer language is the easiest part of the job. You must also talk to people, understand their needs, distil those requirements into something more precise and achievable, and then balance the costs and benefits of various tools and approaches. AI will get good at this someday, but the world is too complex for LLMs to eliminate all software engineering jobs in the near term.

Some software engineers are already being replaced, though. People who use AI to compose better software are starting to replace those who don’t. And these AI-augmented engineers will have more work than ever. This is because things that were previously too expensive to automate will now be cheaper to codify. Just think about all the weird manual processes you tolerate because there’s no good software for it. AI-augmented software developers and hobbyists will be able to tackle these areas with greater speed and ease.

In my own life, I’ve composed code with AI to complete various mundane tasks. For example, I wrote a Python script to export my Spotify playlists to another streaming platform instead of paying for some obscure software. I hacked together a full stack app to help me find meeting venues. I built an app to summarise YouTube videos instead of waiting for Google to make the feature. I’ve also experimented with AI agents that conduct market research, among other things.

The software engineer of the pre-LLM era is clearly dead. But long live the software engineer who augments their craft with AI. Just as high-level programming languages eliminated the need to write tedious low-level machine code, AI will free up software engineers to focus on more creativity and innovation—two things the world needs now more than ever.

Day 100 of 100 Days of AI 🎉

AI agents have been getting a lot of attention but they can be expensive to run. This is because they use a lot of tokens and make multiple calls to LLMs. Some AI developers have beaten performance benchmarks by using exceptionally expensive agentic runs. However, this is unrealistic because in real-world business applications, costs matter.

This brings me to a new AI paper from researchers at Princeton University. In a nutshell, the researchers argue that cost should be considered when running evaluations of AI agent systems. In addition, the paper highlights techniques that can produce high accuracy at a massively lower cost, as the chart below on accuracy versus cost shows.

Some of the techniques used to generate high accuracy at lower cost are listed below. They are also labelled on the dots in the chart above.

Retrying Strategy: Set the temperature of a model to zero and call it up to a fixed number of times while it keeps failing a specified test. You can do this with cheaper models like GPT-3.5, calling them just a handful of times until you get the desired answer. Note that even when temperature is set to zero, these models are still somewhat stochastic, so retries can produce different outputs.

Warming Strategy: Here you do the same as the retry strategy, but you incrementally increase the temperature at each try, going from 0 up to 0.5. Turning up the temperature (aka randomness) increases the variety of answers and could lead to a successful result more quickly and cheaply.

Escalation Strategy: You start with a cheaper model and escalate to more intelligent, expensive models only if the first prompt attempts fail. This conserves resources by reserving the expensive models for the cases where the cheaper ones fall short. A rough code sketch of these strategies follows below.
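To make the ideas concrete, here is a minimal sketch of the warming and escalation strategies, not the paper’s code. It assumes the OpenAI Python client, and the `passes_test` check and model names are placeholders you would swap for your own task.

```python
from openai import OpenAI

client = OpenAI()

def passes_test(answer: str) -> bool:
    # Placeholder: substitute your own task-specific check here.
    return answer.strip() != ""

def warming_strategy(prompt: str, max_tries: int = 5) -> str | None:
    """Retry a cheap model, nudging temperature up from 0 towards 0.5 each attempt."""
    for attempt in range(max_tries):
        temperature = 0.5 * attempt / max(max_tries - 1, 1)  # 0.0 -> 0.5
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # cheap model
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        answer = response.choices[0].message.content
        if passes_test(answer):
            return answer
    return None

def escalation_strategy(prompt: str) -> str | None:
    """Try progressively more capable (and more expensive) models only as needed."""
    for model in ("gpt-3.5-turbo", "gpt-4o-mini", "gpt-4o"):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        answer = response.choices[0].message.content
        if passes_test(answer):
            return answer
    return None
```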

The paper has more on the topic and it’s worth reading in full if you’re building real-world agentic applications.

Day 99 of 100 Days of AI

I switched one of my AI agent projects from GPT-4o to the new Claude 3.5 Sonnet model. The agents now cost roughly 85% less to run, and the final results arrive about 15% faster.

To upgrade my AI agent project, all I had to do was write the code below. This is one of the great advantages of writing apps where models are interchangeable. Every few months, we get better, faster, and cheaper upgrades. Moreover, developers can benefit from new AI intelligence with minimal code changes.
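The snippet isn’t reproduced in text here, but to give a sense of how small the change is, here is a rough sketch of the kind of swap involved. It assumes a LangChain-style setup (the ChatOpenAI/ChatAnthropic wrappers); your agent wiring may look different.

```python
# Before: agents wired up to OpenAI's GPT-4o
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-4o", temperature=0)

# After: the same agents pointed at Anthropic's Claude 3.5 Sonnet
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-3-5-sonnet-20240620", temperature=0)

# Everything downstream stays the same, e.g. a hypothetical agent definition:
# researcher = Agent(role="Researcher", goal="...", llm=llm)
```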

Day 98 of 100 Days of AI

On the way home today I was listening to this podcast: “Lessons from a Year of Building with LLMs”. The biggest takeaway? While everyone has been building lots of experimental projects with LLMs, and often shipping scrappy builds of seemingly impressive demos to production (I’m guilty of this!), few people are running effective evaluations (“evals”) of their LLM apps.

To build LLM apps that can be taken seriously and used reliably, it’s important to spend time understanding what performance metrics matter and how to improve them. To do that, you have to plan and implement evals early on in the process. You then need to continuously monitor performance.

One simple example the podcast guests gave was that if you are building an LLM app that should respond with 3 bullet points each time, you can build tests or a metric that tracks what percentage of responses adhere to the 3-bullet-point format. (Note: LLMs are probabilistic and so they might not hit 100%.) This enables you to catch errors and iteratively improve your app.
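The podcast doesn’t prescribe an implementation, but a minimal sketch of that kind of check might look like this:

```python
def has_three_bullets(response: str) -> bool:
    """True if a response contains exactly three bullet-point lines."""
    bullets = [line for line in response.splitlines()
               if line.strip().startswith(("-", "*", "•"))]
    return len(bullets) == 3

def bullet_adherence_rate(responses: list[str]) -> float:
    """Fraction of logged responses that follow the 3-bullet-point format."""
    if not responses:
        return 0.0
    return sum(has_three_bullets(r) for r in responses) / len(responses)

# Example: run this over logged LLM outputs and watch the rate over time.
sample = ["- a\n- b\n- c", "Here are two points:\n- a\n- b"]
print(bullet_adherence_rate(sample))  # 0.5
```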

Day 97 of 100 Days of AI

AI Agents are a mixed bag. I tried this morning to get a group of CrewAI agents to complete some YouTube summaries, and they kept failing or going off the rails. My deterministic, hard-coded version works significantly better than leaving things to a group of autonomous agents.

The issue, it seems, is that I tried to delegate too much to the agents in this instance. I asked them to Google around to find YouTube videos instead of using the YouTube API, for example. And I also left it up to them to decide what kind of search query to run for a specific video. This is a small example of a case where agents aren’t yet smart enough to figure things out on their own. That said, it’s quite easy to build guardrails and guidance to improve results. I just expected a bit too much, too early on.
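For contrast, here is roughly what the deterministic path looks like. This isn’t my exact script; it’s a sketch that assumes the youtube-transcript-api package and an OpenAI model for the summary, with the video ID supplied up front rather than discovered by an agent.

```python
from openai import OpenAI
from youtube_transcript_api import YouTubeTranscriptApi

client = OpenAI()

def summarise_video(video_id: str) -> str:
    """Fetch a transcript deterministically, then ask an LLM to summarise it."""
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
    text = " ".join(chunk["text"] for chunk in transcript)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Summarise this transcript in 5 bullet points."},
            {"role": "user", "content": text[:50_000]},  # crude length guard
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(summarise_video("VIDEO_ID_HERE"))  # placeholder video ID
```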

Day 96 of 100 Days of AI

There’s an article in the Wall Street Journal I came across today with the headline below.

The article ends with a quote from a Google Cloud Chief Evangelist, Richard Seroter: “If you don’t have your data house in order, AI is going to be less valuable than it would be if it was…You can’t just buy six units of AI and then magically change your business.”

Gen AI applications have strong ROI in places like code, text, and media generation. But when it comes to making sense of data or using AI agents to drive critical business decisions, we should remain sceptical, since these generative models still lack strong reasoning capabilities.

Day 95 of 100 Days of AI

Claude AI is really good at coding. In one shot, I asked it to create a Streamlit app that I could use to chat with my PDFs using a local, offline large language model. The first attempt produced a roughly working version, and with just 2 more follow-on prompts and some refinement, I had a fully working RAG application on my machine that doesn’t need the internet. It’s slow, but it works.

It’s insane how rapidly you can prototype software application ideas with the help of LLMs, especially if you’re not a professional software developer.

Below is the prompt I gave to Claude.

And here’s the working Streamlit app.
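The prompt and the working app aren’t reproduced in text here. To give a feel for the shape of such an app, below is a minimal sketch of a local PDF-chat tool (not the code Claude generated), assuming Ollama is installed with a chat model and an embedding model already pulled.

```python
# Minimal local PDF-chat sketch: Streamlit + Ollama, no internet required.
# Assumes `ollama pull llama3` and `ollama pull nomic-embed-text` have been run.
import numpy as np
import ollama
import streamlit as st
from pypdf import PdfReader

def embed(text: str) -> np.ndarray:
    """Embed text with a local embedding model served by Ollama."""
    return np.array(ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"])

st.title("Chat with your PDF (offline)")
uploaded = st.file_uploader("Upload a PDF", type="pdf")
question = st.text_input("Ask a question about the document")

if uploaded and question:
    # 1. Split the PDF into page-sized chunks.
    pages = [page.extract_text() or "" for page in PdfReader(uploaded).pages]

    # 2. Retrieve the most relevant pages by cosine similarity.
    q_vec = embed(question)
    page_vecs = [embed(p) for p in pages]
    scores = [float(np.dot(q_vec, v) / (np.linalg.norm(q_vec) * np.linalg.norm(v) + 1e-9))
              for v in page_vecs]
    context = "\n\n".join(pages[i] for i in np.argsort(scores)[-3:])

    # 3. Ask the local chat model, grounded in the retrieved context.
    reply = ollama.chat(
        model="llama3",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    st.write(reply["message"]["content"])
```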

Day 94 of 100 Days of AI

Just 18 months ago, OpenAI released GPT-3.5 Turbo, which had double the input token context window of its predecessor, GPT-3. We went from 2,048 tokens to 4,096 tokens, and that felt like a significant leap. But today we are enjoying context windows of 128,000 tokens with GPT-4o.

How much further can we go? Today, I was perusing this Google paper and it turns out that their research team can achieve a 10-million-token context window with Gemini 1.5! Not just that, but as you can see in the charts below from the June 2024 update of the paper, the model achieves almost perfect recall across the equivalent of 7 million words, up to 107 hours of audio, or 10 hours of video. These are incredibly impressive results!

Day 93 of 100 Days of AI

A couple of experiments today: I completed a simple tutorial that helps you build a RAG app using local models. I also installed and tested https://openwebui.com/. This is a really cool open source project that gives you a ChatGPT-style interface to use with large language models that run locally and offline on your machine. You can see an example of it below.

Running LLMs locally doesn’t give you the same performance as the big cloud-based models. However, if you’re travelling and don’t have internet access, it’s a great hack!

[From Open WebUI, GitHub]