Day 82 of 100 Days of AI

I have a 7-billion-parameter large language model running locally on my MacBook, thanks to Ollama!

This is great when travelling and there's no Wi-Fi to reach ChatGPT or other hosted large language models. What I can run on my MacBook today isn't as powerful as GPT-4o, but it's still handy for quick queries.
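For quick queries, I can hit the local model from Python as well as the terminal. Here's a minimal sketch, assuming Ollama is running with the Mistral model already pulled (the endpoint below is Ollama's default local API):

```python
# Minimal sketch: query a locally running Ollama model from Python.
# Assumes `ollama pull mistral` has been run and the Ollama server is
# listening on its default port (11434).
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Explain what a context window is in two sentences.",
        "stream": False,  # return the whole answer as one JSON object
    },
)
print(response.json()["response"])
```

The same thing works interactively from the terminal with `ollama run mistral`.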

In the example below, I gave the Mistral 7B model a small trial task: write a simple Tic-Tac-Toe game in Python that I can run from my command-line terminal. Below is the code it provided.

I had to fix two small errors, and then it worked almost immediately. The game logic isn't quite right (wins are announced one move too late), but the program was broadly in the right direction.
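For reference, here's my own minimal sketch of the kind of command-line game the model was going for (not the model's output). The key detail is that the win check runs immediately after each move, which avoids the one-move-late announcement:

```python
# Minimal command-line Tic-Tac-Toe sketch (my own, not the model's output).
# The win check runs straight after each move, so a win is announced on
# the move that completes the line rather than one turn later.

def print_board(board):
    for row in range(3):
        print(" | ".join(board[row * 3:row * 3 + 3]))

def winner(board, player):
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals
    return any(all(board[i] == player for i in line) for line in lines)

def play():
    board = [" "] * 9
    player = "X"
    moves = 0
    while moves < 9:
        print_board(board)
        move = int(input(f"Player {player}, choose a square (0-8): "))
        if not 0 <= move <= 8 or board[move] != " ":
            print("Invalid square, try again.")
            continue
        board[move] = player
        moves += 1
        if winner(board, player):   # check right after the move is placed
            print_board(board)
            print(f"Player {player} wins!")
            return
        player = "O" if player == "X" else "X"
    print_board(board)
    print("It's a draw!")

if __name__ == "__main__":
    play()
```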

Below is the output of the game while it's running.

In the near future we are going to be able to run significantly more powerful large language models locally, all without an internet connection. This will be driven by a variety of model optimisation techniques (see my post about Apple’s work, for example) and improvements in computer hardware.

Day 81 of 100 Days of AI

The Plateau of Generative AI?

Dr Mike Pound – Youtube

There are many people in Generative AI circles who believe that we can get to God-like AI by training models on increasingly larger swaths of data. Their view is that if we can train gargantuan models on all of the world’s data, we’ll have artificial general intelligence (“AGI”) that can excel at any task.

Thanks to this video from Dr Mike Pound (and this research paper he discusses), I'm increasingly sceptical that the existing approach of pumping more data into models will get us to AGI. Why? Here are the key empirical points I took away from Dr Pound's video and the research article on the topic:

Diminishing Returns to Data — Researchers are finding that you need exponentially more data to get incremental improvements in performance. There’s a robust “log-linear” scaling trend. For example, imagine that you can train a model on 100 examples of a task to get a performance score of 15%. To get a 30% score you would need 100^2 (i.e. 10,000 examples). To get a 45% score you would need 100^3 examples (i.e. 1 million examples) and so forth. Is it really possible to find exponentially more data in pursuit of incremental performance gains? At some point, you will get diminishing and possibly negative returns.
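To make the log-linear idea concrete, here's a toy calculation that mirrors the hypothetical numbers above (my own arithmetic, not figures from the paper):

```python
# Toy illustration of a log-linear scaling trend: performance grows with the
# logarithm of the example count, so each fixed gain needs ~100x more data.
# These numbers mirror the hypothetical example above, not the paper's data.
import math

def toy_performance(num_examples):
    # 15% at 100 examples, +15 points for every further 100x increase.
    return 7.5 * math.log10(num_examples)

for n in [100, 10_000, 1_000_000]:
    print(f"{n:>9,} examples -> {toy_performance(n):.0f}% score")
```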

Rare Things are Pervasive But Not in the Data — The world is full of ideas, concepts, events and tasks that rarely appear in training datasets. It's incredibly tough to collect lots of examples of something rare, yet if a model hasn't seen enough of a concept, it will underperform on it. Consider the number of scenarios you might encounter while driving a car on a road with other human drivers. No amount of data collection and model training can ever capture every possibility on a road. This is why we don't yet have fully autonomous self-driving cars, and it's also why existing AI techniques might never get us there.

Figure 6 from the paper "No 'Zero-Shot' Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance"

AI Models Today Struggle With Rare & Nuanced Concepts — Even if you train a model on all the data in the world, that data will have a long-tail distribution. In other words, a few things will appear a lot and many things will appear infrequently. For example, if you ask a Gen AI model to produce an image of an aircraft, a worm, or a bird, you'll get lots of good generic results. But try asking it to produce an image of a Piaggio P.180 Avanti aircraft, or some other obscure object or animal. Current AI models fail at this. The authors of this paper found that 40 Gen AI models consistently underperformed on more nuanced data (the "Let-it-Wag" dataset) versus a broader dataset (e.g. ImageNet). You can see an example I ran below with GPT-4o (using Dall-E 3) versus a real image from Wikipedia.

Above: Dall-E 3 Image of a Piaggio P.180 Avanti
Above: Real-life Image from Wikipedia of a Piaggio P.180 Avanti

What does this all mean for the future of Gen AI?

Spending more money on data and compute is going to have limits. We are probably in a Gen AI bubble and once the performance of these models plateaus, we might see a market correction on how much money goes into the technology.

But there’s hope for more progress. AI researchers are experimenting with new techniques and algorithms. Perhaps a fundamentally new architecture will get us beyond what current generative AI techniques can achieve.

For now though, I don't think we're on a fast path to AGI. But then again, there's no expert consensus on the matter, so only time will tell.

P.S. Here's a good video that lays out why large language models probably won't turn into AGI.

Day 80 of 100 Days of AI

I did two things today.

First, I got a YouTube summarizer to work. I followed a simple tutorial here. I will create an agent tool out of this, and also try to build a RAG process around it.
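The gist of the summarizer is just two steps: pull the video's transcript, then ask an LLM to condense it. Here's a rough sketch of that flow, assuming the youtube-transcript-api package and the OpenAI Python client (the tutorial's exact code differs):

```python
# Rough sketch of a YouTube summarizer: fetch the transcript, then ask an
# LLM to condense it. Assumes youtube-transcript-api and the OpenAI client;
# the tutorial I followed differs in the details.
from youtube_transcript_api import YouTubeTranscriptApi
from openai import OpenAI

def summarise_video(video_id: str) -> str:
    # Pull the caption segments and join them into one block of text.
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    transcript = " ".join(segment["text"] for segment in segments)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Summarise transcripts in 5 bullet points."},
            {"role": "user", "content": transcript[:20000]},  # crude length cap
        ],
    )
    return completion.choices[0].message.content

print(summarise_video("VIDEO_ID_HERE"))  # replace with a real YouTube video ID
```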

Second, I watched this lecture on “The Future of AI from the History of Transformer.” It’s by Hyung Won Chung, a research scientist at OpenAI who previously worked at Google Brain.

The key points of the talk stem from this chart in the presentation.

The dominant force driving progress in AI today is cheap computing power. The cost of compute is falling exponentially!

This force is so powerful that it reduces the need for overly complex AI algorithms. You can scale up models with cheaper compute and more data, producing excellent results even with simpler modelling methods that don't bake in complex assumptions or strong inductive biases.

The practical implication: since abundant cheap compute lets simpler AI architectures outperform their more complex counterparts, AI researchers should take advantage of this trend rather than try to be too clever.

For example, decoder-only models like GPT-3 have outperformed Google's encoder-decoder T5 model. This isn't to say that algorithms built on lots of inductive assumptions should be discarded; pruning those assumptions away for simplicity, and in turn more generalisability, can be a powerful technique if you have the compute to train models on lots more data.

So, as compute gets cheaper and more abundant, focusing on scalable models with fewer built-in biases becomes increasingly important. This approach not only takes advantage of the current trajectory of falling compute costs, it also prepares our latest models to benefit from even greater compute efficiencies in the future!

Day 79 of 100 Days of AI

I had very little time today, but gave a quick glance to the custom tool tutorial for CrewAI. I basically want to figure out how to create my own YouTube video summary tool, since the default YouTube loader in CrewAI doesn't work reliably. (Or I can't figure out how to use the default tool properly!)

Either way, learning how to create custom tools will be crucial if I want to use AI agents more dynamically. I have some time this weekend and will explore this in more detail.
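As a starting point, here's a rough sketch of what a custom transcript tool could look like, following the BaseTool pattern from the tutorial. Import paths and class details vary between CrewAI versions, so treat this as a guess at the shape rather than code I've tested:

```python
# Rough sketch of a custom CrewAI tool that fetches a YouTube transcript.
# Import paths and the BaseTool interface vary between CrewAI versions,
# so this may need adjusting. Assumes the youtube-transcript-api package.
from crewai_tools import BaseTool
from youtube_transcript_api import YouTubeTranscriptApi


class YoutubeTranscriptTool(BaseTool):
    name: str = "YouTube Transcript Fetcher"
    description: str = "Fetches the full transcript of a YouTube video given its video ID."

    def _run(self, video_id: str) -> str:
        # Pull the caption segments and join them into one block of text
        # that an agent can then summarise.
        segments = YouTubeTranscriptApi.get_transcript(video_id)
        return " ".join(segment["text"] for segment in segments)
```

The idea is that this tool then gets handed to an agent via its tools=[...] argument, in place of the default loader.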

Day 78 of 100 Days of AI

Apple will be introducing large language models to their devices later this year. They have a high-level write-up here on the technical bits of how they got to where they are.

The post includes detail on how they conducted pre-training, post-training, optimization, and dynamic model adaptation. Some key bits I took away from reading the post are:

  • Size — The models are small enough to fit on powerful smartphones. For example, one of Apple's on-device models is a ~3 billion parameter language model. For comparison, Meta's latest flagship model has 70 billion parameters. (See the quick arithmetic after this list.)
  • Fine-tuned — The on-device models are fine-tuned and specialised for a set of common use-cases (e.g. text summarisation, image generation, and in-app actions). This means you don't really need supersized models.
  • Smart optimization — Apple has done a lot of smart work to make on-device models exceptionally efficient. On an iPhone 15 Pro, they were able to get time-to-first-token latency down to just 0.6 milliseconds per prompt token (for comparison, GPT-4 achieves 0.64 and GPT-3.5 Turbo achieves 0.27).
  • Server-based models — For more difficult tasks, the phone can rely on server-based models that run on "Private Cloud Compute."
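On the size point, some back-of-envelope arithmetic (my own, not Apple's published figures) shows why ~3 billion parameters becomes phone-friendly once the weights are quantised to low precision:

```python
# Back-of-envelope memory footprint of a ~3B-parameter model at different
# weight precisions. My own arithmetic to illustrate the size point above,
# not Apple's published figures (their post describes low-bit quantisation).
PARAMS = 3_000_000_000

for bits in (16, 8, 4):
    gigabytes = PARAMS * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights: ~{gigabytes:.1f} GB")
```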

Here is a sample of the benchmarks Apple shared. It's impressive that the on-device performance beats other, larger models. But of course, these are Apple's own benchmarks, and it's possible there was some cherry-picking to get the best numbers.

Overall, Apple has achieved promising results. We can expect even better performance of their on-device models in the years ahead.

Day 77 of 100 Days of AI

Building and running AI agents is a messy process. I completed a draft crew of agents that can take a product name and provide a summary of YouTube reviews. You can see a sample output below.
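For context, a crew like this is wired up roughly as follows. This is a simplified, hypothetical sketch rather than my actual project code, and the exact arguments vary between CrewAI versions:

```python
# Simplified, hypothetical sketch of a product-review crew. My real project
# has more tools, prompts, and configuration; details vary by CrewAI version.
from crewai import Agent, Task, Crew
from crewai_tools import YoutubeVideoSearchTool  # stock YouTube search tool

youtube_tool = YoutubeVideoSearchTool()

researcher = Agent(
    role="YouTube Review Researcher",
    goal="Find and digest YouTube reviews of {product}",
    backstory="You track down credible video reviews and extract the key points.",
    tools=[youtube_tool],
)

writer = Agent(
    role="Review Summariser",
    goal="Condense the research into a short, balanced summary of {product}",
    backstory="You write clear, well-sourced summaries for busy readers.",
)

research_task = Task(
    description="Gather the main pros and cons of {product} from YouTube reviews.",
    expected_output="A bullet list of findings with links to the source videos.",
    agent=researcher,
)

summary_task = Task(
    description="Turn the research into a concise review summary of {product}.",
    expected_output="A short summary with pros, cons, and a verdict.",
    agent=writer,
)

crew = Crew(agents=[researcher, writer], tasks=[research_task, summary_task])
print(crew.kickoff(inputs={"product": "BYD Atto 3"}))
```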

The first image is my terminal command, with the agents running.

Below is the final output, with links to YouTube videos in red.

I also ran the agents (this time without source links) for a review of the BYD Atto 3 electric vehicle. The points below are helpful and are grounded in the agents' research across a number of YouTube reviews.

The challenges I saw with this process, though, were:

  • Costs — About half a dozen full runs of the agents cost me $3.28.
  • Inconsistent outputs — Sometimes I get great outputs, and other times the agents fail completely or make up content.
  • Slow and inefficient — The agents for this review app can take up to a minute to run fully, and occasionally take unnecessary routes.

All these issues could be fixed with more powerful models, but at that point, would we even need to build the agents ourselves? Or would we just leave a more powerful LLM to figure things out rapidly and at a lower cost? That's the view I wrote about here, and I'm increasingly starting to believe it.

Day 76 of 100 Days of AI

Today was another day of AI reading, as I was travelling a bunch. I looked over the arguments for when to use retrieval-augmented generation (RAG) and when to fine-tune large language models. Both approaches have pros and cons, and combining them can cancel out some weaknesses, though not necessarily the cost. Here's a brief take on the trade-offs:

  • RAG is cheap, fast, and less prone to hallucinations (though it isn't entirely hallucination-free!). However, you are still working with a generic underlying model that will lack the nuances of a particular niche or area of expertise. (A rough sketch of the retrieval step follows this list.)
  • Fine-tuning an LLM produces a new model that is more of a domain expert. It's more likely to provide superior results within the area of specialism you fine-tune it on. However, fine-tuning is a more expensive and time-intensive process, and hallucinations are still an issue.
  • A blend of both strategies (a fine-tuned LLM with RAG) can be superior in terms of performance, but it takes more time and resources to achieve.
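To make the RAG point concrete, the core retrieval step is small: embed your documents, embed the query, pull the closest match, and hand it to the model as context. A rough sketch, assuming the OpenAI client for embeddings and chat (any embedding model or vector store could stand in):

```python
# Rough sketch of a minimal RAG step: embed documents, retrieve the closest
# one to the query, and pass it to the model as context. Assumes the OpenAI
# client; any embedding model or vector store could stand in.
import numpy as np
from openai import OpenAI

client = OpenAI()

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available on weekdays between 9am and 5pm GMT.",
    "Premium subscribers get priority email responses within 4 hours.",
]

def embed(texts):
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

doc_vectors = embed(documents)

def answer(question: str) -> str:
    query_vector = embed([question])[0]
    # Cosine similarity between the query and every document.
    scores = doc_vectors @ query_vector / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
    )
    context = documents[int(np.argmax(scores))]  # top-1 retrieval for brevity
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using this context: {context}"},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content

print(answer("How long do I have to return an item?"))
```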

Day 75 of 100 Days of AI

I'm travelling today, so no code. On no-code days, I read about AI instead. Today's topic: what does a Machine Learning Engineer (MLE) do at work?

Hacker News has an insightful thread about this with comments from people in the industry who are MLEs. Below are a few highlights.

It turns out an MLE's job is mostly about ensuring you don't have garbage-in/garbage-out with regard to the data you train your models on. The actual work of fitting models to data is a smaller fraction of the job.

Day 74 of 100 Days of AI

(Slide by Dr Matt Welsh)

I've been using large language models to help me write code for the last 18 months. I've created apps, scraped data from websites, and transcribed podcasts, amongst other things. A few years ago, this work would have taken me months of learning and effort. But LLMs have been a game-changer.

There's a possible future where people like me (and many others who aren't professional software developers) won't have to write any code at all when creating programs. In that world, you would simply describe the app or automation you want to a large language model, and the AI would create a program you could use regularly. That's the future Dr Matt Welsh paints in this convincing CS50 talk at Harvard, which I watched today.

Things might not play out exactly as he forecasts. However, given how good large language models are getting, and given my own experience of writing programs with AI, I can envision a world where significantly more people will be able to use computers in a way that was previously restricted to those who know how to code.

The full talk is available below and it’s worth watching. 

https://youtu.be/JhCl-GeT4jw

Day 73 of 100 Days of AI

Today I continued working on the YouTube review agent app project idea, but I'm having issues with some of CrewAI's tools. Once I resolve this, I'll have a better update.