Day 13 of 100 Days of AI

On Data.

One thing that’s clear as I work through this 100-day challenge is how important data is for training AI. We need lots of it, and apparently even all the data available on the internet might not be enough to build more advanced models. That’s the lead of this article from the Wall Street Journal.

According to the article and some of the researchers interviewed, if GPT-4 was trained on 12 trillion tokens (the fragments of words that large language models learn from), GPT-5 might need 60-100 trillion tokens on the current trajectory. Even if we used all the text and image data on the internet, we’d still be 10-20 trillion tokens (or more) short of what’s required, according to AI researcher Pablo Villalobos.
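
To get a feel for those numbers, here’s a minimal back-of-envelope sketch in Python. It assumes roughly four characters of plain text per token, a common rule of thumb that isn’t from the article, so treat the totals as order-of-magnitude at best.

```python
# Rough back-of-envelope: how much plain text do these token counts imply?
# Assumes ~4 characters (~4 bytes of English UTF-8 text) per token, a common
# rule of thumb; the real ratio varies by tokenizer and language.

BYTES_PER_TOKEN = 4   # assumption, not a figure from the article
TB = 1e12             # 1 terabyte = 10^12 bytes

def tokens_to_terabytes(tokens: float) -> float:
    """Approximate plain-text volume, in terabytes, for a given token count."""
    return tokens * BYTES_PER_TOKEN / TB

estimates = [
    ("GPT-4 training data (reported)", 12e12),
    ("GPT-5 estimate, low end", 60e12),
    ("GPT-5 estimate, high end", 100e12),
    ("Estimated shortfall, high end", 20e12),
]

for label, tokens in estimates:
    print(f"{label}: ~{tokens_to_terabytes(tokens):,.0f} TB of text")
```

On that assumption, even 100 trillion tokens works out to only a few hundred terabytes of plain text.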

We may eventually run out of high-quality data to train more advanced AI models on. However (and I say this as someone who isn’t a technical expert), I believe we’ll figure out ways to capture more data, or find ways to do more with less. This is something the WSJ article also considers.

Autonomous vehicles can generate an estimated 19 terabytes of data per day. A video game platform can produce 50 terabytes daily. On a larger scale, weather organisations capture 287 terabytes of data a day.
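
Purely as an illustration (the daily figures are from the article; the annual extrapolation is mine), the same kind of rough arithmetic turns those rates into yearly totals:

```python
# Illustrative only: extrapolate the article's per-day data rates to a year.
# The daily figures are the ones quoted above; the annual totals are simply
# rate * 365 and ignore growth, duplication, and how usable the data is.

daily_terabytes = {
    "autonomous vehicles": 19,
    "a video game platform": 50,
    "weather organisations": 287,
}

for source, tb_per_day in daily_terabytes.items():
    petabytes_per_year = tb_per_day * 365 / 1000  # TB/day -> PB/year
    print(f"{source}: {tb_per_day} TB/day ≈ {petabytes_per_year:,.1f} PB/year")
```

By that count, the weather data alone would run to roughly a hundred petabytes a year.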

There’s a ton of data out there. We just have to figure out how to capture more of it and make sense of it.