Day 14 of 100 Days of AI

More on data.

I read a bit more about the data problem in AI after writing about it yesterday. People smarter than I am believe there are diminishing returns to training AI models on ever-growing datasets. In this research publication, the AI researchers share the following summary:

In layman's terms (thanks to a ChatGPT translation), the above says the following:

“Our research looked into how well advanced models, trained on vast internet data, can understand and respond to new tasks without prior specific training. We discovered that these models, contrary to expectations, require significantly more data to slightly improve at these new tasks, showing a slow and inefficient learning process. This inefficiency persists even under varied testing conditions, including completely new or closely related data to their training. Moreover, when faced with a broad range of uncommon tasks, the models performed poorly. Our findings challenge the notion of “zero-shot” learning—where models instantly adapt to new tasks without extra training—highlighting a gap in their supposed ability to learn efficiently from large datasets.”

ChatGPT's summary of the abstract from this paper.

Key takeaway:

  • It looks like simply throwing more data at models will eventually hit a performance ceiling. At that point (or before it), we'll need alternative techniques and new breakthroughs. That's what AI expert Gary Marcus believes. As he notes in his latest newsletter, "There will never be enough data; there will always be outliers. This is why driverless cars are still just demos, and why LLMs will never be reliable."