On the way home today I was listening to this podcast: “Lessons from a Year of Building with LLMs”. The biggest takeaway? While everyone has been building lots of experimental projects with LLMs, and often shipping scrappy builds of seemingly impressive demos to production (I’m guilty of this!), few people are running effective evaluations (“evals”) of their LLM apps.
To build LLM apps that can be taken seriously and used reliably, it’s important to spend time understanding what performance metrics matter and how to improve them. To do that, you have to plan and implement evals early on in the process. You then need to continuously monitor performance.
One simple example the podcast guests gave: if you are building an LLM app that should respond with 3 bullet points each time, you can build tests or a metric that tracks what percentage of responses adhere to that 3-bullet-point format. (Note: LLMs are probabilistic, so they might not hit 100%.) This enables you to catch errors and iteratively improve your app. A rough sketch of what such a check could look like is below.
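This is just a minimal sketch of that idea, assuming responses are plain strings and bullets start with “-”, “*”, or “•” (the function names and sample responses here are my own, not from the podcast):

```python
import re

def has_three_bullets(response: str) -> bool:
    # Count lines that look like bullet points ("-", "*", or "•" prefixes).
    bullets = [
        line for line in response.splitlines()
        if re.match(r"^\s*[-*•]\s+", line)
    ]
    return len(bullets) == 3

def bullet_adherence_rate(responses: list[str]) -> float:
    # Percentage of responses that contain exactly 3 bullet points.
    if not responses:
        return 0.0
    passing = sum(has_three_bullets(r) for r in responses)
    return 100 * passing / len(responses)

# Example: run the check over a small batch of logged responses.
sample_responses = [
    "- First point\n- Second point\n- Third point",
    "Sure! Here are some thoughts in a paragraph instead.",
]
print(f"{bullet_adherence_rate(sample_responses):.0f}% of responses had 3 bullets")
```

Run something like this over your logged responses on a schedule and you have a simple metric you can watch over time, rather than a one-off vibe check.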