Prompt Engineering: A Surprising Switching Cost of Large Language Models

I've been working on some exceptionally long LLM prompts for a couple of projects at work, and I've noticed a fascinating phenomenon: a prompt that works well with one model can perform very differently when applied to another.
This creates a switching cost for developers and businesses. You can spend hours or days crafting a carefully engineered prompt, only to find that moving to another model means investing significant time re-engineering it to avoid a drop in quality.
Here's a simple example I ran this morning comparing responses to the same prompt from ChatGPT, Claude, and Google's Gemini. I gave each model the front page of a newspaper website and asked for a simple sentiment analysis. These responses are from the consumer chat apps, and API responses can vary. Still, it's surprising to see that the Gemini app refused to respond at all.
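The prompt was roughly of this shape. The sketch below is an illustration rather than the exact wording I used, and the headlines are placeholders, not the real Times front page:

```python
# Illustrative only: placeholder headlines and a reconstruction of the kind of
# sentiment-analysis prompt described above, not the exact text I sent.
HEADLINES = [
    "Markets rally after surprise rate decision",
    "Storm damage leaves thousands without power",
    "Local team clinches championship in overtime",
]

PROMPT = (
    "Below are headlines from today's front page of a newspaper website. "
    "Classify the sentiment of each headline as positive, negative, or "
    "neutral, then give an overall sentiment for the page.\n\n"
    + "\n".join(f"- {h}" for h in HEADLINES)
)

print(PROMPT)
```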
Each of these models has a personality of sorts, shaped by whoever created it. That said, API access can give more consistent results, since you can steer the LLM with a system prompt that alters that "personality".
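As a rough sketch of what that looks like in practice, here is the same user prompt sent to two providers with an explicit system prompt. The model names, the system prompt, and the shortened user prompt are all placeholders, and it assumes the official openai and anthropic Python packages are installed with OPENAI_API_KEY and ANTHROPIC_API_KEY set in the environment:

```python
# Sketch: one user prompt sent through two providers' APIs, each with a system
# prompt that steers the model's default behaviour. Model names below are
# placeholders; swap in whatever you actually use.
from openai import OpenAI
import anthropic

SYSTEM = "You are a neutral analyst. Answer directly and do not refuse benign requests."
# Stand-in for the full headline prompt from the earlier sketch.
USER = "Classify the overall sentiment of today's front-page headlines as positive, negative, or neutral."

openai_reply = OpenAI().chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": USER},
    ],
).choices[0].message.content

anthropic_reply = anthropic.Anthropic().messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=512,
    system=SYSTEM,  # Anthropic takes the system prompt as a top-level field
    messages=[{"role": "user", "content": USER}],
).content[0].text

print("OpenAI:", openai_reply)
print("Anthropic:", anthropic_reply)
```

Even with an identical user prompt, each provider tends to need its own system prompt tuning, which is exactly the switching cost described above.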
Note: A snippet of the Times page I extracted headlines from.

The LLM summary page: You can view the full version here.
