Discussion about this post

Pawel Jozefiak

The "made with AI feel" UI problem you hit with Cursor is exactly the category of failure that benchmarks don't measure. Scores go up, demos look good, and then you spend an actual 48 hours building something real and the pattern emerges.

I ran into the same dynamic recently at the Mistral EU Hackathon: a model that looked strong on paper, context-switching reliability issues under sustained agentic use, and outputs that were technically correct but clearly not from a model operating at the frontier. The auto model selection point is also real: when the tool is picking the model for you, you lose visibility into where the quality is actually coming from. Full write-up from the Mistral hackathon for comparison: https://thoughts.jock.pl/p/mistral-ai-honest-review-eu-hackathon-2026
