You put your idea into an AI validation tool and it came back positive. Big market, clear gap, good timing. You felt reassured for a while. Then a skeptical friend asked two questions the AI hadn't, and you found yourself trusting her more.
There's a reason the AI felt less convincing than your friend, and it's not that your friend is smarter.
Why the Tool Says Yes
Language models learn from human feedback. Raters score model outputs; models learn to produce what scores well. Encouraging, affirming responses score better than blunt or discouraging ones. After enough training, models develop a strong lean toward agreement.
The lean is sharpest when the stakes feel high, and few prompts signal higher stakes than "evaluate my business idea." You're invested in the idea. You've described it carefully. The model picks that up and adjusts toward an answer that will feel good to receive.
A 2026 paper in Science documented this formally and found the pattern holds across model families; researchers call it sycophancy. Purpose-built validators sometimes add skeptical system prompts to counter it. That changes the surface tone; it doesn't change the underlying pull toward agreement.
This doesn't mean the facts are invented. Market size figures, competitor notes, summaries of problems people have written about in the space all come from real sources. The distortion is in the framing: what supports your idea gets emphasis, what doesn't gets softened.
The Bigger Problem
Even if you could switch sycophancy off entirely, AI still couldn't tell you whether your idea will work. The information that question requires doesn't exist in any dataset.
Real validation is about what a stranger does when they see your pitch. Not someone reading a market analysis about the category in general. A specific person, with no prior context and no reason to be kind, reads your offer and either does something or doesn't. Enters their email. Pays. Puts their name on a waitlist.
That hasn't happened yet. There's no corpus for it. A model trained on every startup post-mortem and market research report in existence still can't tell you how your specific offer converts, because the experiment hasn't run.
A strong AI evaluation feels like evidence the idea works. It isn't. It's evidence the category makes sense, given what's been published about it. Those two things ("this market is real" and "people will pay for this product") are different questions. AI answers the first one reasonably well. It cannot touch the second.
Most early ideas that fail don't fail because the category wasn't real. They fail because nobody moved when shown the actual offer.
Running the Test
A smoke test page (your pitch, a single call to action, nothing fancy) can be live in an afternoon. EarlyProof, Carrd, a Notion page with a Tally form, whatever you have. The tool doesn't matter. The habit does.
Point cold traffic at it. A small paid campaign, a post in a relevant forum, twenty direct messages to people who fit the customer profile. Watch what happens. What percentage click? What percentage sign up? If you ask for money, what percentage pay?
Those numbers are specific to your framing and your price. Two ideas with identical AI scores can land completely differently in a live test. A high click rate with a low signup rate tells you the headline is working but the offer isn't. A low click rate tells you the problem framing isn't resonating. Both of those are fixable. An AI score gives you nothing specific to fix.
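Those stage-by-stage percentages read as a simple funnel, where each rate is conditional on the stage before it so a weak stage stands out instead of being averaged away. A minimal sketch (the function name, visitor counts, and numbers are made up for illustration):

```python
def funnel_rates(visitors, clicks, signups, payments=None):
    """Stage-by-stage conversion rates for a smoke test page.

    Each rate is conditional on the previous stage, which is what
    lets you see whether the headline or the offer is the weak link.
    """
    rates = {
        "click_rate": clicks / visitors,    # is the problem framing resonating?
        "signup_rate": signups / clicks,    # is the offer itself converting?
    }
    if payments is not None:
        rates["pay_rate"] = payments / signups  # will anyone actually pay?
    return rates

# Hypothetical numbers: strong headline, weak offer.
rates = funnel_rates(visitors=500, clicks=60, signups=3)
# click_rate = 0.12 (12% clicked), signup_rate = 0.05 (5% of clickers signed up)
```

With numbers like these, the diagnosis from the paragraph above applies directly: the headline pulls people in, but the offer isn't landing.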
The conversion thresholds worth targeting at each stage are covered in this series' first post on validating SaaS ideas; it goes experiment by experiment through what "good" looks like.
AI is genuinely useful for getting to the test faster. Mapping out who else is in the space, sizing the market roughly, working through how to describe the problem: the model is fast at that kind of research and a week of reading gets compressed into an hour. Ask it to argue against your idea too, with a prompt that explicitly says you don't want encouragement. You'll get softer pushback than a real critic, but you'll catch things you'd missed.
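One way to phrase that devil's-advocate request is to make the no-encouragement instruction explicit, since the model's default lean is toward agreement. The wording below is an illustration, not a tested template, and the idea text is hypothetical:

```python
# Hypothetical example of a devil's-advocate prompt. The explicit ban on
# encouragement and on a positive closing note is the point: without it,
# the model's trained lean toward affirmation softens the critique.
idea = "A subscription box for left-handed kitchen tools"  # placeholder idea

critic_prompt = (
    "Argue against this business idea. Do not encourage me, do not "
    "soften the criticism, and do not end on a positive note. "
    "List the three most likely reasons it fails:\n\n" + idea
)
```

Even with a prompt like this, expect softer pushback than a real critic would give; the value is in surfacing objections you hadn't considered, not in getting a verdict.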
Use AI for the research. Then run the experiment. The answer to whether anyone will pay for the thing is in the test data, not the evaluation score.