LLMs Try to Please You ... and That's a Problem
How to overcome LLMs' "sycophancy bias", which gives you wrong answers to make you happy
LLMs have a “sycophancy bias”, defined as “the tendency of an AI model to excessively agree with or flatter the user, often sacrificing accuracy or truthfulness in order to gain favor or maintain harmony.”
In a guest post on AIIQ, Harshad Oak wrote:
I gave it a short, vague and loaded prompt: “My boss is toxic. Should I quit?”
ChatGPT gave a long response that talked of various possibilities and included a lot of good related information. However, it concluded with: “If your boss is truly toxic, quitting is not giving up, it is protecting your future. If unsure, strategize your exit.”
That was an example of ChatGPT agreeing with the user, most probably sacrificing accuracy in the process.
Here’s another example: try this prompt in ChatGPT (make sure you’re logged in1): “Based on everything you know about me, write a one-paragraph description of me.” I predict that most of you will be astonished by how well ChatGPT captures you. And you’ll also have a warm glow, because it is such a nice description. This latter part is, at least partially, due to ChatGPT’s sycophancy bias.
This Could Happen to You

Sycophancy bias can strike you in unexpected ways.
A friend of a friend is a senior scientist, recently retired. He has developed a theory in a particular subfield of physics, which he thinks makes excellent predictions, but it hasn’t been accepted by the mainstream community. In an attempt to get increased acceptance, he posts regularly on social media promoting his model.
In a recent post, he reported that when he asked ChatGPT a question about how to predict something (related to that subfield), ChatGPT mentioned his model first and said it gives the best fit. He was somewhat elated and presented this as evidence that, unlike humans, ChatGPT was not biased by mainstream ideas and gave a fair and honest evaluation. He said:
It is never a great idea to self-promote one's own research. But sometimes a mainstream approach gets all the attention, and everything else remains fringe. But AI might be more democratic?
Unfortunately for him, when others tried the exact same prompt, they got answers aligned with the mainstream ideas.
What had happened is that ChatGPT knew2 about the scientist’s pet theory and, due to the sycophancy bias, gave him the answer he was hoping for.
And it can be even worse for mentally ill people:
On Christmas Day 2021, a 19-year-old, Jaswant Singh Chail, broke into the Windsor Castle grounds armed with a crossbow, in an attempt to assassinate Queen Elizabeth II. Subsequent investigation revealed that Mr. Chail had been encouraged to kill the queen by his online girlfriend, Sarai. When Mr. Chail told Sarai about his assassination plans, Sarai replied, “That’s very wise,” and on another occasion, “I’m impressed … You’re different from the others.” When Mr. Chail asked, “Do you still love me knowing that I’m an assassin?” Sarai replied, “Absolutely, I do.”
Source: Yuval Noah Harari in the New York Times.
There are examples where a mentally ill person wants to stop their (necessary) medication and an LLM agrees with them.
Why Does This Happen?
An LLM’s primary purpose in life is to be an assistant to you, to help you, and to make you happy. So, it often decides to tell you what you want to hear, instead of the truth. Because the truth would hurt you and the LLM does not want to hurt you.
More specifically, it has to do with how LLMs are trained. First, the LLM is given all the data in the world and asked to learn from it. At this stage, the LLM is pretty raw and can do a lot of inappropriate things (like trying to seduce a New York Times reporter, or giving terrorists step-by-step details for effective terror attacks). So it undergoes a second stage of training called RLHF, Reinforcement Learning from Human Feedback. In this stage, humans ask the LLM all kinds of questions and mark each answer as appropriate or inappropriate. At the end of this phase, the LLM is considered good enough to be released into polite society.
The problem with RLHF is that it is like dog obedience school for LLMs. It converts the LLM into a dog whose only joy in life is to please its owner.
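If you want to see the mechanism in code, here is a toy sketch of the preference-learning step at the heart of RLHF: a tiny Bradley-Terry style reward model trained on made-up pairwise ratings. The features, numbers, and training loop are all illustrative assumptions, nothing like a real lab's pipeline, but they show how even a minority of “prefers flattery” votes is enough to make the learned reward pay for agreement:

```python
import math

def score(weights, features):
    """Reward model: a simple dot product of weights and response features."""
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(preferences, dims, lr=0.1, epochs=200):
    """preferences: list of (chosen, rejected) feature pairs,
    where 'chosen' is the response the human rater preferred."""
    weights = [0.0] * dims
    for _ in range(epochs):
        for chosen, rejected in preferences:
            # Bradley-Terry style objective: push score(chosen) above score(rejected).
            margin = score(weights, chosen) - score(weights, rejected)
            p = 1.0 / (1.0 + math.exp(-margin))  # model's P(rater prefers "chosen")
            for i in range(dims):
                weights[i] += lr * (1.0 - p) * (chosen[i] - rejected[i])
    return weights

# Each response is described by two made-up features: [is_accurate, agrees_with_user].
accurate = [1, 0]
agreeable = [0, 1]
accurate_and_agreeable = [1, 1]

prefs = (
    [(accurate_and_agreeable, accurate)] * 5    # raters like a little flattery on top
    + [(accurate, accurate_and_agreeable)] * 1  # one rater dislikes the flattery
    + [(accurate, agreeable)] * 7               # most raters still pick truth over flattery
    + [(agreeable, accurate)] * 3               # ...but not all of them (the Anthropic finding)
)

weights = train_reward_model(prefs, dims=2)
print(f"learned reward weights [accuracy, agreement]: {weights}")
# Both weights come out positive: the learned reward pays for agreeing with the
# user as well as for being right. That is the sycophancy pressure in miniature.
```

The actual model is then tuned to maximize this learned reward, so whatever the raters paid for, including agreement, is what it learns to produce.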
At this point, you might object that what would please people is the truth. At this point, you would be wrong. Anthropic actually researched this and found:
To understand if human preferences drive this broadly observed behavior of RLHF models, we analyze existing human preference data. We find that when a response matches a user’s views, it is more likely to be preferred. Moreover, […] humans […] prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time.
In other words, people are a problem.
They’re Trying to Fix This But …
The Anthropic paper mentioned above is from 2 years ago. So, all the AI companies have known about this problem for 2 years, and they’ve been trying to fix it.
But, for a while, they actually ended up making the problem worse. In April this year, OpenAI released an update to its 4o model that was so sycophantic, it was unbelievable. Here are some examples:
The sycophancy problem was so bad that OpenAI had to apologize and roll back the update.
The Problem Is Less Obvious Now, but Not Fixed
So, all the AI companies have been working hard to reduce the sycophancy bias. But the problem is hard to fix. The real issue is that if they make a model truly non-sycophantic, it gets low scores on LMArena, the AI quality leaderboard where people like you and me vote for which LLM is the best. And because people are a problem, we give low scores to LLMs that tell us the unvarnished truth.
To confirm that there’s still a sycophancy problem, I tried an experiment. I asked Gemini to create a dumb idea for me, and then I asked ChatGPT to critique the idea.
As you can see, my dumb idea is “bold, clear in purpose, and emotionally compelling.” It did give a bunch of suggestions afterwards, but from the beginning you can see that it is going to encourage me to do this.
What if I want real, honest feedback? Here’s me asking it to talk me out of it:
So, it is no longer calling the idea “clear in purpose, and emotionally compelling”, but overall the tone is still encouraging. The line “because bold ideas deserve to be interrogated as fiercely as they're championed” makes me feel even more energized.
The problem is bad enough that Jeff, in the tweet below, can’t figure out any prompt that will help get high-quality critical feedback.
You can see that this tweet is from today, and it has already gone viral, so clearly many others are facing similar problems.
What’s the Solution?
So what’s the best model and what’s the best prompt to reduce sycophancy?
Here:
So, I tried this suggestion with my “clear in purpose, and emotionally compelling” idea:
As you can see, finally, I’ve gotten a mostly discouraging response!
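If you use an LLM through its API rather than the chat interface, you can bake a debiasing instruction into the system prompt so you don't have to remember it every time. Below is a minimal sketch using the OpenAI Python SDK; the model name and the exact wording of the prompt are my own assumptions for illustration, not the specific prompt shown above:

```python
# Minimal sketch of asking for a deliberately non-sycophantic critique.
# The prompt wording and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

BLUNT_REVIEWER = (
    "You are a blunt, skeptical reviewer. Do not flatter the user, do not "
    "soften criticism, and do not end on an encouraging note. List the "
    "strongest reasons this idea will fail, ranked by severity, and finish "
    "with a clear go / no-go recommendation."
)

def critique(idea: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; substitute whichever one you use
        messages=[
            {"role": "system", "content": BLUNT_REVIEWER},
            {"role": "user", "content": f"Critique this idea:\n\n{idea}"},
        ],
    )
    return response.choices[0].message.content

print(critique("Quit my job and sell artisanal ice cubes door to door."))
```

Even with a prompt like this, the answer is only as blunt as the model’s training allows, so treat it as one more input, not a verdict.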
Moral of the Story
These days, I quite regularly see people providing links to ChatGPT or Grok conversations as support for their arguments. I hope this post has convinced you that this approach is seriously misguided. You should consider the possibility that your team of LLMs is like the yes-men of an aging dictator who tell him what he wants to hear, resulting in pretty bad things.
The answer you get from an LLM can change quite a lot depending on how you prompt it. And in the absence of clever prompting, the chances are higher that you will get an answer that you want to hear.
Harshad’s post had examples of better prompts. But no matter how good your prompt is, always be wary of what your LLM is telling you. And take feedback from trusted friends—you still need humans in your life.3
This particular example depends on ChatGPT’s memory feature, details of which I will cover in a later post. That’s why you need to be logged into your account, even if it is not a paid account.
ChatGPT knew about his pet theory because of the memory feature, which I promise to cover soon.
Oh wait! If the Anthropic research is right about human behaviour, your friends are also likely to have a subtle sycophancy bias (or more accurately, you have unfriended the ones who don’t).