Stop Worrying About Privacy When Using LLMs
Questions you ask ChatGPT are not going to become public
One of the excuses people regularly give me for not using LLMs (like ChatGPT, Gemini, Claude, Grok) is privacy concerns. This is a shame because most of these concerns are overblown and spring from misunderstandings of how this technology works.
For personal use, if you already use Gmail, Google Docs, or Office 365, I think there’s absolutely no reason you shouldn’t be using LLMs. The privacy concerns are similar. For company use, you should, of course, follow company policy. But if you’re in a position to influence that policy, you should be pushing hard for maximal use of LLMs. If you use AWS, Azure, or any other SaaS offerings, there’s no reason not to use LLMs. The exception is if your company happens to be in a regulated sector, like finance, where government regulations may genuinely prevent you from doing so.
Why Privacy Is Not a Concern with LLM Use
The rest of this post gives the reasons why.
Let’s start with the worst case. Suppose you have lots of private data in your ChatGPT account, either as part of your conversations or as documents you uploaded. Suppose OpenAI uses that data as part of the training data for its next ChatGPT model. And suppose someone then asks that new model a question about you or your situation. The probability that ChatGPT’s answer contains some of your private data is almost non-existent. And you can make this probability zero by simply going into the settings of your ChatGPT/Gemini/Claude account and disabling the option that allows them to use your data to train future models.

That’s it. You now have privacy guarantees similar to those you get from Gmail, Google Docs, Microsoft Office 365, Azure Cloud, Amazon Web Services, etc.
LLMs Can’t Memorize Specifics From Their Training Data
Even if you allow OpenAI to use your data for training its models, your private data is extremely unlikely to leak. This is because an LLM is not like Google’s search engine. It does not, and cannot, remember all the information in its training data. Training an LLM consists of making it look for common patterns in the data and memorising those patterns as a series of weights in a neural network. After training is complete, the model no longer has any access to the training data. So, when ChatGPT is answering someone’s questions, it does not have direct access to your data, even if it was trained on your data.

What are the chances that one of the patterns it learned and encoded in its weights is your private data? Negligible. The raw training data for a modern LLM runs from tens to hundreds of terabytes, equivalent to a few million volumes of the Encyclopedia Britannica. What fraction of that will be your data, even if you spend all your time chatting with ChatGPT?

Some texts, like Shakespeare’s works, the Bible, or popular Hindi movie songs, are repeated many, many times in the training data because so many different sources include them. In those cases, there is some chance that the model memorises the exact text, because it sees it as a pattern worth encoding in its weights. And yet, LLMs often can’t even do that. (See the example below.) Your private data is such a tiny, tiny fraction of the training data that there’s no realistic chance of it being reproduced by the LLM. The chances that your private data will be leaked by your bank, your hospital, your government office, or your friends who post on social media are far, far higher.
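To make the scale concrete, here is a rough back-of-the-envelope calculation. The numbers are assumptions picked purely for illustration, not actual figures from any AI lab:

```python
# Back-of-the-envelope estimate (illustrative numbers only): what fraction
# of a training corpus could your chats possibly be?

training_corpus_bytes = 100e12       # assume ~100 TB of raw training text
your_chats_bytes = 10_000 * 2_000    # assume 10,000 conversations of ~2,000 bytes each

fraction = your_chats_bytes / training_corpus_bytes
print(f"Your share of the corpus: {fraction:.2e}")  # ~2e-07, i.e. about 0.00002%
```

Even under these generous assumptions, your text is a vanishingly small sliver of what the model sees during training.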

And, just to drive the point home, I will repeat: even this level of leakage can be prevented by simply changing the setting that allows them to use your data for training purposes.
Here are a few more common misconceptions, cleared up.
LLM Models Are Not Updated in Real Time
When you have a conversation with ChatGPT, it is not “learning” from your responses. The actual model remains unchanged during the conversation.
Some people might think that this is clearly false. For example, if you tell ChatGPT that you prefer answers to be in really simple English, it not only listens to you but also remembers this in the future. How is this possible if it is not “learning”?

When you type anything in the text box on the chatgpt.com website, it sends your “query” to the chatgpt-4o model running on another server. However, it doesn’t send just what you typed. The entire conversation so far is attached to your “query”, along with a short summary of your previous conversations with ChatGPT that are relevant to it. The chatgpt-4o model is, in effect, told: “The user just typed <zzz>. Respond keeping in mind that this is the conversation so far <xxx>, and this is a summary of relevant earlier conversations <yyy>.” The model itself isn’t changing. What gets sent to it as “context” keeps changing, and obviously, that context is private to you. Other people using ChatGPT get context from their own accounts.
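As a rough illustration of the idea, here is a minimal sketch of how a chat frontend might assemble that context on every request, using the OpenAI Python SDK. This is not OpenAI’s actual implementation; the model name, the memory summary, and the conversation history below are made-up placeholders:

```python
# Sketch: the model's weights never change; only the context sent with each
# request changes. All values here are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

memory_summary = "The user prefers answers in simple English."  # summary of earlier conversations
history = [                                                      # the conversation so far
    {"role": "user", "content": "Explain what an LLM is."},
    {"role": "assistant", "content": "An LLM is a program that predicts text..."},
]
new_message = "Now explain how it is trained."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"Relevant earlier conversations: {memory_summary}"},
        *history,                                                # re-sent on every request
        {"role": "user", "content": new_message},
    ],
)
print(response.choices[0].message.content)
```

The “memory” lives entirely in what gets re-sent each time, not in the model itself.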
But OpenAI Has My Private Data and Can Misuse It
So does Google. And Microsoft. And Dropbox. And at least 20 other companies. There’s no reason to single out AI companies.
I still remember the days when cloud computing was new and most companies were afraid of putting their data in the cloud because of privacy concerns. Companies that were slow to adopt the cloud because of this suffered competitively. The same thing will repeat now, but with higher stakes.
But, but …
Here’s an example of ChatGPT giving me the email address of a company. This is not a famous company, so there’s no reason for it to be prominent in the training data. If what I said earlier is true, how was ChatGPT able to reproduce the email address exactly?
Simple. In this case, chatgpt-4o (the model) realised that it could not answer this question, so it recommended that a web search be done. At this point, ChatGPT (the website)1 ran a web search, picked the 10 or 15 most relevant web pages, and attached the contents of those pages to the context it sent to chatgpt-4o, along with the query: “The user has asked ‘Find me the email address of reliscore.com’. Here are the contents of webpages that might be relevant. Answer the question based on this information.”
Two important points to note: chatgpt-4o, the model, did not have this specific, obscure information, because (as we discussed earlier) it cannot memorise such specifics. The information was provided to it from public sources. No private data was harmed in the making of this answer.
I’m oversimplifying here. The website itself isn’t doing this; a program in OpenAI’s data center is. The important point is that a program other than the chatgpt-4o model controls what context gets sent to the model. It attaches the information from the current conversation, a summary of relevant earlier conversations, and summaries of relevant websites.
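Here is a similarly simplified sketch of the web-search case: a program outside the model fetches pages and attaches them to the context. The web_search helper below is hypothetical, standing in for whatever search tool the product actually uses:

```python
# Sketch: the answer comes from public pages attached to the context,
# not from anything the model memorised. web_search() is a hypothetical stand-in.
from openai import OpenAI

client = OpenAI()

def web_search(query: str) -> list[str]:
    # Hypothetical helper: a real system would call a search API and return
    # the text of the most relevant pages for the query.
    return ["<text of the company's contact page would go here>"]

question = "Find me the email address of reliscore.com"
pages = web_search(question)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Answer the user's question using only the web pages provided below.\n\n"
                    + "\n\n".join(pages)},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```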
An indirect privacy risk that companies are worried about is employees running AI-generated code without realising that a malicious library is siphoning sensitive information off the employee’s system. This risk existed earlier too, but guardrails were in place, e.g. vetting the libraries that are allowed to be used. Now the surface area has increased, because even non-tech staff can run AI-generated code.
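As an illustration of what such a guardrail could look like, here is a minimal sketch that checks an AI-generated Python script’s imports against an allowlist of vetted packages before anyone runs it. The allowlist contents are assumptions for the example, not a recommendation of specific packages:

```python
# Sketch of a simple pre-run check: refuse to run an AI-generated script
# if it imports packages that have not been vetted. Allowlist is illustrative.
import ast
import sys

ALLOWED_PACKAGES = {"csv", "json", "math", "pandas", "requests"}

def unvetted_imports(source: str) -> set[str]:
    """Return top-level packages imported by the script that are not on the allowlist."""
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found - ALLOWED_PACKAGES

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        bad = unvetted_imports(f.read())
    if bad:
        print(f"Refusing to run: unvetted packages {sorted(bad)}")
    else:
        print("All imports are on the allowlist.")
```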