
Don't Paste Code Into ChatGPT: Real LLM Leaks

In April 2023, a Samsung engineer pasted proprietary semiconductor source code into ChatGPT to debug it. Under OpenAI's consumer terms at the time, that code could be retained and used to train future models, and Samsung had no way to pull it back. Samsung banned generative AI tools company-wide a few weeks later. Three years on, the same pattern keeps happening at companies that thought they had this covered. Here is what to do about it.

The incidents that should have been a wake-up call

Samsung Semiconductor (April 2023)

Three engineers, in three separate incidents, pasted (1) source code for a measurement system, (2) source code for an internal program, and (3) notes from a recorded meeting about a confidential project into ChatGPT prompts. OpenAI's consumer terms at the time said inputs could be used for training. Samsung banned generative AI on company devices in May 2023.

JPMorgan Chase (early 2023)

Banned ChatGPT for employees citing compliance and data leakage concerns. The bank specifically mentioned it could not verify what happens to data submitted to third-party AI services.

Apple (May 2023)

Banned ChatGPT and GitHub Copilot for internal use, citing the risk of confidential data being incorporated into the training data of external models.

Amazon (early 2023)

Internal warnings circulated after employees observed ChatGPT generating responses that closely resembled internal Amazon code, suggesting other employees had shared proprietary code. Amazon issued strict warnings against pasting confidential information.

Less famous, more common

The headline incidents are at name-brand companies. The unreported version happens daily at smaller companies. Engineers paste customer database schemas into ChatGPT to ask for query optimization. They paste production logs that contain customer email addresses and internal IPs. They paste API contracts that expose internal endpoints. None of these make news. All of them leave the data outside the company's control.

What actually happens to data you paste

The answer depends on which AI tool, which tier, and what settings.

  • ChatGPT free / Plus consumer: by default, conversations can be used to improve OpenAI models unless you explicitly disable training in settings. Even with training disabled, conversations stay in your history until you delete them, and deleted or temporary chats are still retained for up to ~30 days for abuse monitoring.
  • ChatGPT Team and Enterprise: contractually, conversations are not used for training. SOC 2 Type II audited.
  • OpenAI API: similarly, by default since March 2023, API data is not used for training. 30-day retention for abuse monitoring; can be disabled by request.
  • Claude.ai: same general structure. The Free and Pro consumer tiers retain conversation data; Claude Team and Enterprise have contractual no-training terms.
  • Anthropic API: contractually no training on customer data.
  • GitHub Copilot: GitHub states that your code and the suggestions you accept are not used to train the public model. Copilot Business additionally does not retain prompts or suggestions and adds organization-level policy controls.

The risk is not that OpenAI or Anthropic is reading your code maliciously. The risk is threefold: data that leaves your perimeter is no longer under your control; it may be retained for compliance or abuse monitoring on systems you do not audit; and it resides wherever the AI vendor chooses to store it.

For regulated industries (healthcare, finance, defense, EU companies under GDPR), pasting data to a third party without a Data Processing Agreement is itself a compliance violation regardless of what the vendor does with the data afterward.

The five real risks

  1. Source code with embedded secrets. Engineers paste code that includes API keys, database connection strings, and encryption keys. Now those secrets sit in third-party logs (a pre-paste scan for the obvious patterns is sketched after this list).
  2. Customer data in error logs. "Help me debug this stack trace" is a common ask. The stack trace contains customer email addresses, internal IDs, payload data.
  3. Internal architecture documents. Asking ChatGPT to summarize a design doc means the doc now exists in OpenAI's logs. Architecture details that competitors would value.
  4. Personal customer information (PII / PHI). HIPAA-covered entities, GDPR data controllers, FERPA-covered schools. Each has explicit rules about who can process the data, and "OpenAI" is not on the approved list.
  5. Trade secrets. Algorithms, formulations, business logic that derives commercial value from being secret. Once submitted to a third party, the legal status of "trade secret" can be challenged because reasonable steps to keep it secret were not taken.
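
The first two risks are also the most mechanically catchable. Below is a minimal sketch, in Python, of a pre-paste check that scans a snippet for common secret and PII patterns before it leaves your machine. The pattern set and the check_snippet helper are illustrative assumptions, not an exhaustive ruleset; commercial DLP tools ship far broader pattern libraries.

```python
import re
import sys

# Illustrative patterns only -- a real DLP ruleset is much larger.
PATTERNS = {
    "AWS access key":    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "OpenAI API key":    re.compile(r"\bsk-[A-Za-z0-9_-]{20,}\b"),
    "Private key block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "Connection string": re.compile(r"\b(?:postgres|mysql|mongodb)(?:\+srv)?://\S+:\S+@\S+", re.I),
    "Email address":     re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def check_snippet(text: str) -> list[tuple[str, str]]:
    """Return (pattern name, matched text) pairs found in the snippet."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits

if __name__ == "__main__":
    findings = check_snippet(sys.stdin.read())
    for name, value in findings:
        print(f"BLOCK: {name}: {value[:40]}")
    sys.exit(1 if findings else 0)  # non-zero exit = do not paste
```

Piped through a clipboard wrapper or a pre-commit hook, a check like this catches the embarrassing cases (hardcoded keys, customer emails in stack traces) before they ever reach a chat window.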

The policy that actually works (instead of "do not use AI")

Banning AI tools entirely fails because employees use them anyway on personal devices, and the company gets no visibility into the leakage. The policy that works:

  1. Provide approved AI tools. Pay for ChatGPT Enterprise, Claude Enterprise, GitHub Copilot Business, or Microsoft 365 Copilot. The cost ($30-60 per user per month) is a fraction of the cost of a single breach.
  2. Block the unapproved consumer versions at the network layer. Block chatgpt.com (formerly chat.openai.com) and claude.ai on corporate networks; allow the API and enterprise endpoints. A blocklist sketch follows this list.
  3. Block on managed devices. EDR or MDM rules that block consumer AI URLs on company-issued laptops and phones.
  4. Browser-level DLP. Tools like Nightfall, Cyberhaven, or Microsoft Purview detect when employees paste code, customer data, or secrets into AI chat windows and block or alert.
  5. Train, do not just publish a policy. The policy that works includes a 15-minute training that shows actual examples of what is okay (paste a public README and ask a question) versus not okay (paste source code with hardcoded keys).
  6. Make the approved path easier than the unapproved one. If approved ChatGPT Enterprise is harder to use than personal ChatGPT, employees will go around. Make SSO seamless, make the UX equivalent, make billing centralized.
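
What "block at the network layer" looks like depends on your stack, but the core is just a deny-list of consumer AI domains that your DNS filter or proxy enforces. Here is a minimal sketch that emits a dnsmasq-style blocklist; the domain list is an illustrative assumption, not a complete inventory, and the API/enterprise endpoints are deliberately left reachable.

```python
# Emit a dnsmasq-style blocklist for consumer AI web UIs while leaving
# enterprise and API endpoints reachable. Domain list is illustrative, not exhaustive.
CONSUMER_AI_DOMAINS = [
    "chatgpt.com",        # ChatGPT consumer web UI
    "chat.openai.com",    # legacy ChatGPT domain
    "claude.ai",          # Claude consumer web UI
]

NOT_BLOCKED = [
    "api.openai.com",     # API / enterprise traffic stays on the approved path
    "api.anthropic.com",
]

def render_blocklist(domains: list[str]) -> str:
    # address=/example.com/0.0.0.0 sinks the domain and all of its subdomains
    return "\n".join(f"address=/{d}/0.0.0.0" for d in domains)

if __name__ == "__main__":
    print("# Consumer AI blocklist -- review before deploying")
    print(render_blocklist(CONSUMER_AI_DOMAINS))
    print("# Deliberately not blocked:", ", ".join(NOT_BLOCKED))
```

The same deny-list feeds equally well into a Pi-hole, a Squid ACL, or a secure web gateway policy; only the output format changes.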

Detection: how to catch it after the fact

Even with prevention controls, leaks happen. Detective controls:

  • Network logging: log all outbound connections to known AI service domains (a log-scanning sketch follows this list). Cross-reference with Slack messages or commit messages mentioning "I asked ChatGPT" to find unsanctioned usage.
  • Endpoint DLP logs: alert on clipboard paste events from known sensitive file paths (/src/secrets/, files containing BEGIN PRIVATE KEY) into browser windows on AI sites.
  • API gateway monitoring: if your engineers can call OpenAI/Anthropic APIs directly from corporate workstations, log those calls and inspect content for sensitive patterns.
  • Output monitoring: services like Have I Been Pwned domain monitoring or GitGuardian alert when credentials or secrets tied to your domain or repositories turn up in public leaks. Configure alerts on your specific patterns.
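
As a sketch of the network-logging control above: the fragment below scans a web-proxy access log for POSTs to known AI domains and flags large request bodies, which usually indicate a paste rather than a short question. The log format (whitespace-separated, with host and bytes-sent columns), the domain list, and the threshold are assumptions; adapt the parsing to whatever your proxy actually emits.

```python
import sys

# Assumed log format: timestamp user method host bytes_sent  (adapt to your proxy)
AI_DOMAINS = ("chatgpt.com", "chat.openai.com", "claude.ai",
              "api.openai.com", "api.anthropic.com")
LARGE_UPLOAD_BYTES = 10_000  # ~10 KB of request body suggests pasted code or logs

def scan(log_lines):
    """Yield alert strings for large uploads to AI service domains."""
    for line in log_lines:
        fields = line.split()
        if len(fields) < 5:
            continue
        timestamp, user, method, host, bytes_sent = fields[:5]
        if not any(host == d or host.endswith("." + d) for d in AI_DOMAINS):
            continue
        if method == "POST" and bytes_sent.isdigit() and int(bytes_sent) > LARGE_UPLOAD_BYTES:
            yield f"{timestamp} {user} sent {bytes_sent} bytes to {host}"

if __name__ == "__main__":
    for alert in scan(sys.stdin):
        print("ALERT:", alert)
```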

The narrow exception: when pasting is safe

Not every code paste is a leak. Genuinely safe scenarios:

  • Public open-source code that is already on GitHub. The model has likely seen it already.
  • Generic algorithmic questions ("how do I sort an array of objects by date"). No proprietary information involved.
  • Code you wrote yourself for a personal project. Your code, your call.
  • Sample data you constructed specifically to ask a question, not real customer data.

The rule of thumb: would you be comfortable if this exact prompt and response were tweeted publicly? If yes, paste freely. If no, do not paste.

The build-it-yourself alternative for high-stakes work

For workloads where you genuinely cannot send data outside (defense, certain healthcare workflows, classified government work), the options are a self-hosted LLM or a single-tenant cloud deployment:

  • Ollama / vLLM running open models: Llama 3.3, Mixtral, Qwen 2.5. Runs entirely on your hardware (a minimal local call is sketched below), and quality on code tasks is close to frontier models for many workflows.
  • Azure OpenAI Service: GPT-4 and GPT-4o running in your Azure tenant with zero data sent back to OpenAI.
  • AWS Bedrock: Claude, Llama, Mistral models running in your AWS account with VPC isolation.
  • Google Vertex AI Gemini: same model family in your GCP project.

The cloud-tenant options (Azure OpenAI, Bedrock, Vertex) are how regulated industries get LLM capability without data exfiltration. The data stays in your cloud account; the model runs there too.
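
For the self-hosted route, calling a local model is not much harder than calling a cloud API. Here is a minimal sketch against Ollama's local HTTP API, which listens on localhost:11434 by default; the model name (llama3.3 here) is an assumption and should be whatever you have pulled locally.

```python
import json
import urllib.request

# Ollama serves a local HTTP API on port 11434 by default; nothing leaves the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"

def ask_local_model(prompt: str, model: str = "llama3.3") -> str:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # Proprietary code can go in this prompt because it never leaves your hardware.
    print(ask_local_model("Explain what this function does:\n\ndef f(xs): return sorted(xs, key=len)"))
```

The same pattern carries over to vLLM (which exposes an OpenAI-compatible endpoint) and to the cloud-tenant options, where the only changes are the base URL and credentials scoped to your own tenant.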

Securely share architecture docs with reviewers

Code reviews and design walkthroughs often involve sharing sensitive technical details. Use zero-knowledge encrypted sharing with auto-expiring links instead of email or chat.

Create Encrypted Paste

The bottom line

The Samsung incident was three years ago, and the same pattern keeps repeating because banning AI does not work: employees use it anyway. The policy that actually works is providing approved tools and blocking unapproved ones. Train on real examples, instrument with DLP, and for high-stakes workloads host the LLM yourself or use cloud-tenant LLM services that keep data inside your perimeter.

Related reading: AI Security Risks in 2026, Data Loss Prevention (DLP) Guide, Insider Threat Detection, Stop OpenAI API Key Leaks, and Credential Sharing Policy Template.