AWS Bill Jumped 10x in 24 Hours: A 60-Minute Incident Response Playbook
Your CFO sends a Slack at 7am: the AWS bill in Cost Explorer shows yesterday's spend at $9,400 instead of the usual $940. Today is on track to do the same. You have 60 minutes before the engineering Slack channel becomes unmanageable. Here is the exact order of operations that finds the runaway resource and stops the bleeding.
Minute 0 to 5: confirm the spike is real, not a billing artifact
Cost Explorer occasionally shows weird spikes for the previous day because of delayed line items posting. Confirm before you panic.
Open Cost Explorer with these filters:
- Time range: last 7 days
- Granularity: Daily
- Group by: Service
If you see a flat green line for six days and a vertical cliff yesterday, the spike is real. Note which service jumped. The top three culprits are almost always:
- EC2-Other (data transfer, NAT gateway, EBS)
- Amazon EC2 (suddenly larger fleet)
- Amazon S3 (data egress, request count, or Glacier retrieval)
Less common but very expensive when they hit:
- AWS Lambda (a runaway recursive invocation, or a public Function URL being abused)
- Amazon CloudWatch (excessive logs ingestion or custom metrics)
- Amazon RDS (a new replica or scaled-up instance)
- SageMaker (a forgotten training job at p4d pricing)
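If you would rather script this check than click through the console, the Cost Explorer API exposes the same daily-by-service view. A minimal sketch; the dates are placeholders for your own seven-day window, and note that cost data can lag by several hours:
aws ce get-cost-and-usage \
  --time-period Start=2026-04-23,End=2026-04-30 \
  --granularity DAILY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE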
Minute 5 to 15: identify the resource within the service
Once you know the service, drill into Cost Explorer with Group by changed to Resource. AWS does not always populate Resource for every service, so if it shows blank, switch to Usage Type instead. That tells you whether it is data transfer, instance hours, request count, or storage.
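The same drill-down works from the CLI. A sketch that filters to one service and groups by usage type; the dates and the service name are placeholders, so swap in whatever spiked for you:
aws ce get-cost-and-usage \
  --time-period Start=2026-04-23,End=2026-04-30 \
  --granularity DAILY \
  --metrics UnblendedCost \
  --filter '{"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Elastic Compute Cloud - Compute"]}}' \
  --group-by Type=DIMENSION,Key=USAGE_TYPE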
Common patterns and what they mean:
- EC2-Other / DataTransfer-Out-Bytes spiked = something is shipping a lot of data outbound. Could be a backup job, a leaked S3 link being downloaded, or compromised credentials being used to exfiltrate data.
- EC2-Other / NatGateway-Bytes spiked = your private subnets are pushing outbound traffic. Same root causes plus internal services that suddenly started calling external APIs in tight loops.
- EC2 / BoxUsage:p4d.24xlarge appeared = somebody launched a massive ML instance. Could be your team, could be a stolen credential. Either way, find it now.
- Lambda / Invocations jumped 1000x = recursion bug or public Function URL being hit by bots. Check the function's CloudWatch invocation metrics and logs for the source.
- S3 / DataTransfer-Out-Bytes spiked = something is pulling from your buckets. Look for buckets that just got publicized, or for new IAM users with broad access.
Minute 15 to 30: stop the bleeding (the actual fix)
The instinct is to investigate first, fix later. The right order is the opposite. Stop the spending, then investigate.
For a runaway EC2 fleet:
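# List everything launched on the spike day; the launch-time filter accepts wildcards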
aws ec2 describe-instances \
--filters "Name=launch-time,Values=2026-04-29*" \
--query 'Reservations[].Instances[].[InstanceId,InstanceType,LaunchTime,Tags[?Key==`Name`].Value|[0]]' \
--output table
# Once you have a list of suspicious instance IDs:
aws ec2 stop-instances --instance-ids i-abc123 i-def456
For a Lambda runaway:
# Set concurrency to 0 to immediately stop invocations
aws lambda put-function-concurrency \
--function-name my-function \
--reserved-concurrent-executions 0
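Once the recursion bug is fixed and deployed, remove the throttle. This assumes the function had no reserved concurrency before the incident; if it did, restore the previous value instead:
aws lambda delete-function-concurrency --function-name my-function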
For S3 egress that looks malicious:
# Block public access at the bucket level immediately
aws s3api put-public-access-block \
--bucket suspicious-bucket \
--public-access-block-configuration \
"BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
For an unknown ML instance you did not launch yourself:
# Stop training jobs and delete endpoints. For notebook instances, delete rather than stop:
# stopped notebooks still bill for their attached storage. Substitute your real resource names.
aws sagemaker stop-training-job --training-job-name <training-job-name>
aws sagemaker delete-endpoint --endpoint-name <endpoint-name>
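If you do not know which names to plug in, these read-only calls list what is currently running and billing:
aws sagemaker list-training-jobs --status-equals InProgress
aws sagemaker list-endpoints
aws sagemaker list-notebook-instances --status-equals InService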
If anything looks like compromised credentials, rotate the suspect IAM user's access keys before doing anything else. Our leaked AWS credentials playbook covers the full rotation flow.
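The fastest stopgap while you work through that playbook is disabling the key. A sketch: the user name is a placeholder and the key ID is the standard AWS documentation example. Disable rather than delete so the key remains visible for the investigation:
aws iam update-access-key \
  --user-name suspect-user \
  --access-key-id AKIAIOSFODNN7EXAMPLE \
  --status Inactive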
Minute 30 to 45: figure out whether it was a bug, a bot, or a breach
Now that the bleeding has stopped, root cause matters. The three categories below cover almost every spike.
Bug: your own code or infrastructure is wasteful
Signs:
- The spike correlates with a deploy timestamp.
- The runaway resource is owned by a known service or team.
- CloudTrail shows the API calls came from one of your existing IAM roles.
Common bugs that produce six-figure spikes:
- Lambda recursion. Function A invokes Function B, which invokes A. Reserved concurrency at zero stops the loop instantly.
- SQS dead-letter queue infinite redrive. Failed messages flow to DLQ, DLQ has a redrive policy back to the main queue. Unbounded retries at $0.40 per million requests is fine until you hit a billion retries.
- NAT Gateway data transfer for traffic that should have been a VPC endpoint. Calls to S3, DynamoDB, or other AWS services from a private subnet without VPC endpoints route through NAT and bill at $0.045 per GB. A fix is sketched after this list.
- CloudWatch Logs ingestion runaway. Verbose debug logging accidentally enabled in production. $0.50 per GB ingested adds up fast.
- Unintended S3 lifecycle rule. Glacier retrieval at scale is expensive. So is moving 50 TB into Intelligent-Tiering and immediately reading it.
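The NAT Gateway item has a cheap, permanent fix: gateway VPC endpoints for S3 and DynamoDB cost nothing, so that traffic stops routing through NAT entirely. A sketch with placeholder VPC, region, and route table values:
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc1234567890def \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0abc1234567890def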
Bot: external traffic is abusing a public surface
Signs:
- The spike correlates with global traffic patterns, not deploys.
- CloudFront, ALB, or Lambda Function URL request counts are 100x normal.
- Traffic is from a single ASN or a small set of IPs.
Mitigation:
- Put CloudFront in front of any public Lambda Function URL or ALB.
- Add AWS WAF rate-limiting rules. Default to 2000 requests per 5 minutes per IP and tighten from there; see the sketch after this list.
- If the abuse is on S3 directly, switch to CloudFront-fronted access with signed URLs.
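For the rate-limiting rule, here is a minimal sketch of a regional web ACL with the 2000-requests-per-5-minutes threshold mentioned above. The names are placeholders, and you still have to associate the ACL with your ALB or API afterwards:
aws wafv2 create-web-acl \
  --name spike-rate-limit \
  --scope REGIONAL \
  --default-action Allow={} \
  --visibility-config SampledRequestsEnabled=true,CloudWatchMetricsEnabled=true,MetricName=spike-rate-limit \
  --rules '[{
    "Name": "rate-limit-per-ip",
    "Priority": 0,
    "Statement": {"RateBasedStatement": {"Limit": 2000, "AggregateKeyType": "IP"}},
    "Action": {"Block": {}},
    "VisibilityConfig": {"SampledRequestsEnabled": true, "CloudWatchMetricsEnabled": true, "MetricName": "rate-limit-per-ip"}
  }]'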
Breach: credentials are compromised
Signs:
- The spike includes resources you did not launch (especially p4d, p5, or trn1 ML instances).
- CloudTrail shows API calls from an IAM user or role that should not be active, or from an unusual region you do not deploy to.
- New IAM users were created recently.
- You see RunInstances calls from regions like ap-east-1 or me-south-1 that your team never uses.
If credentials are compromised, treat it as a security incident, not just a cost incident. The cost spike is the symptom; the breach is the disease. Our data breach response plan covers the full IR flow. At minimum:
- Rotate the suspect access key immediately.
- Enable MFA on the IAM user (or, better, delete the IAM user and use IAM Identity Center).
- Search GitHub, GitLab, and any public Slack workspaces for the leaked key. Exposed env files are the most common leak vector.
- Pull CloudTrail logs for the last 30 days and look for any other suspicious activity.
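For that last step, CloudTrail event history answers the targeted question "who launched the instances you do not recognize?" faster than scrolling the console. A sketch; adjust the event name, time window, and region to your spike, and remember that event history is per-region:
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances \
  --start-time 2026-04-29T00:00:00Z \
  --end-time 2026-04-30T00:00:00Z \
  --query 'Events[].{Time:EventTime,User:Username,Source:EventSource}' \
  --output table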
Minute 45 to 60: prevent the next one
Stopping a single spike is reactive. Make sure it does not happen again.
- Set up AWS Budgets alerts. Free, takes 10 minutes. Alerts you the next time actual or forecasted spend crosses the threshold you set, for example 25 percent above your normal daily run rate.
- Enable AWS Cost Anomaly Detection. Different from Budgets. Uses ML to flag unusual patterns by service or linked account.
- Set service quotas. The default Lambda concurrent executions quota is 1,000. Lower it to 100 in dev accounts.
- Use SCPs to block expensive actions in non-prod accounts. A simple Service Control Policy that denies RunInstances for instance types containing "p4d" prevents accidental ML launches; see the sketch after this list.
- Rotate IAM access keys regularly. Better yet, stop using static keys entirely and use IAM roles plus IAM Identity Center.
- Audit public surfaces. Lambda Function URLs, S3 buckets, ALBs. Our AWS security checklist has the full list.
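Here is what that SCP could look like, created through Organizations. A sketch: the policy name and the instance-type list are assumptions, so tune them to whatever you consider expensive, and attach the result to your non-prod OU with aws organizations attach-policy.
aws organizations create-policy \
  --name deny-expensive-instance-launches \
  --type SERVICE_CONTROL_POLICY \
  --description "Deny launches of large ML instance types" \
  --content '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {"StringLike": {"ec2:InstanceType": ["p4d.*", "p5.*", "trn1.*"]}}
    }]
  }'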
Audit your AWS access keys today
Stolen credentials are still the most common cause of bill spikes. Use our free Exposure Checker to see if your domain has any leaked secrets in known dumps. Then rotate any keys that look suspect.
The bottom line
A 10x bill spike is almost always one of three things: a recursion bug, a public surface getting hammered, or a stolen credential running expensive instances. The order of operations is always the same. Confirm the spike is real. Find the resource. Stop the bleeding. Find the cause. Prevent the next one.
Most teams skip step three because they are too busy investigating. That is the most expensive mistake. Stop spending first. Investigate after the meter has stopped running.
Related reading: Leaked AWS Credentials 60-Minute Playbook, AWS Security Checklist for Production, AWS IAM Access Denied Troubleshooting, Exposed Env Files Danger, and Data Breach Response Plan.