You open the AWS console for the first time at a new job. There are 47 EC2 instances running. Eleven of them are named instance-1 through instance-11. Three are named test. One is named DO NOT TOUCH. Nobody knows what any of them do.
This is not a hypothetical. This is just what happens when infrastructure grows organically without governance. Someone spun up an instance to test something in 2021, it turned out to be load-bearing, and now it quietly processes critical data while everyone avoids looking at it directly.
I've inherited exactly this situation. Here's what I did to make sense of it, without breaking anything in the process.
The first rule of inheriting someone else's cloud: assume everything is load-bearing until proven otherwise.
Start With an Audit, Not a Cleanup
The instinct when you see chaos is to start cleaning. Resist it. Your first job is observation, not action. You need to understand what exists before you can make any decisions about what should continue to exist.
The fastest way to get a picture of your fleet is through the AWS CLI. Run this and pipe it to a file you can actually read:
```shell
# Get all running instances with their names, IDs, types, and launch times
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].{
    Name: Tags[?Key==`Name`]|[0].Value,
    ID: InstanceId,
    Type: InstanceType,
    Launched: LaunchTime,
    IP: PrivateIpAddress
  }' \
  --output table
```
Do the same for stopped instances. Stopped instances still cost money for their attached EBS volumes, and they're often the forgotten half of a fleet — spun down during an incident and never cleaned up.
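The same call, filtered to stopped instances, covers that half. One useful extra field is StateTransitionReason, which usually records when and why an instance was stopped (a sketch, using the same query style as above):

```shell
# Stopped instances: who they are and when they stopped
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=stopped" \
  --query 'Reservations[*].Instances[*].{
    Name: Tags[?Key==`Name`]|[0].Value,
    ID: InstanceId,
    StoppedReason: StateTransitionReason
  }' \
  --output table
```

A stop reason like "User initiated (2021-06-01 ...)" tells you immediately how long the attached EBS volumes have been quietly billing.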
Build a living inventory
Paste that output into a spreadsheet and add four columns: Owner, Purpose, Last Verified, and Safe to Terminate. Leave the last column blank for now — you won't know the answer for a while, and that's fine. The goal of this phase is documentation, not decisions.
Check CloudWatch metrics for each instance. An instance with zero CPU activity for 30+ days is a strong candidate for either termination or investigation. Go to CloudWatch → Metrics → EC2 → Per-Instance Metrics and sort by CPUUtilization.
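If you'd rather pull those numbers from the CLI than click through the console, something like this works per instance (the instance ID is a placeholder; the `date -d` syntax is GNU, so on macOS use `date -v-30d` instead):

```shell
# Max daily CPU over the last 30 days for one instance —
# a flat line near zero means nobody is using it
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --start-time "$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 86400 \
  --statistics Maximum \
  --output table
```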
Establish a Tagging Standard Before Touching Anything
Tags are the only thing standing between you and another 47-instance mystery six months from now. Before you do any cleanup, define a tagging standard and apply it to everything — even the instances you don't fully understand yet.
A minimal tag set that actually helps:
- Name — Human-readable, specific. `api-gateway-prod`, not `server-1`
- Environment — `prod`, `staging`, `dev`
- Owner — Team or individual responsible
- Purpose — One sentence on what it does
- CreatedDate — When it was provisioned
You can apply tags in bulk from the console using the Tag Editor (Resource Groups → Tag Editor) or via CLI. Either way, block off time and do it systematically — skip one instance and it becomes the new DO NOT TOUCH.
```shell
aws ec2 create-tags \
  --resources i-0abc123def456 \
  --tags \
    Key=Name,Value=api-processor-prod \
    Key=Environment,Value=prod \
    Key=Owner,Value=platform-team \
    Key=Purpose,Value=processes-inbound-api-queue
```
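To find the instances that still need a Name tag at all, one approach is a simple loop. It's slow on a large fleet (one API call per instance), but it's easy to verify, and the "None" check relies on how the CLI's text output renders a missing value:

```shell
# List every instance that has no Name tag
for id in $(aws ec2 describe-instances \
    --query 'Reservations[].Instances[].InstanceId' --output text); do
  name=$(aws ec2 describe-instances --instance-ids "$id" \
    --query 'Reservations[0].Instances[0].Tags[?Key==`Name`]|[0].Value' \
    --output text)
  if [ "$name" = "None" ]; then
    echo "untagged: $id"
  fi
done
```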
Identify the Actually Dangerous Ones
Some instances in a wild fleet are genuinely critical and have no redundancy. Before any cleanup work, you need to find them. The signals to look for:
- Instances with elastic IPs attached — something is pointed at them by IP directly
- Instances in security groups that have inbound rules from other instances — they're in a dependency chain
- Instances with IAM roles attached — they're doing something with other AWS services
- Instances with high, consistent network traffic — something is talking to them constantly
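The first and third signals in that list can be pulled straight from the CLI. A sketch for surfacing Elastic IPs and attached IAM instance profiles:

```shell
# Elastic IPs and the instances they point at —
# something out there addresses these by IP
aws ec2 describe-addresses \
  --query 'Addresses[].{IP: PublicIp, Instance: InstanceId}' \
  --output table

# Instances with an IAM instance profile attached —
# they have permission to touch other AWS services
aws ec2 describe-instances \
  --query 'Reservations[].Instances[?IamInstanceProfile].{
    ID: InstanceId,
    Profile: IamInstanceProfile.Arn
  } | []' \
  --output table
```

Security-group dependency chains take more digging (`describe-security-groups` and tracing `UserIdGroupPairs`), and network traffic lives in CloudWatch's NetworkIn/NetworkOut metrics.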
For each one you identify as potentially critical, your job is to find out what breaks if it goes away — before it actually goes away at 2am on a Tuesday.
The instance named "DO NOT TOUCH" will always be the most important one in the fleet.
Terminate Carefully, In Waves
Once you have a reasonable inventory and a tagging baseline, you can start the actual cleanup. The key principle is to terminate in small batches with observation windows between them, not all at once.
My approach: stop the instance (don't terminate — stop first), wait 48 hours, watch for alerts or complaints, then terminate. If you're wrong about it being safe to stop, you can start it back up. Once you terminate, the root EBS volume is deleted by default (DeleteOnTermination) unless you snapshotted it first.
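The stop-first workflow is three commands spread across two days (instance ID is a placeholder):

```shell
# Day 1: stop, not terminate — this is reversible
aws ec2 stop-instances --instance-ids i-0abc123def456

# Block until the instance is fully stopped
aws ec2 wait instance-stopped --instance-ids i-0abc123def456

# Day 3, if no alerts and no complaints: terminate for real
aws ec2 terminate-instances --instance-ids i-0abc123def456
```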
Create a snapshot before terminating anything you're unsure about
```shell
# Get volume IDs attached to the instance
aws ec2 describe-instances \
  --instance-ids i-0abc123def456 \
  --query 'Reservations[*].Instances[*].BlockDeviceMappings[*].Ebs.VolumeId'

# Snapshot the volume with a description you'll recognize later
aws ec2 create-snapshot \
  --volume-id vol-0abc123 \
  --description "pre-termination snapshot — api-processor-prod — april 2026"
```
What Normal Looks Like After
After going through this process, a healthy EC2 fleet has a few properties: every instance has meaningful tags, you can look at any instance in the console and immediately understand its purpose, stopped instances are either intentionally stopped or don't exist, and you have a runbook somewhere that describes what each critical instance does and how to recover it if it goes away.
You won't get there in a week. With a large inherited fleet, realistically you're looking at a few months of incremental work alongside your normal responsibilities. The goal isn't perfection — it's getting to a state where the next person to open the console doesn't feel the same dread you felt on day one.
The instance named DO NOT TOUCH? I finally figured out what it does. It's since been replaced by a Lambda function. But I still kept the name in the new function's description, as a tribute.