
Taming a Wild EC2 Fleet: A Practical Guide

You open the AWS console for the first time at a new job. There are 47 EC2 instances running. Eleven of them are named instance-1 through instance-11. Three are named test. One is named DO NOT TOUCH. Nobody knows what any of them do.

This is not a hypothetical. This is just what happens when infrastructure grows organically without governance. Someone spun up an instance to test something in 2021, it turned out to be load-bearing, and now it quietly processes critical data while everyone avoids looking at it directly.

I've inherited exactly this situation. Here's what I did to make sense of it, without breaking anything in the process.

The first rule of inheriting someone else's cloud: assume everything is load-bearing until proven otherwise.

Start With an Audit, Not a Cleanup

The instinct when you see chaos is to start cleaning. Resist it. Your first job is observation, not action. You need to understand what exists before you can make any decisions about what should continue to exist.

The fastest way to get a picture of your fleet is through the AWS CLI. Run this and pipe it to a file you can actually read:

```bash
# Get all running instances with their names, IDs, types, and launch times
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].{
    Name: Tags[?Key==`Name`]|[0].Value,
    ID: InstanceId,
    Type: InstanceType,
    Launched: LaunchTime,
    IP: PrivateIpAddress
  }' \
  --output table
```

Do the same for stopped instances. Stopped instances still cost money for their attached EBS volumes, and they're often the forgotten half of a fleet — spun down during an incident and never cleaned up.
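The same query with the state filter flipped catches them; only the filter value changes:

```shell
# Stopped instances: these still bill for their attached EBS volumes
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=stopped" \
  --query 'Reservations[*].Instances[*].{
    Name: Tags[?Key==`Name`]|[0].Value,
    ID: InstanceId,
    Type: InstanceType,
    Launched: LaunchTime
  }' \
  --output table
```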

Build a living inventory

Paste that output into a spreadsheet and add four columns: Owner, Purpose, Last Verified, and Safe to Terminate. Leave the last column blank for now — you won't know the answer for a while, and that's fine. The goal of this phase is documentation, not decisions.
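If copying the table output is awkward, --output text gives tab-separated rows that paste cleanly into most spreadsheet tools (same query, with a list projection instead of a hash):

```shell
# Tab-separated rows: name, instance ID, type, launch time
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].[Tags[?Key==`Name`]|[0].Value, InstanceId, InstanceType, LaunchTime]' \
  --output text
```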

Pro tip

Check CloudWatch metrics for each instance. An instance with zero CPU activity for 30+ days is a strong candidate for either termination or investigation. Go to CloudWatch → Metrics → EC2 → Per-Instance Metrics and sort by CPUUtilization.
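The same check is scriptable. A sketch with the CLI; the instance ID is a placeholder, and the date invocation is GNU date (on macOS, use `date -v-30d` instead):

```shell
# Average and peak CPU per day over the last 30 days for one instance
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --start-time "$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 86400 \
  --statistics Average Maximum
```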

Establish a Tagging Standard Before Touching Anything

Tags are the only thing standing between you and another 47-instance mystery six months from now. Before you do any cleanup, define a tagging standard and apply it to everything — even the instances you don't fully understand yet.

A minimal tag set that actually helps:

- Name — what the instance is, specifically (api-processor-prod, not instance-12)
- Environment — prod, staging, or dev
- Owner — a team name, not a person (people leave; teams usually persist)
- Purpose — one short sentence describing what it does

You can apply tags in bulk from the console using the Tag Editor (Resource Groups → Tag Editor) or via CLI. Either way, block off time and do it systematically — skip one instance and it becomes the new DO NOT TOUCH.

```bash
# Apply tags to an instance
aws ec2 create-tags \
  --resources i-0abc123def456 \
  --tags \
    Key=Name,Value=api-processor-prod \
    Key=Environment,Value=prod \
    Key=Owner,Value=platform-team \
    Key=Purpose,Value=processes-inbound-api-queue
```
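create-tags also accepts multiple resource IDs, which makes a first bulk pass fast. For example, stamping everything you haven't identified yet (the IDs below are placeholders):

```shell
# One call can tag many resources at once
aws ec2 create-tags \
  --resources i-0abc123def456 i-0def456abc789 \
  --tags Key=Owner,Value=unknown Key=Environment,Value=unknown
```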

Identify the Actually Dangerous Ones

Some instances in a wild fleet are genuinely critical and have no redundancy. Before any cleanup work, you need to find them. The signals to look for:

- An Elastic IP or a DNS record pointing at the instance
- Steady CPU or network activity, especially on a schedule (cron-like spikes)
- Security groups that other security groups reference
- A very old launch date combined with no tags and no documentation
- A name like DO NOT TOUCH

For each one you identify as potentially critical, your job is to find out what breaks if it goes away — before it actually goes away at 2am on a Tuesday.
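One quick way to surface externally-depended-on instances is to map Elastic IPs back to the instances holding them; anything with a stable public address probably has something pointed at it:

```shell
# Which instances hold Elastic IPs?
aws ec2 describe-addresses \
  --query 'Addresses[*].{IP: PublicIp, Instance: InstanceId}' \
  --output table
```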

The instance named "DO NOT TOUCH" will always be the most important one in the fleet.

Terminate Carefully, In Waves

Once you have a reasonable inventory and a tagging baseline, you can start the actual cleanup. The key principle is to terminate in small batches with observation windows between them, not all at once.

My approach: stop the instance (don't terminate — stop first), wait 48 hours, watch for alerts or complaints, then terminate. If you're wrong about it being safe to stop, you can start it back up. Once you terminate, the root EBS volume is deleted by default (its DeleteOnTermination flag is set), and its data is gone unless you snapshotted it first.
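The stop-wait-terminate cycle in commands (the instance ID is a placeholder):

```shell
# Reversible: the instance and its EBS volumes survive a stop
aws ec2 stop-instances --instance-ids i-0abc123def456

# 48 hours later, if nothing broke and nobody complained:
aws ec2 terminate-instances --instance-ids i-0abc123def456
```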

Create a snapshot before terminating anything you're unsure about

```bash
# Get volume IDs attached to the instance
aws ec2 describe-instances \
  --instance-ids i-0abc123def456 \
  --query 'Reservations[*].Instances[*].BlockDeviceMappings[*].Ebs.VolumeId'

# Create a snapshot with a description you'll recognize later
aws ec2 create-snapshot \
  --volume-id vol-0abc123 \
  --description "pre-termination snapshot — api-processor-prod — april 2026"
```

What Normal Looks Like After

After going through this process, a healthy EC2 fleet has a few properties: every instance has meaningful tags, you can look at any instance in the console and immediately understand its purpose, stopped instances are either intentionally stopped or don't exist, and you have a runbook somewhere that describes what each critical instance does and how to recover it if it goes away.
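A periodic spot check helps keep it that way. This query lists running instances alongside their Owner tag; a blank in that column means the standard is slipping:

```shell
# Running instances and their Owner tag; untagged ones show as None
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].{ID: InstanceId, Name: Tags[?Key==`Name`]|[0].Value, Owner: Tags[?Key==`Owner`]|[0].Value}' \
  --output table
```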

You won't get there in a week. With a large inherited fleet, realistically you're looking at a few months of incremental work alongside your normal responsibilities. The goal isn't perfection — it's getting to a state where the next person to open the console doesn't feel the same dread you felt on day one.

The instance named DO NOT TOUCH? I finally figured out what it does. It's since been replaced by a Lambda function. But I still kept the name in the new function's description, as a tribute.

Written by
Rajen Tandel

Software developer based in Reston, VA. I work on Angular applications, AWS infrastructure, and agentic AI systems. I write about the practical side of engineering — the stuff that doesn't make it into the docs.