During my time supporting a US federal client, I led AWS optimization projects that reduced annual cloud spending by over $200,000, a 40% reduction in costs. But here's what made it remarkable: we didn't just cut costs. We actually improved performance, enhanced security, and built more reliable systems.
This wasn't achieved through magic or risky shortcuts. It was the result of systematic analysis, strategic implementation, and a deep understanding of AWS services. Let me show you exactly how we did it.
The Challenge: Growing AWS Costs Without Clear ROI
Like many organizations, the US federal installations I supported had experienced rapid AWS adoption. Teams spun up resources quickly to meet mission requirements, but without centralized oversight, costs spiraled. We faced:
- Over-provisioned resources: EC2 instances sized for peak loads that rarely occurred
- Idle resources: Development and test environments running 24/7
- Unoptimized storage: Data sitting in expensive storage tiers unnecessarily
- Missing Reserved Instances: Steady-state workloads running on expensive On-Demand pricing
- No cost visibility: Teams didn't know what they were spending or why
The mission was clear: reduce costs significantly without compromising operational capabilities or security posture.
Strategy 1: Comprehensive Infrastructure Audit
You can't optimize what you can't measure. Our first step was gaining complete visibility into the AWS environment.
What We Did
- AWS Cost Explorer deep dive: Analyzed six months of spending patterns to identify the biggest cost drivers
- Resource inventory: Tagged every resource with owner, project, and environment metadata
- Utilization analysis: Used CloudWatch metrics to identify underutilized resources
- Reserved Instance analysis: Identified steady-state workloads perfect for RI commitments
Pro Tip
Use AWS Cost Explorer's hourly granularity to identify exact usage patterns. We discovered that 60% of our development EC2 instances could run on schedules rather than 24/7, immediately cutting those costs by 65%.
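To make the audit step concrete, here is a minimal boto3 sketch of the kind of Cost Explorer query that surfaces the biggest cost drivers. The date range and top-10 cutoff are illustrative, and it assumes the Cost Explorer API is enabled for the account; hourly granularity additionally requires opting in and only covers recent usage, so the sketch uses daily granularity.

    import boto3

    ce = boto3.client('ce')

    # Daily cost grouped by service over an illustrative six-month window
    kwargs = dict(
        TimePeriod={'Start': '2024-01-01', 'End': '2024-07-01'},
        Granularity='DAILY',
        Metrics=['UnblendedCost'],
        GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}],
    )

    totals = {}
    while True:
        response = ce.get_cost_and_usage(**kwargs)
        for period in response['ResultsByTime']:
            for group in period['Groups']:
                service = group['Keys'][0]
                amount = float(group['Metrics']['UnblendedCost']['Amount'])
                totals[service] = totals.get(service, 0.0) + amount
        token = response.get('NextPageToken')
        if not token:
            break
        kwargs['NextPageToken'] = token

    # Print the ten most expensive services for the window
    for service, cost in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]:
        print(f'{service}: ${cost:,.2f}')

A query like this is a starting point for the deeper per-resource analysis; the tagging work is what lets you slice the same data by owner, project, and environment.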
Key Findings
The audit revealed eye-opening patterns:
- 23% of EC2 instances had CPU utilization below 5%
- Dev/test environments consumed 35% of monthly spend but only needed 40 hours/week of uptime
- $47K annually spent on EBS snapshots older than 90 days with no retention policy
- 82% of steady-state compute workloads were running On-Demand instead of Reserved Instances
Strategy 2: Right-Sizing EC2 Instances
The most impactful quick win came from EC2 right-sizing. Many instances were massively over-provisioned based on "just in case" sizing decisions made months or years earlier.
The Process
- Collected 30 days of CloudWatch metrics for CPU, memory, network, and disk I/O
- Identified candidates: Instances consistently using less than 40% of provisioned capacity (see the utilization sketch after this list)
- Calculated optimal sizes: Matched actual usage patterns to appropriate instance types
- Tested in dev/test first: Validated performance before production changes
- Implemented gradually: Changed production instances during maintenance windows
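As an illustration of the first two steps, here is a simplified sketch of flagging under-utilized instances from CloudWatch data. The 40% threshold mirrors the one above; memory, network, and disk checks are omitted for brevity (memory requires the CloudWatch agent, since EC2 does not report it natively).

    import boto3
    from datetime import datetime, timedelta, timezone

    ec2 = boto3.client('ec2')
    cloudwatch = boto3.client('cloudwatch')

    end = datetime.now(timezone.utc)
    start = end - timedelta(days=30)

    # Walk all running instances and compute their 30-day average CPU
    paginator = ec2.get_paginator('describe_instances')
    for page in paginator.paginate(Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]):
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                instance_id = instance['InstanceId']
                stats = cloudwatch.get_metric_statistics(
                    Namespace='AWS/EC2',
                    MetricName='CPUUtilization',
                    Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                    StartTime=start,
                    EndTime=end,
                    Period=3600,  # hourly datapoints
                    Statistics=['Average'],
                )
                datapoints = stats['Datapoints']
                if not datapoints:
                    continue
                avg_cpu = sum(dp['Average'] for dp in datapoints) / len(datapoints)
                if avg_cpu < 40:
                    print(f'{instance_id} ({instance["InstanceType"]}): '
                          f'avg CPU {avg_cpu:.1f}% -- right-sizing candidate')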
Real Example
We had a fleet of m5.2xlarge instances (8 vCPU, 32GB RAM) running web applications. Analysis showed average CPU at 12% and memory at 18%. We downsized to m5.large (2 vCPU, 8GB RAM) and saved $68,000 annually with zero performance degradation.
Results from Right-Sizing
- Reduced 47 instances by 1-3 sizes
- Annual savings: $89,000
- Average performance improvement: 3% (better instance utilization)
Strategy 3: Automated Resource Scheduling
Development, test, and staging environments don't need to run 24/7. We implemented automated start/stop schedules using a combination of Lambda functions and EventBridge rules.
Implementation Details
I created CloudFormation templates that deployed:
- Lambda functions to start/stop EC2 instances, RDS databases, and Redshift clusters
- EventBridge rules scheduled to run weekdays 7 AM - 6 PM (only when teams needed access); the rule wiring is sketched after the Lambda code below
- SNS notifications to alert teams before scheduled shutdowns
- Tag-based targeting using "Environment:Dev" and "AutoSchedule:True" tags
import boto3
import os

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    action = os.environ['ACTION']  # START or STOP

    # Find instances with AutoSchedule tag
    instances = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:AutoSchedule', 'Values': ['True']},
            {'Name': 'tag:Environment', 'Values': ['Dev', 'Test']}
        ]
    )

    instance_ids = []
    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            instance_ids.append(instance['InstanceId'])

    if instance_ids:
        if action == 'STOP':
            ec2.stop_instances(InstanceIds=instance_ids)
        else:
            ec2.start_instances(InstanceIds=instance_ids)

    return {
        'statusCode': 200,
        'body': f'{action} completed for {len(instance_ids)} instances'
    }
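Because the ACTION value comes from an environment variable, the pattern uses two function deployments (start and stop), each triggered by its own EventBridge rule. Here is a minimal sketch of that wiring; the function ARNs and rule names are hypothetical, and the cron expressions assume a rough Eastern-to-UTC conversion (EventBridge evaluates cron in UTC).

    import boto3

    events = boto3.client('events')
    lambda_client = boto3.client('lambda')

    START_LAMBDA_ARN = 'arn:aws:lambda:us-east-1:111111111111:function:scheduler-start'  # hypothetical
    STOP_LAMBDA_ARN = 'arn:aws:lambda:us-east-1:111111111111:function:scheduler-stop'    # hypothetical

    schedules = [
        ('dev-start-weekdays', 'cron(0 12 ? * MON-FRI *)', START_LAMBDA_ARN),  # ~7 AM Eastern
        ('dev-stop-weekdays',  'cron(0 23 ? * MON-FRI *)', STOP_LAMBDA_ARN),   # ~6 PM Eastern
    ]

    for rule_name, cron, lambda_arn in schedules:
        # Create (or update) the scheduled rule and point it at the Lambda
        rule = events.put_rule(Name=rule_name, ScheduleExpression=cron, State='ENABLED')
        events.put_targets(Rule=rule_name, Targets=[{'Id': '1', 'Arn': lambda_arn}])
        # Allow EventBridge to invoke the function
        lambda_client.add_permission(
            FunctionName=lambda_arn,
            StatementId=f'{rule_name}-invoke',
            Action='lambda:InvokeFunction',
            Principal='events.amazonaws.com',
            SourceArn=rule['RuleArn'],
        )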
Scheduling Results
- Automated 127 non-production resources
- Reduced non-production runtime from 168 hours/week to 55 hours/week (67% reduction)
- Annual savings: $52,000
Strategy 4: Storage Optimization and Lifecycle Policies
S3 and EBS storage costs were quietly consuming budget. We implemented intelligent lifecycle policies to automatically move data to cost-effective storage tiers.
S3 Lifecycle Policies
- Transition to S3-IA after 30 days for infrequently accessed data
- Move to Glacier after 90 days for compliance archives
- Delete old versions after 365 days for versioned buckets
- Intelligent-Tiering for unpredictable access patterns
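For reference, here is roughly what the transition and expiration rules above look like when applied with boto3. The bucket name, prefix, and day counts are illustrative; Intelligent-Tiering would be a separate rule using the INTELLIGENT_TIERING storage class.

    import boto3

    s3 = boto3.client('s3')

    s3.put_bucket_lifecycle_configuration(
        Bucket='example-archive-bucket',  # hypothetical
        LifecycleConfiguration={
            'Rules': [
                {
                    'ID': 'tier-and-archive',
                    'Status': 'Enabled',
                    'Filter': {'Prefix': ''},  # apply to the whole bucket
                    'Transitions': [
                        {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                        {'Days': 90, 'StorageClass': 'GLACIER'},
                    ],
                    # Clean up old object versions in versioned buckets
                    'NoncurrentVersionExpiration': {'NoncurrentDays': 365},
                },
            ]
        },
    )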
EBS Snapshot Management
- Implemented Data Lifecycle Manager to automate snapshot creation and deletion
- Deleted 1,200+ orphaned snapshots from terminated instances
- Set 30-day retention for dev/test, 90-day for production
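A Data Lifecycle Manager policy along these lines covers the dev/test case; the role ARN, tag values, and schedule are illustrative placeholders, not the exact policy we deployed.

    import boto3

    dlm = boto3.client('dlm')

    dlm.create_lifecycle_policy(
        ExecutionRoleArn='arn:aws:iam::111111111111:role/AWSDataLifecycleManagerDefaultRole',  # hypothetical
        Description='Daily snapshots for dev/test volumes, 30-day retention',
        State='ENABLED',
        PolicyDetails={
            'ResourceTypes': ['VOLUME'],
            'TargetTags': [{'Key': 'Environment', 'Value': 'Dev'}],
            'Schedules': [
                {
                    'Name': 'daily-30-day-retention',
                    'CreateRule': {'Interval': 24, 'IntervalUnit': 'HOURS', 'Times': ['03:00']},
                    'RetainRule': {'Count': 30},  # keep the 30 most recent snapshots
                    'CopyTags': True,
                }
            ],
        },
    )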
Common Mistake
Don't just delete old snapshots without understanding dependencies. We discovered several "backup" snapshots that were actually gold images for critical systems. Always audit before deletion.
Storage Results
- S3 costs reduced by 34% through lifecycle policies
- Eliminated $47,000 in unnecessary snapshot storage
- Annual savings: $38,000
Strategy 5: Reserved Instances and Savings Plans
After right-sizing and optimization, we had clear visibility into steady-state workloads. This made Reserved Instance planning straightforward and low-risk.
Our RI Strategy
- Conservative approach: Only committed to 60% of steady-state usage (reducing risk)
- 1-year terms: Maintained flexibility for mission changes
- Standard RIs: For known, stable workloads (databases, core applications)
- Compute Savings Plans: For variable instance families but steady usage
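AWS's own purchase recommendations are a useful starting point for this analysis; the sketch below pulls them for EC2 with a one-year, no-upfront term (we then trimmed the commitment to our conservative 60% target rather than buying the full recommendation).

    import boto3

    ce = boto3.client('ce')

    response = ce.get_reservation_purchase_recommendation(
        Service='Amazon Elastic Compute Cloud - Compute',
        LookbackPeriodInDays='THIRTY_DAYS',
        TermInYears='ONE_YEAR',
        PaymentOption='NO_UPFRONT',
    )

    # Print each recommended purchase with its estimated monthly savings
    for rec in response.get('Recommendations', []):
        for detail in rec.get('RecommendationDetails', []):
            print(detail['InstanceDetails']['EC2InstanceDetails']['InstanceType'],
                  detail['RecommendedNumberOfInstancesToPurchase'],
                  detail['EstimatedMonthlySavingsAmount'])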
RI Results
- Purchased $180K in Reserved Instances
- Average discount: 42% vs On-Demand pricing
- Annual savings: $32,000 (with room to optimize further)
Strategy 6: CloudWatch Monitoring and Continuous Optimization
Cost optimization isn't a one-time project; it's an ongoing process. We implemented comprehensive monitoring to catch cost anomalies and optimization opportunities.
Monitoring Implementation
- AWS Budgets with alerts at 50%, 80%, and 100% of forecasted spend
- Cost anomaly detection using AWS Cost Anomaly Detection service
- Custom CloudWatch dashboards showing cost trends by service, team, and project
- Weekly cost reports sent to engineering leads
- Quarterly optimization reviews to identify new opportunities
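As one concrete piece of that setup, here is a minimal sketch of creating a monthly cost budget with a forecasted-spend alert. The 50%, 80%, and 100% thresholds above would simply be three such notifications; the account ID, dollar amount, and email address are placeholders.

    import boto3

    budgets = boto3.client('budgets')

    budgets.create_budget(
        AccountId='111111111111',  # hypothetical
        Budget={
            'BudgetName': 'monthly-aws-spend',
            'BudgetLimit': {'Amount': '50000', 'Unit': 'USD'},
            'TimeUnit': 'MONTHLY',
            'BudgetType': 'COST',
        },
        NotificationsWithSubscribers=[
            {
                # Alert when forecasted spend crosses 80% of the budget
                'Notification': {
                    'NotificationType': 'FORECASTED',
                    'ComparisonOperator': 'GREATER_THAN',
                    'Threshold': 80.0,
                    'ThresholdType': 'PERCENTAGE',
                },
                'Subscribers': [{'SubscriptionType': 'EMAIL', 'Address': 'cloud-team@example.com'}],
            }
        ],
    )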
Pro Tip
Set up AWS Cost Anomaly Detection early. Within 48 hours of the issue starting, it caught a misconfigured NAT Gateway that was generating $4,000/month in unexpected data transfer charges.
The Complete Savings Breakdown
- Right-sizing EC2 instances: $89,000/year
- Automated resource scheduling: $52,000/year
- Storage optimization and lifecycle policies: $38,000/year
- Reserved Instances and Savings Plans: $32,000/year
- Total: $211,000/year in recurring savings
Key Lessons Learned
1. Start with Visibility
You can't optimize what you can't measure. Invest in tagging, Cost Explorer analysis, and CloudWatch metrics before making changes.
2. Quick Wins Build Momentum
Resource scheduling delivered immediate, visible savings that got stakeholder buy-in for larger optimization projects.
3. Automate Everything
Manual processes don't scale and drift over time. CloudFormation, Lambda, and EventBridge made our optimizations sustainable.
4. Test Before Production
Every right-sizing change went through dev/test validation first. That discipline caught problems early and prevented the kind of production outage that would have erased all of our credibility.
5. Make it Continuous
Set up monitoring and regular reviews. We discovered an additional $30K in savings during quarterly reviews six months after the initial project.
How You Can Apply These Strategies
Whether you're spending $5,000 or $500,000 monthly on AWS, these strategies scale:
- Week 1: Run a comprehensive audit using Cost Explorer and CloudWatch
- Week 2: Implement resource tagging and identify quick wins
- Week 3-4: Deploy resource scheduling for non-production environments
- Week 5-6: Right-size EC2 instances based on utilization data
- Week 7-8: Implement storage lifecycle policies and snapshot cleanup
- Month 3: Analyze RI/Savings Plan opportunities and commit
- Ongoing: Monitor, review quarterly, and continuously optimize
Need Help Optimizing Your AWS Costs?
I offer comprehensive AWS Health Check Audits that identify your specific optimization opportunities. In one week, we'll analyze your environment and deliver an actionable plan with projected savings.
Schedule a Free Consultation →
Conclusion
Saving $200K on AWS wasn't about cutting corners or sacrificing capabilities. It was about understanding what we were paying for, eliminating waste, and optimizing for our actual usage patterns.
The best part? These optimizations improved our infrastructure. Right-sized instances performed better. Automated scheduling reduced security exposure. Better monitoring caught issues faster. We delivered more value while spending less.
Your AWS environment likely has similar opportunities waiting to be discovered. The question isn't whether you can save money; it's how much.