☁️ Complete AWS for DevOps

AWS DevOps Guide

EC2, Auto Scaling, Load Balancers, VPC, Security Groups, NACLs, WAF, VPN, Route 53, S3, RDS, CloudWatch, ECS, Lambda, CloudFormation — all in simple terms.

Chapters

40+

Services

100%

Free

01☁️

Introduction to AWS

Cloud Computing for DevOps

AWS is the world's largest cloud platform with 200+ services. As a DevOps engineer, you don't need to know all 200 — you need to master about 15-20 core services that form the backbone of every production environment. This guide covers exactly those services in simple terms.

Core AWS Concepts — Think of It Like This

🌍

Regions & AZs

AWS groups its data centers into 30+ geographic regions worldwide (for example Mumbai, ap-south-1). Each region has at least three isolated Availability Zones (AZs), each made up of one or more data centers, so if one AZ loses power, your app keeps running in the others.

💰

Pay-As-You-Go

Like electricity: you only pay for what you use. Stop a server at night? You stop paying for its compute time (attached EBS storage is still billed). No upfront cost.

🔐

Shared Responsibility

AWS secures the buildings, power, and network cables. YOU secure what you put inside — your data, your passwords, your firewall rules.

AWS CLI Setup

TERMINAL
# Install AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o awscli.zip
unzip awscli.zip && sudo ./aws/install

# Configure your credentials
aws configure
# Access Key ID: AKIA...
# Secret Key: ...
# Region: ap-south-1 (Mumbai)
# Output: json

# Test: who am I?
aws sts get-caller-identity

02🔐

IAM — Identity & Access

Who Can Do What in Your AWS Account

IAM controls access to your entire AWS account. Think of it as the security guard at the building entrance — checking IDs, giving visitor passes, and making sure nobody enters restricted areas without permission.

IAM Building Blocks

👤

Users

Individual people. Each gets their own username and password. Suresh, Priya, Rahul — each has their own IAM user.

👥

Groups

Collections of users. Create a group called "Developers" and put Priya and Rahul in it. Attach permissions to the group — all members get those permissions.

🎭

Roles

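Temporary credentials that are assumed rather than logged into, most often by SERVICES. An EC2 instance needs to access S3? Give it an IAM Role. No passwords or access keys to store: AWS issues and rotates short-lived credentials automatically.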

📜

Policies

JSON documents that say "allow this action on this resource." Policies attach to users, groups, or roles.

IAM Policy Example

JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-app-bucket/*"
    },
    {
      "Effect": "Deny",
      "Action": "s3:DeleteBucket",
      "Resource": "*"
    }
  ]
}
// Translation: You CAN read and upload files to my-app-bucket.
// You CANNOT delete any bucket. Deny always wins over Allow.
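The "Deny always wins" rule can be sketched in a few lines of Python. This is a simplified model, not the real IAM engine: it assumes a single identity policy, matches wildcards with `fnmatchcase`, and ignores conditions, principals, and permission boundaries. The function name `evaluate` is hypothetical.

```python
from fnmatch import fnmatchcase

# The same policy as in the example above.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"],
         "Resource": "arn:aws:s3:::my-app-bucket/*"},
        {"Effect": "Deny", "Action": "s3:DeleteBucket", "Resource": "*"},
    ],
}


def _as_list(value):
    """Policy fields may hold a single string or a list of strings."""
    return value if isinstance(value, list) else [value]


def evaluate(policy, action, resource):
    """Return 'Deny', 'Allow', or 'ImplicitDeny' for one request."""
    decision = "ImplicitDeny"  # nothing matched yet, so access is refused
    for stmt in policy["Statement"]:
        action_ok = any(fnmatchcase(action, a) for a in _as_list(stmt["Action"]))
        resource_ok = any(fnmatchcase(resource, r) for r in _as_list(stmt["Resource"]))
        if action_ok and resource_ok:
            if stmt["Effect"] == "Deny":
                return "Deny"  # an explicit deny overrides any allow
            decision = "Allow"
    return decision
```

Calling `evaluate(POLICY, "s3:ListBucket", "arn:aws:s3:::my-app-bucket")` falls through to the implicit deny, because no statement mentions that action.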

✓Never use the root account for daily tasks — create an IAM admin user

✓Enable MFA (Multi-Factor Authentication) on every IAM user

✓Use IAM Roles for EC2/Lambda/ECS — never put access keys inside code

✓Follow least privilege: give minimum permissions needed

✓Rotate access keys every 90 days

⚠️ Root Account

Root account has UNLIMITED power — can delete everything, change billing, close the account. Lock it away, enable MFA, and create an IAM admin user for daily work.

03🖥️

EC2 — Elastic Compute

Virtual Servers in the Cloud

EC2 gives you virtual servers (called instances) that you can launch in minutes. It's like renting a computer in Amazon's data center. You choose the size (CPU, RAM) and the operating system (Amazon Linux, Ubuntu), and the instance is typically ready within a minute or two.

Instance Families — Which One to Pick

Family	CPU/RAM	Best For	Example
t3/t4g	Balanced, burstable	Web servers, small apps, dev/test	t3.micro (2 vCPU, 1 GB)
m5/m6i	Balanced, steady	App servers, backend APIs	m5.xlarge (4 vCPU, 16 GB)
c5/c6i	CPU-heavy	Batch processing, CI/CD agents	c5.2xlarge (8 vCPU, 16 GB)
r5/r6i	Memory-heavy	Databases, caches, in-memory apps	r5.xlarge (4 vCPU, 32 GB)
g4dn	GPU	Machine learning, video encoding	g4dn.xlarge (4 vCPU, GPU)

Pricing Models — Save Up to 90%

💵

On-Demand

Pay per hour/second. No commitment. Most expensive but most flexible. Use for dev/test and unpredictable workloads.

💰

Reserved Instances

Commit for 1 or 3 years → save up to 72%. Best for production servers that run 24/7. You KNOW you need this server for a year.

🏷️

Spot Instances

Use spare AWS capacity at the current Spot price → save up to 90%. No bidding is required (you can optionally set a maximum price). BUT AWS can take it back with 2-minute notice. Perfect for batch jobs, CI/CD builds, data processing.

📋

Savings Plans

Commit to spending $X per hour → flexible across instance types. Simpler than Reserved. Recommended for most companies.
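To make the "save up to" percentages concrete, here is a small arithmetic sketch. The $0.10 hourly rate is a made-up example price, not a real EC2 rate, and the discounts are the maximum figures quoted above, so treat the results as best-case estimates.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def monthly_cost(hourly_rate, discount=0.0, hours=HOURS_PER_MONTH):
    """Monthly cost in dollars for one instance at a given discount (0.0 to 1.0)."""
    return round(hourly_rate * (1 - discount) * hours, 2)


# Hypothetical instance priced at $0.10/hour On-Demand, running 24/7.
on_demand = monthly_cost(0.10)        # no commitment
reserved = monthly_cost(0.10, 0.72)   # best-case Reserved Instance discount
spot = monthly_cost(0.10, 0.90)       # best-case Spot discount
```

An instance that only runs during business hours can be modeled by passing a smaller `hours` value, which is often a bigger saving than any pricing model.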

Key Pair & SSH

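TERMINAL
# Create key pair (do this ONCE)
aws ec2 create-key-pair --key-name my-key --query 'KeyMaterial' --output text > my-key.pem
chmod 400 my-key.pem

# Launch instance (AMI IDs are region-specific; use one valid in your region)
aws ec2 run-instances --image-id ami-0c55b159cbfafe1f0 --instance-type t3.micro --key-name my-key \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=web-server}]'

# SSH into your server (its security group must allow port 22 from your IP)
ssh -i my-key.pem ec2-user@<public-ip>

# Check instance status (matches the Name tag set at launch)
aws ec2 describe-instances --filters "Name=tag:Name,Values=web-server"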

💡 AMI

Amazon Machine Image = a snapshot/template of a server. Like a ghost image. Create an AMI of your configured server → launch 100 identical copies from it. This is how auto-scaling works.

04💾

EBS — Elastic Block Store

Hard Drives for Your EC2 Servers

EBS volumes are the hard drives attached to your EC2 instances. When you stop and start an EC2 instance, your data on EBS survives (unlike the instance store, which is temporary). Think of EBS as a USB external hard drive that you can detach and plug into another server in the same Availability Zone.

EBS Volume Types

Type	Speed	Cost	Best For
gp3 (General Purpose SSD)	3000-16000 IOPS	Low	Default choice — boot volumes, apps, databases
io2 (Provisioned IOPS SSD)	Up to 64000 IOPS	High	Mission-critical databases (Oracle, SQL Server)
st1 (Throughput HDD)	500 MB/s max	Very Low	Big data, log processing, data warehouses
sc1 (Cold HDD)	250 MB/s max	Cheapest	Archival, infrequent access

EBS Snapshots

A snapshot is a backup of your EBS volume stored in S3. You can create new volumes from a snapshot in any AZ of the region, and in other regions after copying the snapshot there. This is how you do disaster recovery and migration.

$ aws ec2 create-snapshot --volume-id vol-12345 --description "Daily backup"  # Create snapshot

$ aws ec2 create-volume --snapshot-id snap-12345 --availability-zone ap-south-1a  # Create volume from snapshot

💡 Interview Fact

EBS volumes exist in ONE Availability Zone. If that AZ goes down, the volume is unavailable. Solution: take regular snapshots (stored in S3 across AZs) and recreate volumes from snapshots if needed.

05📈

Auto Scaling Groups

Automatically Add/Remove Servers

Auto Scaling Groups (ASG) automatically add more EC2 instances when traffic increases and remove them when traffic drops. Imagine a restaurant that automatically hires more waiters during dinner rush and sends them home when it's quiet. You pay for far fewer idle servers.

How ASG Works — Step by Step

✓Step 1: Create a Launch Template — defines what type of instance to create (AMI, instance type, key pair, security group)

✓Step 2: Create Auto Scaling Group — set minimum (2), desired (2), maximum (10) instance counts

✓Step 3: Attach to Load Balancer — new instances automatically register with ALB

✓Step 4: Set Scaling Policies — rules that say "when CPU > 70% for 5 minutes, add 2 instances"

✓Step 5: Done! ASG monitors, adds, removes, and replaces instances automatically

Scaling Policies

Policy Type	How It Works	Example
Target Tracking	Keep a metric at a target value	Keep average CPU at 60% — ASG adds/removes instances automatically
Step Scaling	Add X instances when metric crosses threshold	CPU > 70% → add 2, CPU > 90% → add 4
Scheduled	Scale at specific times	Every Monday 9 AM → set to 10 instances, Friday 6 PM → set to 2
Predictive	ML-based prediction	AWS learns your traffic patterns and scales BEFORE the traffic comes
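Target tracking can be approximated with a proportional formula. The sketch below is a simplified model under stated assumptions: it scales capacity by the ratio of the observed metric to the target and clamps the result to the group's min/max. The real service is more conservative when scaling in, and the function name is hypothetical.

```python
import math


def target_tracking_desired(current, metric, target, min_size, max_size):
    """Estimate the desired instance count that brings `metric` back to `target`.

    current  -- instances running now
    metric   -- observed average (e.g. CPU %) across those instances
    target   -- value the policy tries to hold (e.g. 60)
    """
    raw = math.ceil(current * metric / target)  # proportional estimate, rounded up
    return max(min_size, min(max_size, raw))    # never leave the min/max bounds
```

For example, 4 instances at 90% CPU with a 60% target gives `ceil(4 * 90 / 60)`, so the group would grow to 6 instances.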

ASG + ALB Together

ARCHITECTURE
ASG Configuration:
  Min: 2 (always have at least 2 servers running)
  Desired: 4 (normally run 4 servers)
  Max: 10 (never exceed 10 servers)

Launch Template:
  AMI: ami-0c55b (your custom app image)
  Instance Type: t3.medium
  Security Group: sg-web
  User Data:
    #!/bin/bash
    systemctl start nginx

Scaling Policy:
  Target: Average CPU Utilization = 60%
  Cooldown: 300 seconds (wait 5 min between scaling actions)

Load Balancer: arn:aws:elasticloadbalancing:.../my-alb
Health Check: HTTP /health on port 8080

Result:
  Normal day: 4 instances running
  Sale event: ASG scales to 8 instances automatically
  Night time: ASG scales down to 2 instances
  Server crash: ASG replaces unhealthy instance in minutes

💡 Cooldown Period

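After a scaling activity, ASG waits (default 300 seconds) before starting another one. This prevents flip-flopping: adding servers, immediately removing them, adding again. The cooldown applies to simple scaling policies; target tracking and step scaling use an instance warmup time instead. Either way, give your new servers time to warm up.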

06⚖️

Load Balancers

Split Traffic & Stay Available

A Load Balancer distributes incoming traffic across multiple EC2 instances. If one server dies, the load balancer stops sending traffic to it. Users never notice. The three types below cover most use cases (a fourth, Gateway Load Balancer, exists for third-party network appliances).

3 Types of Load Balancers

Type	Layer	Best For	Key Feature
ALB (Application)	Layer 7 (HTTP)	Web apps, APIs, microservices	Path-based routing: /api → backend, /images → CDN
NLB (Network)	Layer 4 (TCP/UDP)	Gaming, IoT, extreme performance	Millions of requests/sec, static IP, ultra-low latency
CLB (Classic)	Layer 4+7	Legacy apps only	Previous generation: don't use for new projects

ALB — Application Load Balancer (Most Common)

🎯

Target Groups

A group of EC2 instances, IPs, or Lambda functions. ALB sends traffic to targets in the group. You can have multiple target groups.

👂

Listeners

Rules on which port to listen. Listener on port 443 (HTTPS) → forward to target group. Listener on port 80 → redirect to 443.

📋

Rules

Conditions that decide where traffic goes. IF path = /api/* THEN forward to api-target-group. IF host = admin.site.com THEN forward to admin-target-group.

🏃

Actions

What to do with matched traffic: forward to target group, redirect to another URL, return fixed response, or authenticate with Cognito.

ALB Routing Examples

ALB RULES
Listener: Port 443 (HTTPS)

Rule 1: IF path = /api/* THEN forward to → api-target-group (port 8080)
Rule 2: IF path = /admin/* THEN forward to → admin-target-group (port 3000)
Rule 3: IF host = images.mysite.com THEN forward to → cdn-target-group
Rule 4: IF path = /old-page THEN redirect to → https://mysite.com/new-page (301)
Rule 5: IF path = /health THEN return fixed response → 200 OK "healthy"
Default: Forward to → web-target-group (port 80)
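Listener rules are checked in priority order and the first match decides; anything unmatched goes to the default action. A minimal Python sketch of that behaviour, using the rule set above, is shown here. The tuple layout, the `route` function, and the use of `fnmatchcase` for wildcards are illustrative assumptions, not the ALB API.

```python
from fnmatch import fnmatchcase

# (priority, host pattern, path pattern, action), mirroring the rules above.
RULES = [
    (1, "*", "/api/*", "forward:api-target-group"),
    (2, "*", "/admin/*", "forward:admin-target-group"),
    (3, "images.mysite.com", "*", "forward:cdn-target-group"),
    (4, "*", "/old-page", "redirect:https://mysite.com/new-page"),
    (5, "*", "/health", "fixed-response:200"),
]
DEFAULT_ACTION = "forward:web-target-group"


def route(host, path):
    """Return the action of the lowest-priority-number rule that matches."""
    for _priority, host_pattern, path_pattern, action in sorted(RULES):
        if fnmatchcase(host, host_pattern) and fnmatchcase(path, path_pattern):
            return action
    return DEFAULT_ACTION  # no rule matched
```

Note the ordering effect: a request for `/api/x` on `images.mysite.com` goes to the API target group, because rule 1 is evaluated before rule 3.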

NLB — Network Load Balancer

NLB works at the TCP level (Layer 4) — it doesn't look at HTTP headers or URLs. It just forwards raw TCP packets. This makes it incredibly fast (millions of requests per second) with ultra-low latency. Use NLB for gaming servers, real-time streaming, and when you need a static IP.

Health Checks

HEALTH CHECK
Health Check Configuration:
  Protocol: HTTP
  Path: /health
  Port: 8080
  Healthy threshold: 3 (pass 3 checks in a row = healthy)
  Unhealthy threshold: 2 (fail 2 checks in a row = unhealthy)
  Interval: 30 seconds (check every 30s)
  Timeout: 5 seconds (wait max 5s for response)

What happens when a server fails health check:
  1. ALB marks it unhealthy
  2. ALB stops sending new traffic to it
  3. Existing connections drain gracefully
  4. ASG detects unhealthy instance (requires the ASG health check type to be set to ELB)
  5. ASG terminates it and launches a replacement
  6. New instance registers with ALB
  7. Health checks pass → traffic resumes

All automatic. Zero human intervention.
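The healthy/unhealthy thresholds count consecutive results, which is why one slow response does not take a server out of rotation. The sketch below replays a list of check results under that assumption; `track_health` is a hypothetical helper, not part of any AWS SDK.

```python
def track_health(results, healthy_threshold=3, unhealthy_threshold=2, healthy=True):
    """Replay check results (True = pass) and return the final health state.

    A target flips to unhealthy after `unhealthy_threshold` failures in a row
    and back to healthy after `healthy_threshold` passes in a row.
    """
    passes = fails = 0
    for ok in results:
        if ok:
            passes, fails = passes + 1, 0  # a pass resets the failure streak
            if not healthy and passes >= healthy_threshold:
                healthy = True
        else:
            fails, passes = fails + 1, 0   # a failure resets the pass streak
            if healthy and fails >= unhealthy_threshold:
                healthy = False
    return healthy
```

With the 30-second interval above, this means a crashed server is detected after roughly a minute, and a recovered one needs about 90 seconds of clean checks before it receives traffic again.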

07🌐

VPC — Virtual Private Cloud

Your Private Network in AWS

A VPC is your own private, isolated network inside AWS. Think of it like your own office building: you control who enters, which rooms connect to which, and who can access the internet. Most resources you launch (EC2, RDS, ECS tasks) run inside a VPC; Lambda functions can optionally be attached to one.

VPC Components — The Building Analogy

🏢

VPC

The entire building. You define the total address space: 10.0.0.0/16 = 65,536 IP addresses. One VPC per environment (dev VPC, prod VPC).

🚪

Subnets

Rooms inside the building. Public subnet = room with window to the street (internet access). Private subnet = internal room (no direct internet). You create subnets in different AZs for high availability.

🌍

Internet Gateway (IGW)

The main entrance door connecting your building to the internet. Attach to VPC → public subnets can reach the internet.

📡

NAT Gateway

A one-way mirror. Private subnet instances can ACCESS the internet (download updates) but the internet CANNOT reach them. Like making phone calls but having an unlisted number.

🗺️

Route Tables

Direction signs inside the building. They tell traffic where to go. Public route table: 0.0.0.0/0 → Internet Gateway. Private route table: 0.0.0.0/0 → NAT Gateway.

Typical Production VPC Architecture

ARCHITECTURE
VPC: 10.0.0.0/16 (65,536 IPs)
│
├── Public Subnet A: 10.0.1.0/24 (AZ: ap-south-1a)
│   ├── ALB (Application Load Balancer)
│   ├── NAT Gateway
│   └── Bastion Host (jump server for SSH)
│
├── Public Subnet B: 10.0.2.0/24 (AZ: ap-south-1b)
│   └── ALB (second AZ for HA)
│
├── Private Subnet A: 10.0.10.0/24 (AZ: ap-south-1a)
│   └── EC2 App Servers (order-service, user-service)
│
├── Private Subnet B: 10.0.20.0/24 (AZ: ap-south-1b)
│   └── EC2 App Servers (replicas)
│
├── DB Subnet A: 10.0.100.0/24 (AZ: ap-south-1a)
│   └── RDS Primary
│
└── DB Subnet B: 10.0.200.0/24 (AZ: ap-south-1b)
    └── RDS Standby (Multi-AZ)

Traffic Flow:
  User → ALB (public) → App Server (private) → RDS (db subnet)
  App Server → NAT Gateway → Internet (for updates)
  Internet ✗→ App Server (blocked: private subnet)
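Before creating a layout like this, it helps to check the CIDR math. The sketch below uses Python's standard `ipaddress` module to confirm that every subnet sits inside the VPC range and that none overlap, and it reports usable addresses per subnet. The subnet names and the `validate` helper are illustrative; the figure of 5 reserved addresses per subnet is AWS's documented behaviour.

```python
import ipaddress

VPC = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "public-a": "10.0.1.0/24",
    "public-b": "10.0.2.0/24",
    "private-a": "10.0.10.0/24",
    "private-b": "10.0.20.0/24",
    "db-a": "10.0.100.0/24",
    "db-b": "10.0.200.0/24",
}
AWS_RESERVED_PER_SUBNET = 5  # AWS keeps the first four and the last address


def validate(vpc, subnets):
    """Return usable IPs per subnet, or raise ValueError on a bad layout."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            raise ValueError(f"{name} ({net}) is outside the VPC range {vpc}")
    names = sorted(nets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if nets[first].overlaps(nets[second]):
                raise ValueError(f"{first} overlaps {second}")
    return {name: net.num_addresses - AWS_RESERVED_PER_SUBNET
            for name, net in nets.items()}
```

Each /24 therefore offers 251 usable addresses rather than 256, which matters when sizing subnets for large Auto Scaling Groups.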

08🛡️

Security Groups & NACLs

Two Layers of Firewall

AWS gives you TWO levels of firewall: Security Groups (instance-level) and NACLs (subnet-level). Think of Security Groups as locks on each room door, and NACLs as security guards at each floor entrance. You need both for proper security.

Security Groups — Room Door Locks

✅

Stateful

If you allow inbound traffic on port 80, the RESPONSE is automatically allowed out. You don't need a separate outbound rule. Smart enough to track connections.

🟢

Allow Only

Security Groups can only ALLOW traffic. There is no "deny" rule. Everything not explicitly allowed is automatically denied.

🔗

Instance Level

Attached directly to EC2 instances, RDS, ALB. Multiple instances can share the same SG.

🔄

Reference Other SGs

Rule: "Allow traffic FROM sg-web-servers" — instead of hardcoding IPs. When new servers join sg-web-servers, they automatically get access.

Security Group Examples

SECURITY GROUPS
# sg-web: for ALB (public-facing)
Inbound:
  Port 80 (HTTP) from 0.0.0.0/0 (anyone on internet)
  Port 443 (HTTPS) from 0.0.0.0/0 (anyone on internet)
Outbound:
  All traffic to 0.0.0.0/0 (allow all outbound)

# sg-app: for application servers (private)
Inbound:
  Port 8080 from sg-web (only from ALB security group)
  Port 22 from sg-bastion (only from bastion/jump server)
Outbound:
  All traffic to 0.0.0.0/0

# sg-db: for RDS database (most restricted)
Inbound:
  Port 3306 from sg-app (only from app servers)
Outbound:
  None needed (stateful: responses auto-allowed)

NACLs — Floor Security Guards

Feature	Security Group	NACL
Level	Instance (EC2, RDS)	Subnet (entire floor)
Stateful/Stateless	Stateful (auto-allows response)	Stateless (must allow inbound AND outbound separately)
Rules	Allow only	Allow AND Deny
Rule Order	All rules evaluated	Rules evaluated in NUMBER order (100, 200, 300...); first match wins
Default	Deny all inbound, allow all outbound	Allow ALL (default NACL)
Use Case	Primary firewall for every resource	Extra layer — block specific IPs, compliance requirements

NACL Example

NACL
# NACL for Public Subnet (rules are evaluated lowest number first; first match wins)
Inbound Rules:
  Rule 50:  DENY ALL from 203.0.113.50/32 (block specific attacker IP; must come before the allows)
  Rule 100: Allow TCP 443 from 0.0.0.0/0 (HTTPS in)
  Rule 200: Allow TCP 80 from 0.0.0.0/0 (HTTP in)
  Rule 300: Allow TCP 22 from 10.0.0.0/16 (SSH from VPC only)
  Rule 400: Allow TCP 1024-65535 from 0.0.0.0/0 (return traffic for outbound requests)
  Rule *:   DENY ALL from 0.0.0.0/0 (default deny everything else)

Outbound Rules:
  Rule 100: Allow TCP 1024-65535 to 0.0.0.0/0 (ephemeral ports for responses)
  Rule 200: Allow TCP 443 to 0.0.0.0/0 (outbound HTTPS)
  Rule *:   DENY ALL to 0.0.0.0/0 (default deny)
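Because a NACL stops at the first rule that matches, a deny for a single IP only works if its rule number is lower than the broader allow. The sketch below models that first-match evaluation; the rule tuple layout and `evaluate_nacl` are hypothetical, and protocols are ignored for brevity.

```python
import ipaddress


def evaluate_nacl(rules, src_ip, port):
    """Return 'allow' or 'deny' for a packet.

    rules -- list of (rule_number, action, cidr, (port_low, port_high))
    Rules are checked in ascending rule number; the first match decides.
    """
    ip = ipaddress.ip_address(src_ip)
    for _number, action, cidr, (low, high) in sorted(rules):
        if ip in ipaddress.ip_network(cidr) and low <= port <= high:
            return action
    return "deny"  # the implicit '*' rule at the end of every NACL


# Deny placed BEFORE the broad allow: the attacker is blocked.
DENY_FIRST = [
    (50, "deny", "203.0.113.50/32", (0, 65535)),
    (100, "allow", "0.0.0.0/0", (443, 443)),
]
# Deny placed AFTER the broad allow: rule 100 matches first, so it never fires.
DENY_LAST = [
    (100, "allow", "0.0.0.0/0", (443, 443)),
    (900, "deny", "203.0.113.50/32", (0, 65535)),
]
```

Running both rule sets against the same attacker IP on port 443 shows the difference: `DENY_FIRST` blocks the request, while `DENY_LAST` lets it through.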

💡 Interview Answer

"Security Groups are like room door locks: stateful, allow-only, attached to individual resources. NACLs are like floor security guards: stateless, allow+deny, applied to entire subnets. Use Security Groups as primary firewall, NACLs as an additional security layer."

09🔗

VPN & VPC Peering

Connect Networks Securely

VPN creates an encrypted tunnel between your office/data center and AWS. VPC Peering connects two VPCs directly. Together, these are how enterprises link on-premises networks to the cloud and AWS networks to each other.

VPN Types

Type	What It Connects	Use Case	How It Works
Site-to-Site VPN	Your office network → AWS VPC	Connect entire office to AWS. All office users access AWS resources.	Encrypted IPSec tunnel between your router and AWS Virtual Private Gateway.
Client VPN	Individual laptop → AWS VPC	Remote employee needs to access private AWS resources from home.	OpenVPN-based. User installs VPN client, authenticates, gets access to VPC.
AWS Direct Connect	Your data center → AWS (dedicated)	High-bandwidth, low-latency connection. Not internet-based.	Physical fiber connection from your DC to AWS. Dedicated ports of 1, 10, or 100 Gbps. Most expensive, most reliable.

Site-to-Site VPN — Most Common

ARCHITECTURE
Your Office Router                          AWS
┌──────────┐     Encrypted Tunnel      ┌──────────────┐
│ Customer │ ═══════════════════════   │ Virtual      │
│ Gateway  │       (IPSec VPN)         │ Private      │
│ Device   │                           │ Gateway      │
└──────────┘                           └──────┬───────┘
192.168.0.0/16                                │
(your office network)                  VPC: 10.0.0.0/16
                                       (your AWS network)

Result: Office computers (192.168.x.x) can reach AWS servers (10.0.x.x) through encrypted tunnel.

VPC Peering — Connect Two VPCs

VPC Peering creates a direct network connection between two VPCs. Traffic flows directly through AWS backbone — no internet, no VPN, no encryption overhead. Both VPCs can talk to each other as if they're on the same network.

VPC PEERING
VPC-A (Production): 10.0.0.0/16
        ↕ (VPC Peering Connection)
VPC-B (Staging): 172.16.0.0/16

Rules:
  ✅ VPC-A can talk to VPC-B and vice versa
  ❌ VPC Peering is NOT transitive:
     If A↔B and B↔C, that does NOT mean A↔C
     You need a separate peering for A↔C
  ❌ CIDR ranges must NOT overlap (both can't use 10.0.0.0/16)

For connecting 10+ VPCs: Use Transit Gateway instead
  Transit Gateway = central hub, all VPCs connect to it
  Much simpler than managing 45 peering connections
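The "45 peering connections" figure comes from the full-mesh formula n(n-1)/2, and the non-overlap rule can be checked before you request a peering. A small sketch, with hypothetical helper names, using only the standard library:

```python
import ipaddress


def full_mesh_peerings(vpc_count):
    """Peering connections needed so every VPC can reach every other one."""
    return vpc_count * (vpc_count - 1) // 2


def can_peer(cidr_a, cidr_b):
    """Two VPCs can only be peered if their CIDR ranges do not overlap."""
    return not ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))
```

Ten VPCs need `full_mesh_peerings(10)` connections, whereas a Transit Gateway needs only one attachment per VPC, which is why the hub model scales better.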

VPC Endpoint Types

Endpoint Type	What It Is	Use Case
Gateway Endpoint	Free, for S3 and DynamoDB only	EC2 in private subnet accessing S3 without going through NAT (saves NAT costs)
Interface Endpoint	Creates ENI in your subnet	Access 80+ AWS services privately (CloudWatch, SQS, ECR, Secrets Manager) without internet
Gateway Load Balancer Endpoint	For third-party appliances	Route traffic through firewall appliances (Palo Alto, Fortinet) before reaching your app

💡 Save Money

Gateway Endpoints for S3 are FREE. If your app servers in private subnets access S3 heavily, use a Gateway Endpoint instead of routing through NAT Gateway (which charges per GB).

10🔥

WAF — Web Application Firewall

Protect Your Apps from Attacks

WAF sits in front of your ALB or CloudFront and inspects every HTTP request before it reaches your application. It blocks SQL injection, cross-site scripting (XSS), bot traffic, and other common web attacks. Think of it as a smart security guard who reads every letter before delivering it.

WAF Components

📋

Web ACL

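A collection of rules. Attached to ALB or CloudFront. Each request is checked against the rules in priority order until one matches with a terminating action (allow or block); otherwise the Web ACL's default action applies.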

📜

Rules

Individual checks: "block if request contains SQL injection", "allow if IP is in whitelist", "rate limit to 1000 requests per 5 minutes per IP".

📦

Rule Groups

Pre-packaged sets of rules. AWS Managed Rule Groups cover OWASP Top 10, known bad IPs, bots, and more. Enable with one click.

WAF Rule Types

Rule Type	What It Does	Example
IP Set	Allow/block specific IPs	Block known malicious IPs or allow only office IPs
Rate-based	Block IPs sending too many requests	Block if > 2000 requests in 5 minutes (DDoS protection)
SQL Injection	Detect SQL injection in request body/URL	Block: /api?id=1; DROP TABLE users
XSS	Detect cross-site scripting attempts	Block: <script>alert("hack")</script>
Geo Match	Allow/block by country	Block traffic from countries you don't serve
AWS Managed	Pre-built rule groups by AWS	AWSManagedRulesCommonRuleSet covers OWASP Top 10
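A rate-based rule boils down to counting requests per IP inside a time window. The sketch below is a simplified sliding-window model, not how AWS WAF is implemented internally: it stops counting once an IP is blocked, and the class name and method are hypothetical.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most `limit` requests per IP within `window_seconds`."""

    def __init__(self, limit=2000, window_seconds=300):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip, now):
        """Return True if the request at time `now` (seconds) may pass."""
        hits = self._hits[ip]
        while hits and hits[0] <= now - self.window:
            hits.popleft()  # forget requests that fell out of the window
        if len(hits) >= self.limit:
            return False    # over the limit: block
        hits.append(now)
        return True
```

Each IP is tracked separately, so one noisy client hitting the limit does not affect anyone else.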

✓Attach WAF to ALB or CloudFront (not directly to EC2)

✓Enable AWS Managed Rules: Common Rule Set + Known Bad Inputs + IP Reputation

✓Add rate limiting: 2000 requests/5 min per IP for APIs

✓Log all blocked requests to S3 for security analysis

✓Use WAF + Shield Standard (free DDoS protection) together

11🌍

Route 53 — DNS

Domain Names & Traffic Routing

Route 53 is AWS's DNS service. It translates human-readable domain names (myapp.com) into IP addresses (52.66.123.45) that computers understand. It also does health checks and intelligent traffic routing.

Routing Policies

Policy	How It Works	Use Case
Simple	One domain → one IP/resource	Basic website, single server
Weighted	Split traffic by percentage	80% to v2, 20% to v3 (canary deployment)
Latency	Route to the region with the lowest latency	Global app: users in India → Mumbai, users in US → Virginia
Failover	Primary → Backup if primary is down	Primary in Mumbai, backup in Singapore. Auto-failover on health check failure.
Geolocation	Route by user's country	Indian users → Indian servers, US users → US servers
Multi-value	Return multiple healthy IPs	Simple load balancing at DNS level
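Weighted routing picks a record with probability equal to its weight divided by the sum of all weights. The sketch below simulates that selection for a canary split; `pick_weighted` and the record names are illustrative, and real DNS caching (TTL) means the observed split is only approximate.

```python
import random


def pick_weighted(records, rng=random):
    """Choose one record name; `records` maps name -> weight.

    A record with weight 0 is never returned as long as another
    record has a positive weight.
    """
    names = list(records)
    weights = [records[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Canary split from the table above: 80% to v2, 20% to v3.
CANARY = {"v2": 80, "v3": 20}
```

Shifting traffic during a rollout is then just a matter of changing the weights, for example from 80/20 to 50/50 to 0/100.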

Common DNS Record Types

Record	Points To	Example
A	IPv4 address	myapp.com → 52.66.123.45
AAAA	IPv6 address	myapp.com → 2001:db8::1
CNAME	Another domain name	www.myapp.com → myapp.com
Alias	AWS resource (free, faster)	myapp.com → d111.cloudfront.net (ALB, CloudFront, S3)
MX	Mail server	myapp.com → aspmx.l.google.com (for email)
TXT	Text verification	Used for SSL verification, SPF, DKIM

💡 Alias vs CNAME

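Use Alias (not CNAME) for AWS resources like ALB, CloudFront, S3. Alias queries to AWS resources are free, Alias works at the zone apex (myapp.com without www), and Route 53 answers with the target's IP addresses directly, so the resolver skips the extra CNAME lookup. CNAME doesn't work at zone apex.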

12📦

S3 — Object Storage

Store Anything, Scale Infinitely

S3 stores unlimited files (called objects) with 99.999999999% durability (11 nines — that means if you store 10 million files, you'd lose 1 file every 10,000 years). Used for backups, static websites, Docker images, Terraform state, logs, and data lakes.
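The "1 file every 10,000 years" figure follows directly from the durability number. A short worked calculation, treating 11 nines as the annual per-object survival probability and using `Decimal` to avoid floating-point rounding (the function names are hypothetical):

```python
from decimal import Decimal

DURABILITY = Decimal("0.99999999999")  # 11 nines, per object per year


def expected_annual_losses(object_count):
    """Average number of objects lost per year at the stated durability."""
    return object_count * (1 - DURABILITY)


def years_per_lost_object(object_count):
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)
```

With 10 million objects the expected loss is 0.0001 objects per year, which is one object every 10,000 years on average.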

Storage Classes — Pay Only for What You Need

Class	Access Pattern	Cost	Example
Standard	Frequently accessed	Highest	App assets, active logs, Docker layers
Standard-IA	Infrequent (1-2 times/month)	~40% less	Backups, disaster recovery files
Glacier Instant	Archive, millisecond access	~68% less	Compliance docs you rarely access
Glacier Deep Archive	Archive, 12-hour retrieval	Cheapest	7-year audit logs, old backups

Lifecycle Rules — Automate Cost Savings

LIFECYCLE
# Move objects automatically based on age:
Day 0-30: S3 Standard (frequently accessed)
Day 30-90: S3 Standard-IA (still needed but rarely)
Day 90-365: S3 Glacier Instant (archive, occasional access)
Day 365+: S3 Glacier Deep (long-term archive)
Day 730: DELETE automatically (no longer needed)

# This can save 60-80% on storage costs!
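The schedule above is a simple age-to-class mapping. The sketch below expresses it as a function, using the S3 API names for each class (`STANDARD_IA`, `GLACIER_IR`, `DEEP_ARCHIVE`); the day boundaries are the ones from this example and `EXPIRED` is a label chosen here for deleted objects, not an S3 storage class.

```python
def storage_class_for_age(age_days):
    """Return where an object of the given age sits under the example rules."""
    if age_days >= 730:
        return "EXPIRED"       # deleted by the expiration rule
    if age_days >= 365:
        return "DEEP_ARCHIVE"  # S3 Glacier Deep Archive
    if age_days >= 90:
        return "GLACIER_IR"    # S3 Glacier Instant Retrieval
    if age_days >= 30:
        return "STANDARD_IA"   # S3 Standard-Infrequent Access
    return "STANDARD"
```

In a real bucket these thresholds would be written as transition and expiration actions in a lifecycle configuration, and S3 applies them for you.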

S3 CLI Commands

$ aws s3 mb s3://my-app-bucket  # Create bucket

$ aws s3 cp app.jar s3://my-app-bucket/releases/  # Upload file

$ aws s3 sync ./build s3://frontend-bucket --delete  # Sync folder (deploy React app)

$ aws s3 ls s3://my-app-bucket/ --recursive  # List all objects

$ aws s3 presign s3://bucket/file.pdf --expires-in 3600  # Generate temp download URL (1 hour)

13🗄️

RDS — Managed Databases

MySQL, PostgreSQL, Aurora

RDS manages your database infrastructure: AWS handles patches, backups, replication, and failover. You focus on your data and queries. It's like having a DBA team built into the service.

Supported Engines

Engine	Best For	Special Feature
MySQL	General purpose, most popular	Compatible with existing MySQL apps
PostgreSQL	Complex queries, GIS data	Advanced data types, extensions
Aurora	High performance, auto-scaling	Up to 5x MySQL throughput (AWS's claim), auto-grows storage
SQL Server	Microsoft/.NET shops	Windows auth, SSIS support
MariaDB	MySQL alternative	Open source, community-driven

Multi-AZ — High Availability

Multi-AZ creates a standby replica in a different Availability Zone. If the primary database crashes, AWS automatically fails over to the standby, typically within one to two minutes. Your application doesn't need to change connection strings because the database endpoint (DNS name) stays the same.

Read Replicas — Handle More Traffic

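Read Replicas are copies of your database that handle read queries. Your main database handles writes, replicas handle reads. If your app does 80% reads and 20% writes, moving the reads to replicas leaves the primary with only the 20% write load, so the same primary can support roughly 5x the total traffic (4x more), provided you add enough replicas. Replicas are updated asynchronously, so reads can be slightly stale.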
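The capacity gain from read replicas is easy to estimate. The sketch below is a rough model under stated assumptions: the primary is the bottleneck, every read can be moved to a replica, and the cost of applying replicated writes on the replicas is ignored. The function name and the queries-per-second figures are hypothetical.

```python
import math


def capacity_with_replicas(primary_qps, read_percent, replica_qps):
    """Return (max total QPS, replicas needed) once all reads are offloaded.

    primary_qps  -- queries/sec the primary can serve on its own
    read_percent -- share of traffic that is reads (0-99)
    replica_qps  -- queries/sec one replica can serve
    """
    write_percent = 100 - read_percent
    total_qps = primary_qps * 100 / write_percent  # primary now serves writes only
    read_qps = total_qps * read_percent / 100      # this load moves to replicas
    return total_qps, math.ceil(read_qps / replica_qps)
```

For a primary that tops out at 1,000 queries/sec with an 80/20 read/write mix, this gives 5,000 queries/sec in total, served with the help of four replicas of the same size.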

✓Enable Multi-AZ for production (automatic failover, typically in 1-2 minutes)

✓Create Read Replicas for read-heavy applications

✓Enable automated backups with 7+ day retention

✓Place RDS in PRIVATE subnets (never public!)

✓Use IAM authentication instead of password-only

✓Set up CloudWatch alarms for CPU, connections, and free storage

14📊

CloudWatch — Monitoring

The Eyes of Your AWS Infrastructure

CloudWatch collects metrics (numbers), logs (text), and alarms (alerts) from every AWS service. It's like having security cameras and sensors throughout your building. Without CloudWatch, you're flying blind.

Three Pillars of CloudWatch

📈

Metrics

Numbers over time: CPU 73%, Memory 4.2 GB, Request count 1500/sec. Most AWS services send metrics automatically. EC2 sends CPU, disk I/O, and network out of the box; memory and disk-space usage need the CloudWatch agent. ALB sends request count, latency, error rate.

📝

Logs

Text output from your applications. Store, search, and analyze logs centrally. CloudWatch Logs Agent on EC2 sends /var/log/syslog and app logs to CloudWatch. Replaces the need to SSH and grep.

🔔

Alarms

Automatic alerts when metrics cross thresholds. CPU > 85% for 5 minutes → send SNS notification to Slack/email/PagerDuty. Can also trigger Auto Scaling actions.

Important Default EC2 Metrics

Metric	What It Measures	Alert Threshold
CPUUtilization	Percentage of CPU used	> 85% for 5 min
NetworkIn/Out	Bytes transferred	Unusual spike (DDoS indicator)
StatusCheckFailed	Hardware/software health	> 0 (something is wrong)
DiskReadOps/WriteOps	Disk I/O operations (instance store volumes)	High IOPS = disk bottleneck
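A threshold such as "> 85% for 5 min" means every datapoint in the evaluation window must breach before the alarm fires. The sketch below models that with one datapoint per minute; it is a simplification of CloudWatch's evaluation (which also supports "M out of N" datapoints and missing-data handling), and `alarm_state` is a hypothetical helper.

```python
def alarm_state(datapoints, threshold=85.0, evaluation_periods=5):
    """Return the alarm state for the most recent datapoints.

    datapoints -- metric values, oldest first, one per period
    The alarm fires only if ALL of the last `evaluation_periods`
    values are above the threshold.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(value > threshold for value in recent) else "OK"
```

A single spike to 99% therefore does not page anyone; five breaching minutes in a row does.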

CloudWatch Logs — Search All Logs in One Place

JSON
# Install CloudWatch Agent on EC2
sudo yum install amazon-cloudwatch-agent

# Configure which log files to send
# /opt/aws/amazon-cloudwatch-agent/etc/config.json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          { "file_path": "/var/log/syslog", "log_group_name": "production/syslog" },
          { "file_path": "/var/log/myapp/*.log", "log_group_name": "production/myapp" }
        ]
      }
    }
  }
}

💡 CloudWatch vs Prometheus/Grafana

CloudWatch is AWS-native — zero setup, works immediately. Prometheus/Grafana is open-source — more powerful queries, better dashboards, works across clouds. Many companies use BOTH: CloudWatch for AWS infrastructure, Prometheus/Grafana for application metrics.

15🐳

ECS & EKS — Containers on AWS

Run Docker in Production

ECS (Elastic Container Service) and EKS (Elastic Kubernetes Service) run your Docker containers on AWS. ECS is simpler and AWS-native. EKS gives you full Kubernetes.

Feature	ECS	EKS
Complexity	Simple — AWS proprietary	Complex — full Kubernetes
Learning Curve	Low (if you know Docker)	High (need K8s knowledge)
Portability	AWS only	Multi-cloud, on-prem
Control Plane Cost	Free	$0.10/hour (~$73/month)
Best For	AWS-only teams, simpler apps	K8s teams, multi-cloud, complex apps

ECS with Fargate (Serverless Containers)

Fargate runs your containers WITHOUT managing servers. You define CPU and memory, upload your Docker image, and Fargate handles the rest. No EC2 instances to patch, no clusters to manage. Just containers.

ECR — Elastic Container Registry

$ aws ecr get-login-password | docker login --username AWS --password-stdin 123456.dkr.ecr.ap-south-1.amazonaws.com  # Login to ECR

$ docker tag myapp:latest 123456.dkr.ecr.ap-south-1.amazonaws.com/myapp:latest  # Tag image for ECR

$ docker push 123456.dkr.ecr.ap-south-1.amazonaws.com/myapp:latest  # Push to ECR

16⚡

Lambda — Serverless

Run Code Without Servers

Lambda runs your code in response to events. Someone uploads a file to S3? Lambda processes it. API request comes in? Lambda handles it. You pay only for the milliseconds your code runs. Zero servers to manage.

Lambda Limits

Resource	Limit
Timeout	15 minutes max
Memory	128 MB to 10 GB
Package Size	50 MB zipped, 250 MB unzipped
Concurrent Executions	1000 (default, can increase)
/tmp Storage	512 MB to 10 GB

Common Lambda Triggers for DevOps

📦

S3 Event

File uploaded → Lambda processes it (resize image, scan for malware, parse CSV)

🌐

API Gateway

HTTP request → Lambda handles it (serverless API, webhook handler)

⏰

EventBridge Schedule (CloudWatch Events/Cron)

Every 5 minutes → Lambda runs (cleanup old snapshots, check SSL expiry)

📨

SQS Queue

Message arrives → Lambda processes it (order processing, email sending)

💡 Use Lambda For DevOps

Auto-cleanup old AMIs, rotate secrets, notify Slack on CloudWatch alarms, process CloudTrail logs, auto-tag untagged resources, backup DynamoDB tables. These small automation tasks are perfect for Lambda.
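As an illustration of the S3 trigger, here is a minimal handler sketch. It only extracts the bucket and key from each record of the S3 event notification and returns them; the actual processing step (resize, scan, parse) is left out, and the return shape is an arbitrary choice for this example. Object keys arrive URL-encoded in the event, so they are decoded first.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Lambda entry point for S3 'object created' notifications."""
    uploaded = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the event (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        uploaded.append(f"s3://{bucket}/{key}")
        # Real work (resize image, scan, parse CSV) would go here.
    return {"processed": uploaded}
```

Because the handler is a plain function taking a dictionary, it can be tested locally by passing a hand-written event, with no AWS account involved.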

17🏗️

CloudFormation — IaC

Define Infrastructure in Code

CloudFormation lets you write your entire AWS infrastructure in YAML/JSON files. Instead of clicking 50 buttons in the AWS Console to create a VPC, subnets, EC2, ALB, and RDS — you write one YAML file and CloudFormation creates everything automatically. Delete the stack → everything is cleaned up.

CloudFormation Template

CLOUDFORMATION
AWSTemplateFormatVersion: '2010-09-09'
Description: Single web server with a security group allowing HTTP

Parameters:
  InstanceType:
    Type: String
    Default: t3.micro
    AllowedValues: [t3.micro, t3.small, t3.medium]
  Environment:
    Type: String
    Default: staging

Resources:
  WebSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow HTTP
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0

  WebServer:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: !Ref InstanceType
      ImageId: ami-0c55b159cbfafe1f0  # AMI IDs are region-specific; use one valid in your region
      SecurityGroupIds:
        - !GetAtt WebSecurityGroup.GroupId
      Tags:
        - Key: Name
          Value: !Sub '${Environment}-web-server'
        - Key: Environment
          Value: !Ref Environment

Outputs:
  ServerIP:
    Value: !GetAtt WebServer.PublicIp
    Description: Public IP of web server

CloudFormation CLI

$ aws cloudformation create-stack --stack-name my-stack --template-body file://template.yml  # Create stack

$ aws cloudformation update-stack --stack-name my-stack --template-body file://template.yml  # Update stack

$ aws cloudformation delete-stack --stack-name my-stack  # Delete stack (removes ALL resources)

$ aws cloudformation describe-stacks --stack-name my-stack  # Check stack status

💡 CloudFormation vs Terraform

CloudFormation: AWS-only, deeply integrated, free, no state file to manage. Terraform: multi-cloud (AWS + Azure + GCP), needs state management, more flexible. For AWS-only teams, CloudFormation is simpler. For multi-cloud, use Terraform.

18💼

Interview Questions

40+ AWS DevOps Q&A

The most asked AWS questions in DevOps interviews — from freshers to experienced.

Networking & Security

Security Group vs NACL?

SG: instance-level, stateful, allow-only. NACL: subnet-level, stateless, allow+deny. SG = room door lock, NACL = floor security guard. Use SGs as primary firewall.

Public vs Private Subnet?

Public: has route to Internet Gateway (resources get public IPs). Private: no direct internet route (use NAT Gateway for outbound). Put app servers in private, ALB in public.

NAT Gateway vs Internet Gateway?

IGW: two-way door (internet can reach you). NAT: one-way mirror (you can reach internet, but internet cannot reach you). Private subnets use NAT for updates.

Site-to-Site VPN vs Direct Connect?

VPN: encrypted tunnel over internet, cheap, quick to set up, variable latency. Direct Connect: dedicated fiber connection, expensive, 1-100 Gbps, consistent low latency.

Compute & Scaling

What is Auto Scaling?

Automatically adds EC2 instances when traffic increases, removes when it drops. Min/desired/max capacity. Uses Launch Templates to create identical instances.

ALB vs NLB?

ALB: Layer 7, HTTP/HTTPS, path-based routing, supports WebSocket. NLB: Layer 4, TCP/UDP, static IP, millions of requests/sec. ALB for web apps, NLB for gaming/TCP.

What are ALB Listeners and Rules?

Listener: port+protocol ALB listens on (443 HTTPS). Rules: conditions that route traffic (IF path=/api THEN forward to api-target-group).

Spot vs Reserved vs On-Demand?

On-Demand: pay per hour/second, no commitment. Reserved: 1 or 3 year commitment, up to 72% savings. Spot: unused capacity, up to 90% savings, can be reclaimed with 2-minute notice. Use Reserved for production, Spot for CI/CD.

Storage & Database

EBS vs S3?

EBS: block storage, attached to EC2, like a hard drive. S3: object storage, accessed via HTTP, unlimited size. EBS for OS/databases, S3 for files/backups/static content.

S3 Storage Classes?

Standard (frequent), Standard-IA (infrequent), Glacier Instant (archive), Glacier Deep (long-term). Use Lifecycle Rules to auto-move objects between classes.

RDS Multi-AZ vs Read Replica?

Multi-AZ: standby in another AZ for failover (high availability). Read Replica: copy for read queries (performance). Multi-AZ for HA, Read Replicas for scale.

What is Route 53?

AWS DNS service. Translates domain names to IPs. Supports routing policies: simple, weighted (canary), latency (nearest region), failover (disaster recovery).

Monitoring & IaC

What is CloudWatch?

AWS monitoring: Metrics (CPU, network, and memory via the CloudWatch agent), Logs (centralized log search), Alarms (alert when thresholds crossed). Most AWS services send metrics to CloudWatch automatically.

CloudFormation vs Terraform?

CloudFormation: AWS-only, free, no state file. Terraform: multi-cloud, needs state management, more flexible. CloudFormation for AWS-only, Terraform for multi-cloud.

What is WAF?

Web Application Firewall. Sits in front of ALB/CloudFront. Blocks SQL injection, XSS, bot traffic, rate-limits IPs. Use AWS Managed Rules for OWASP Top 10 protection.

VPC Peering vs Transit Gateway?

Peering: direct 1-to-1 VPC connection, not transitive. Transit Gateway: central hub connecting many VPCs (star topology). Use TGW for 5+ VPCs.

Gateway vs Interface Endpoint?

Gateway: free, S3 and DynamoDB only. Interface: creates ENI in subnet, 80+ AWS services, costs per hour+GB. Use Gateway for S3 (saves NAT costs).

ECS vs EKS?

ECS: simple, AWS-native, free control plane. EKS: full Kubernetes, portable, $73/month control plane. ECS for simplicity, EKS for K8s teams.