📊 Observability Stack

Monitoring Complete Guide

Prometheus, Grafana, Loki, Promtail — from installation to production dashboards with real PromQL queries for CPU, RAM, and Kubernetes monitoring.

01📊

Introduction to Monitoring

Why You Need Observability

Monitoring answers one question: "Is my system healthy RIGHT NOW?" Without monitoring, you only find out something is broken when users complain. With monitoring, you detect and fix issues BEFORE users notice. The modern observability stack: Prometheus (metrics), Grafana (dashboards), Loki (logs), Promtail (log collection), Alertmanager (alerts).

📈

Metrics

Numbers over time — CPU 73%, memory 4.2 GB, requests 1500/sec. Prometheus collects these.

📝

Logs

Text output from apps — error messages, request logs. Loki + Promtail collect these.

📊

Dashboards

Visual graphs and charts. Grafana turns raw metrics into beautiful, actionable dashboards.

🔔

Alerts

Automatic notifications when something goes wrong. Alertmanager sends to Slack, PagerDuty, email.

Tool	What It Does	Think of It As
Prometheus	Collects and stores metrics	The data collector — scrapes numbers from your servers every 15 seconds
Grafana	Visualizes metrics	The TV screen — shows beautiful graphs and dashboards
Alertmanager	Sends alerts	The alarm system — pages you when CPU > 90%
Loki	Stores logs	The filing cabinet — stores all your application logs
Promtail	Collects logs	The mail carrier — sends logs from servers to Loki

02🔥

Prometheus

The Metrics Engine

Prometheus PULLS metrics from your services on a fixed schedule (every 15 seconds in this guide's configuration). Your service exposes a /metrics endpoint with numbers; Prometheus scrapes it, stores the samples in its time-series database, and you query them with PromQL. Think of it as a health inspector who visits every restaurant every 15 seconds and writes down temperature, hygiene score, and customer count.
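To make the /metrics idea concrete, here is a minimal Python sketch of the plain-text format a scrape returns. The function name and the sample values are made up for illustration, and the sketch skips label escaping and the HELP/TYPE comment lines; a real service would normally use an official client library instead.

```python
def render_metrics(samples):
    """Render samples as Prometheus text exposition lines.

    samples: list of (metric_name, labels_dict, value) tuples.
    """
    lines = []
    for name, labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counters for an imaginary web service.
body = render_metrics([
    ("http_requests_total", {"method": "GET", "status": "200"}, 1027),
    ("http_requests_total", {"method": "GET", "status": "500"}, 3),
    ("process_open_fds", {}, 12),
])
print(body)
```

Each line is one time series: a metric name, an optional set of labels, and the current value. Prometheus attaches the scrape timestamp itself.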

Installation

TERMINAL
# Docker (quickest)
docker run -d --name prometheus -p 9090:9090 \
  -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

# Access: http://your-server:9090

# Key exporters you need:
#   Node Exporter       → CPU, RAM, disk metrics (install on every server)
#   kube-state-metrics  → Kubernetes pod/node/deploy metrics
#   cAdvisor            → Container-level metrics

Key Exporters

Exporter	Metrics It Provides	Port
Node Exporter	CPU, RAM, disk, network for Linux servers	9100
kube-state-metrics	Pod status, deployments, node conditions	8080
cAdvisor	Container CPU, memory, network	8080
MySQL Exporter	Queries, connections, replication	9104
Nginx Exporter	Requests, connections, response codes	9113

03⚙️

Prometheus Configuration

prometheus.yml Explained

YAML
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s      # How often to collect metrics
  evaluation_interval: 15s  # How often to evaluate alert rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "alert_rules.yml"  # Alert rules file

scrape_configs:
  # Prometheus monitors itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Monitor Linux servers via Node Exporter
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "web1:9100"
          - "web2:9100"
          - "db1:9100"
    # Prometheus visits each target every 15 seconds
    # and collects CPU, RAM, disk, network metrics

  # Monitor Kubernetes
  - job_name: "kube-state-metrics"
    static_configs:
      - targets: ["kube-state-metrics:8080"]

04🔍

PromQL Basics

Query Language for Metrics

PromQL is how you ASK Prometheus questions about your metrics. Think of it as SQL for time-series data. "What is the current CPU usage?" "How many requests per second?" "Show me memory trend over 24 hours."

PromQL Concept	What It Does	Example
Instant Vector	Current value of a metric	node_cpu_seconds_total
Range Vector	Values over a time window	node_cpu_seconds_total[5m]
rate()	Per-second rate of change	rate(http_requests_total[5m])
sum()	Add values together	sum(rate(http_requests_total[5m]))
by (label)	Group results by label	sum(rate(...)) by (instance)
count()	Count number of series	count(node_cpu_seconds_total)
avg()	Average of values	avg(rate(node_cpu_seconds_total[5m]))

Common PromQL Patterns

PROMQL
# Total HTTP requests per second (summed across all series)
sum(rate(http_requests_total[5m]))

# HTTP error rate in percent (5xx errors)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# 95th percentile response latency (aggregate the buckets by "le" first)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Number of running pods
# (the metric is 1 for a pod's current phase and 0 otherwise, so use sum, not count)
sum(kube_pod_status_phase{phase="Running"})
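The percentile query is the least obvious of these patterns, so here is a rough Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation. It is a simplified model of the idea behind histogram_quantile(), not Prometheus's actual implementation; the function name and the bucket counts are invented for illustration.

```python
import math


def quantile_from_buckets(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), ending with (math.inf, total).
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Nothing to interpolate: report the highest known finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Made-up latency buckets in seconds: 50 requests took <= 0.1 s, 90 took <= 0.5 s, ...
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(quantile_from_buckets(0.95, buckets))  # 0.75
```

The result is only as precise as the bucket boundaries: with these buckets, any p95 between 0.5 s and 1.0 s is an interpolated guess.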

rate() vs irate()

rate() calculates the average per-second rate over the entire range (smooth, good for alerts). irate() uses only the last two data points (spiky, good for graphs). Use rate() for alerts, irate() for dashboards.
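The difference is easy to see with numbers. The Python sketch below is a simplified model, assuming made-up counter samples scraped every 15 seconds; unlike Prometheus's real rate(), it does not extrapolate to the edges of the window. The function names are hypothetical.

```python
def increase_with_resets(samples):
    """Total counter increase across samples, treating any drop as a reset to zero."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase_with_resets(samples) / elapsed


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    return simple_rate(samples[-2:])


# (timestamp_seconds, counter_value) pairs; traffic spikes in the final interval.
window = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 490)]
print(simple_rate(window))   # 6.5
print(simple_irate(window))  # 20.0
```

The averaged rate smooths the spike into 6.5 requests/sec, while the instant rate reports the full 20 requests/sec of the last interval. That is why the averaged form is less likely to trigger a flapping alert.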

05🖥️

CPU Monitoring Queries

Complete CPU PromQL with Explanations

These are the exact PromQL queries used in production to monitor CPU usage across all nodes. Each query is explained line by line so you understand what every part does.

1. Total CPU Cores Per Node

PROMQL
count(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}) by (instance)

# What this does:
#   node_cpu_seconds_total   → metric from Node Exporter
#   {mode="idle"}            → filter: only the idle mode (exactly one series per core)
#   {instance=~".*:9100"}    → regex match: all instances on port 9100
#   count(...)               → count the number of matching series = number of cores
#   by (instance)            → group by server
#
# Example output:
#   {instance="web1:9100"} → 4 (web1 has 4 CPU cores)
#   {instance="web2:9100"} → 8 (web2 has 8 CPU cores)

1a. Used CPU in Cores Per Node

PROMQL
sum(rate(node_cpu_seconds_total{mode!="idle", instance=~".*:9100"}[5m])) by (instance)

# What this does:
#   mode!="idle"     → all CPU modes EXCEPT idle (user, system, iowait, etc.)
#   rate(...[5m])    → per-second rate over last 5 minutes
#   sum(...)         → add all non-idle CPU modes together
#   by (instance)    → per server
#
# Example: {instance="web1:9100"} → 2.3 (using 2.3 out of 4 cores)

1b. Available CPU in Cores Per Node

PROMQL
sum(rate(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}[5m])) by (instance)

# Same as above but mode="idle" → shows how much CPU is FREE
# Example: {instance="web1:9100"} → 1.7 (1.7 cores available)

1c. CPU Usage in Percentage Per Node

PROMQL
100 - (
  sum(rate(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}[5m])) by (instance)
  /
  count(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}) by (instance)
  * 100
)

# How it works step by step:
#   1. sum(rate(idle[5m]))  → idle cores (e.g., 1.7)
#   2. count(idle)          → total cores (e.g., 4)
#   3. idle / total * 100   → idle percentage (42.5%)
#   4. 100 - idle%          → USED percentage (57.5%)
#
# Example: {instance="web1:9100"} → 57.5 (57.5% CPU used)

1d. Available CPU in Percentage Per Node

PROMQL
(
  sum(rate(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}[5m])) by (instance)
  /
  count(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}) by (instance)
  * 100
)

# Same calculation without the "100 -" → gives available percentage
# Example: {instance="web1:9100"} → 42.5 (42.5% CPU available)
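The arithmetic behind the two percentage queries can be checked by hand. This short Python sketch reuses the example figures from the comments above (1.7 idle cores out of 4); the function name is hypothetical and the code only mirrors the math, it does not query Prometheus.

```python
def cpu_percentages(idle_cores, total_cores):
    """Return (used_percent, available_percent) for one node."""
    available = idle_cores / total_cores * 100
    return 100 - available, available


used, available = cpu_percentages(idle_cores=1.7, total_cores=4)
print(f"used={used:.1f}% available={available:.1f}%")  # used=57.5% available=42.5%
```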

06💾

RAM Monitoring Queries

Complete Memory PromQL with Explanations

2. Total RAM in GB Per Node

PROMQL
sum(node_memory_MemTotal_bytes{instance!~"default/node-exporter-.*"}) by (instance)
  / 1024 / 1024 / 1024

# node_memory_MemTotal_bytes → total RAM in bytes
# Divide by 1024 three times → bytes → KB → MB → GB
# Example: {instance="web1:9100"} → 15.6 (15.6 GB total RAM)

2a. Available RAM in GB Per Node

PROMQL
sum(node_memory_MemAvailable_bytes{instance!~"default/node-exporter-.*"}) by (instance)
  / 1024 / 1024 / 1024

# MemAvailable = memory that can be used without swapping
# Example: {instance="web1:9100"} → 8.2 (8.2 GB available)

2b. Used RAM in GB Per Node

PROMQL
(
  sum(node_memory_MemTotal_bytes{instance!~"default/node-exporter-.*"}) by (instance)
  -
  sum(node_memory_MemAvailable_bytes{instance!~"default/node-exporter-.*"}) by (instance)
) / 1024 / 1024 / 1024

# Total minus Available = Used
# Example: 15.6 - 8.2 = 7.4 GB used

2c. Available RAM Percentage Per Node

PROMQL
(
  sum(node_memory_MemAvailable_bytes{instance!~"default/node-exporter-.*"}) by (instance)
  /
  sum(node_memory_MemTotal_bytes{instance!~"default/node-exporter-.*"}) by (instance)
) * 100

# Available / Total * 100 = Available %
# Example: 8.2 / 15.6 * 100 = 52.6% available

2d. Used RAM Percentage Per Node

PROMQL
(
  1 - (
    sum(node_memory_MemAvailable_bytes{instance!~"default/node-exporter-.*"}) by (instance)
    /
    sum(node_memory_MemTotal_bytes{instance!~"default/node-exporter-.*"}) by (instance)
  )
) * 100

# 1 minus (Available/Total) = Used fraction → times 100 = Used %
# Example: (1 - 0.526) * 100 = 47.4% RAM used
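All five memory queries derive from just two raw values, MemTotal and MemAvailable. The Python sketch below reproduces the conversions using the example figures from this chapter (15.6 GB total, 8.2 GB available); the function and key names are hypothetical and chosen only for illustration.

```python
GIB = 1024 ** 3  # bytes per GB when dividing by 1024 three times


def memory_summary(total_bytes, available_bytes):
    """Derive the chapter's five memory figures from the two raw byte values."""
    used_bytes = total_bytes - available_bytes
    return {
        "total_gb": total_bytes / GIB,
        "available_gb": available_bytes / GIB,
        "used_gb": used_bytes / GIB,
        "available_pct": available_bytes / total_bytes * 100,
        "used_pct": (1 - available_bytes / total_bytes) * 100,
    }


summary = memory_summary(total_bytes=15.6 * GIB, available_bytes=8.2 * GIB)
for key, value in summary.items():
    print(f"{key}: {value:.1f}")
```

Note that the used and available percentages always add up to 100, so a dashboard usually needs only one of them.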

07☸️

Kubernetes Monitoring Queries

Pod, Node & Cluster Health

These queries monitor your Kubernetes cluster — pod status, crashes, scheduling issues, and node health. Essential for any production K8s environment.

Total Pods by Phase

PROMQL
sum by (phase) (kube_pod_status_phase)

# Shows count of pods in each phase:
#   Running:   45
#   Succeeded: 12
#   Pending:   2
#   Failed:    1
#   Unknown:   0

CrashLoopBackOff Pods — CRITICAL ALERT

PROMQL
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0

# Shows pods stuck in CrashLoopBackOff
# This means the container keeps crashing and K8s keeps restarting it
# ALWAYS set an alert on this: it means something is seriously wrong
# Common causes: wrong config, missing secrets, OOM killed, bad image

Pending Pods — Scheduling Issues

PROMQL
kube_pod_status_phase{phase="Pending"} == 1

# Pods stuck in Pending = K8s cannot schedule them
# Common causes:
#   - Not enough CPU/RAM on any node
#   - Node selector/affinity mismatch
#   - PVC not bound (storage not available)
#   - Taints preventing scheduling

Node Health Status

PROMQL
kube_node_status_condition{condition="Ready", status="true"}

# Shows which nodes are Ready (healthy)
#   Value 1 = node is Ready
#   Value 0 = node is NOT Ready (serious problem!)
# Alert if any node shows 0: workloads may be evicted

More Useful K8s Queries

PROMQL
# Pods NOT running (failed, pending, unknown)
kube_pod_status_phase{phase!="Running", phase!="Succeeded"} == 1

# Container restart count (high restarts = unstable app)
sum(kube_pod_container_status_restarts_total) by (pod, namespace) > 5

# Deployment replicas not matching desired
kube_deployment_status_replicas_available != kube_deployment_spec_replicas

# OOMKilled containers (ran out of memory)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
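Queries like these can also be run from scripts through Prometheus's HTTP API, which serves instant queries at /api/v1/query. The Python sketch below only builds the request URL and parses a response; the helper names, the base URL, and the sample response body are assumptions for illustration, and no network call is made.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against Prometheus's HTTP API."""
    params = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def pods_from_response(body):
    """Extract (namespace, pod) pairs from an instant-vector JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    return [
        (series["metric"].get("namespace"), series["metric"].get("pod"))
        for series in payload["data"]["result"]
    ]


url = instant_query_url(
    "http://prometheus:9090",
    'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0',
)
print(url)

# Hand-written sample in the shape of an instant-vector response.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"namespace": "prod", "pod": "api-7d9f", "reason": "CrashLoopBackOff"},
                "value": [1700000000, "1"],
            }
        ],
    },
})
print(pods_from_response(sample))  # [('prod', 'api-7d9f')]
```

To run it against a live server, fetch the URL with any HTTP client and pass the response text to pods_from_response().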

08📊

Grafana Setup

Beautiful Dashboards

Grafana connects to Prometheus (and Loki) and turns raw metrics into visual dashboards. It does NOT collect data itself — it only DISPLAYS data from other sources.

TERMINAL
# Install Grafana with Docker
docker run -d --name grafana -p 3000:3000 grafana/grafana

# Access: http://your-server:3000
# Default login: admin / admin (change immediately!)

# Add Prometheus as Data Source:
#   1. Settings → Data Sources → Add → Prometheus
#   2. URL: http://prometheus:9090
#   3. Click Save & Test

Import Pre-built Dashboards

✓Go to Dashboards → Import → Enter Dashboard ID

✓Dashboard 1860 — Node Exporter Full (CPU, RAM, disk, network per server)

✓Dashboard 315 — Kubernetes Cluster Monitoring

✓Dashboard 13770 — Kubernetes Pod Monitoring

✓Dashboard 3662 — Prometheus 2.0 Overview

✓These are FREE community dashboards — production-ready in 30 seconds

09📈

Building Custom Dashboards

Create Your Own Panels

Pre-built dashboards are great, but real DevOps engineers build CUSTOM dashboards tailored to their applications and SLAs.

✓Create a new dashboard → Add Panel

✓Select data source: Prometheus

✓Enter PromQL query (from chapters 5-7)

✓Choose visualization: Time Series, Gauge, Stat, Bar, Table

✓Set thresholds: green < 60%, yellow < 80%, red >= 80%

✓Add variables for dynamic filtering (namespace, instance, pod)

✓Save and share with team
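The threshold step above boils down to a simple range check. As a tiny Python sketch of that logic (the function name is hypothetical; Grafana applies thresholds in the panel settings, not in code):

```python
def threshold_color(percent):
    """Map a usage percentage to the panel colors listed in the steps above."""
    if percent < 60:
        return "green"
    if percent < 80:
        return "yellow"
    return "red"


for value in (42.5, 60, 79.9, 80):
    print(value, threshold_color(value))
```

Note the boundaries: exactly 60% is already yellow and exactly 80% is already red.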

10🔔

Alertmanager

Get Notified Before Users Complain

Alertmanager receives alerts from Prometheus and routes them to Slack, PagerDuty, email, or Teams. You define alert RULES in Prometheus and ROUTING in Alertmanager.

Alert Rules (in Prometheus)

YAML
# /etc/prometheus/alert_rules.yml
groups:
  - name: critical_alerts
    rules:
      - alert: HighCPUUsage
        expr: |
          100 - (
            sum(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
            / count(node_cpu_seconds_total{mode="idle"}) by (instance) * 100
          ) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is above 85% for 5 minutes"

      - alert: PodCrashLooping
        expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is CrashLoopBackOff"
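The "for:" field is what keeps a brief spike from paging anyone: the expression must stay true for the whole duration before the alert fires. The Python sketch below is a simplified model of that pending-then-firing behavior, with a hypothetical function name and made-up evaluation results; it ignores details such as staleness handling.

```python
def alert_states(condition_results, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_results: booleans, one per evaluation, True when expr returned a result.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, is_true in enumerate(condition_results):
        now = i * eval_interval_s
        if not is_true:
            active_since = None
            states.append("inactive")
        else:
            if active_since is None:
                active_since = now
            states.append("firing" if now - active_since >= for_s else "pending")
    return states


# Evaluated every 60 s with "for: 2m"; the condition drops out once at the third check.
results = [True, True, False, True, True, True]
print(alert_states(results, eval_interval_s=60, for_s=120))
# ['pending', 'pending', 'inactive', 'pending', 'pending', 'firing']
```

A single false evaluation resets the timer, so the alert only fires after three consecutive true evaluations in this setup.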

Alertmanager Config (routing)

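YAML
# /etc/alertmanager/alertmanager.yml
route:
  receiver: slack-critical   # default receiver when no child route matches
  routes:
    - match:
        severity: critical
      receiver: slack-critical
    - match:
        severity: warning
      receiver: slack-warnings

receivers:
  - name: slack-critical
    slack_configs:
      - channel: "#alerts-critical"
        api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        title: "CRITICAL: {{ .GroupLabels.alertname }}"
        text: "{{ .CommonAnnotations.description }}"

  # Every receiver named in a route must be defined here,
  # otherwise Alertmanager rejects the config. The channel name is a placeholder.
  - name: slack-warnings
    slack_configs:
      - channel: "#alerts-warnings"
        api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        title: "WARNING: {{ .GroupLabels.alertname }}"
        text: "{{ .CommonAnnotations.description }}"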
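Routing is a tree walk: an alert goes to the first child route whose labels match, and falls back to the parent's receiver otherwise. The Python sketch below models that with a dictionary mirroring the route section of the config above. It is a simplification with a hypothetical function name; it covers exact-match routing only and ignores options such as continue, grouping, and regex matchers.

```python
def pick_receiver(route, labels):
    """Return the receiver for an alert: first matching child route, else the parent's."""
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            return pick_receiver(child, labels)
    return route["receiver"]


# Mirrors the route section of the Alertmanager config.
routing_tree = {
    "receiver": "slack-critical",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "slack-critical"},
        {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
    ],
}

print(pick_receiver(routing_tree, {"alertname": "PodCrashLooping", "severity": "critical"}))
print(pick_receiver(routing_tree, {"alertname": "HighCPUUsage", "severity": "warning"}))
print(pick_receiver(routing_tree, {"alertname": "Unlabeled"}))
```

The last call shows the fallback: an alert with no severity label lands on the top-level receiver, which here is the critical channel.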

11📝

Loki — Log Aggregation

Centralized Logging

Loki is like Prometheus but for LOGS. Instead of metrics (numbers), Loki stores log lines (text). You query logs in Grafana using LogQL — similar to PromQL. Think of it as the "grep for your entire infrastructure" — search all logs from all servers in one place.

TERMINAL
# Install Loki with Docker
docker run -d --name loki -p 3100:3100 grafana/loki

# Add Loki as Grafana Data Source:
#   Settings → Data Sources → Add → Loki
#   URL: http://loki:3100

LogQL — Query Language for Logs

LOGQL
# Show all logs from nginx
{job="nginx"}

# Filter logs containing "error"
{job="nginx"} |= "error"

# Filter logs NOT containing "health"
{job="nginx"} != "health"

# Regex filter
{job="nginx"} |~ "status=(500|502|503)"

# Count errors per minute
count_over_time({job="nginx"} |= "error" [1m])
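If the filter operators look unfamiliar, it may help to see their grep-like meaning spelled out. This Python sketch applies the same three filters to a handful of invented nginx-style lines held in memory; it illustrates the semantics only and has nothing to do with how Loki executes queries.

```python
import re

logs = [  # made-up log lines, all assumed to fall within the same minute
    "GET /health status=200",
    'GET /api/users status=500 error="upstream timeout"',
    'POST /api/orders status=502 error="bad gateway"',
    "GET /api/users status=200",
]

contains_error = [line for line in logs if "error" in line]        # |= "error"
without_health = [line for line in logs if "health" not in line]   # != "health"
server_errors = [
    line for line in logs if re.search(r"status=(500|502|503)", line)  # |~ regex
]

# count_over_time(... |= "error" [1m]) corresponds to counting the filtered lines.
print(len(contains_error), len(without_health), len(server_errors))  # 2 3 2
```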

12📨

Promtail — Log Collection

Ship Logs to Loki

Promtail runs on every server, reads log files, and sends them to Loki. Like a postman who picks up letters (logs) from every house (server) and delivers them to the post office (Loki).

YAML
# /etc/promtail/config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml  # Remember where we stopped reading

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: system
          __path__: /var/log/syslog

  - job_name: nginx
    static_configs:
      - targets: [localhost]
        labels:
          job: nginx
          __path__: /var/log/nginx/*.log

  - job_name: myapp
    static_configs:
      - targets: [localhost]
        labels:
          job: myapp
          __path__: /opt/myapp/logs/*.log
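What travels to the push URL in the clients section is a set of labeled streams, each carrying timestamped log lines. The Python sketch below builds the JSON form of that payload as accepted by Loki's /loki/api/v1/push endpoint. The function name and sample lines are hypothetical, and Promtail's own batching, retries, and compression are not modeled.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a JSON body for Loki's /loki/api/v1/push endpoint."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Each value is [timestamp in nanoseconds as a string, log line].
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})


body = build_push_payload(
    {"job": "nginx"},
    ["GET /api/users status=200", "GET /api/users status=500"],
    now_ns=1_700_000_000_000_000_000,
)
print(body)
```

The labels here play the same role as the labels block in the Promtail config: they are what you later select on with {job="nginx"}.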

13🏗️

Full Observability Stack

Docker Compose Setup

Run the complete monitoring stack with one command. This docker-compose.yml gives you Prometheus + Grafana + Alertmanager + Loki + Promtail — the entire observability platform.

DOCKER-COMPOSE
# docker-compose.yml: full monitoring stack
version: "3"
services:
  prometheus:
    image: prom/prometheus
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml

  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin123

  alertmanager:
    image: prom/alertmanager
    ports: ["9093:9093"]
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

  loki:
    image: grafana/loki
    ports: ["3100:3100"]

  promtail:
    image: grafana/promtail
    volumes:
      - /var/log:/var/log
      - ./promtail.yml:/etc/promtail/config.yml

  node-exporter:
    image: prom/node-exporter
    ports: ["9100:9100"]

# Start everything:
#   docker-compose up -d
# Grafana:    http://localhost:3000 (admin/admin123)
# Prometheus: http://localhost:9090

14🏆

Best Practices

Production Monitoring Patterns

✓Monitor the 4 Golden Signals: latency, traffic, errors, saturation

✓Set alerts on symptoms (high error rate) not causes (high CPU)

✓Use rate() over 5m for alerts, 1m for dashboards

✓Dashboard per team: platform dashboard, application dashboard, business dashboard

✓Retention: 15 days local Prometheus, long-term in Thanos/Cortex

✓Label cardinality: avoid high-cardinality labels (user IDs, request IDs)

✓Alert fatigue: only alert on actionable issues. If you ignore an alert, delete it

✓Loki for logs, Prometheus for metrics — do NOT put logs in Prometheus

✓Always monitor your monitoring: if Prometheus is down, who alerts you?
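The cardinality warning in the list above is easiest to appreciate with a quick calculation: the number of possible time series for one metric is bounded by the product of the distinct values of each label. The Python sketch below uses invented label counts and a hypothetical function name to show how one unbounded label changes the picture.

```python
from math import prod


def series_count(label_value_counts):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


bounded = {"method": 5, "status": 10, "instance": 20}
with_user_id = {**bounded, "user_id": 100_000}

print(series_count(bounded))       # 1000
print(series_count(with_user_id))  # 100000000
```

Going from a thousand series to a hundred million for a single metric is the kind of jump that exhausts Prometheus memory, which is why per-user or per-request identifiers belong in logs instead.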

15💼

Interview Questions

Monitoring & Observability Q&A

Prometheus vs CloudWatch?

Prometheus: open-source, pull-based, PromQL, self-hosted. CloudWatch: AWS-native, push-based, limited queries. Prometheus is more powerful; CloudWatch is zero-setup for AWS.

What is PromQL?

Query language for Prometheus metrics. Like SQL for time-series data. Key functions: rate(), sum(), count(), histogram_quantile(). Used in Grafana panels and alert rules.

Pull vs Push monitoring?

Pull (Prometheus): scrapes targets every 15s. Push (CloudWatch, Datadog): services send metrics to collector. Pull is better for service discovery; Push for short-lived jobs.

What are the 4 Golden Signals?

Latency (response time), Traffic (requests/sec), Errors (error rate), Saturation (resource usage). Monitor these four and you cover most user-facing issues.

Grafana vs Kibana?

Grafana: multi-source (Prometheus, Loki, CloudWatch), best for metrics dashboards. Kibana: Elasticsearch only, best for log analysis. A common setup is Grafana for dashboards, with Prometheus for metrics and Loki for logs.

Loki vs ELK Stack?

Loki: lightweight, indexes only labels (not full text), cheap storage. ELK (Elasticsearch): indexes everything, powerful search, expensive. Loki is typically the cheaper option when you mostly filter logs by labels.

What is Alertmanager?

Receives alerts from Prometheus, deduplicates, groups, and routes to Slack/PagerDuty/email. Supports silencing, inhibition, and routing by severity.

How to monitor K8s?

kube-state-metrics for pod/deploy/node status. Node Exporter for server metrics. cAdvisor for container metrics. PromQL queries for CrashLoopBackOff, Pending pods, OOMKilled.