DevOpsClicks
Complete Observability Stack

Monitoring Complete Guide

Prometheus, Grafana, Loki, Promtail: from installation to production dashboards, with real PromQL queries for CPU, RAM, and Kubernetes monitoring.

15 chapters · 30+ PromQL queries · 100% free
01

Introduction to Monitoring

Why You Need Observability

Monitoring answers one question: "Is my system healthy RIGHT NOW?" Without monitoring, you only find out something is broken when users complain. With monitoring, you detect and fix issues BEFORE users notice. The modern observability stack: Prometheus (metrics), Grafana (dashboards), Loki (logs), Promtail (log collection), Alertmanager (alerts).
Metrics
Numbers over time: CPU 73%, memory 4.2 GB, requests 1500/sec. Prometheus collects these.
Logs
Text output from apps: error messages, request logs. Loki + Promtail collect these.
Dashboards
Visual graphs and charts. Grafana turns raw metrics into beautiful, actionable dashboards.
Alerts
Automatic notifications when something goes wrong. Alertmanager sends to Slack, PagerDuty, email.
Tool | What It Does | Think of It As
Prometheus | Collects and stores metrics | The data collector: scrapes numbers from your servers every 15 seconds
Grafana | Visualizes metrics | The TV screen: shows beautiful graphs and dashboards
Alertmanager | Sends alerts | The alarm system: pages you when CPU > 90%
Loki | Stores logs | The filing cabinet: stores all your application logs
Promtail | Collects logs | The mail carrier: sends logs from servers to Loki
02

Prometheus

The Metrics Engine

Prometheus PULLS metrics from your services on a fixed schedule (every 15 seconds in this guide's configuration). Your service exposes a /metrics endpoint with numbers, Prometheus scrapes it, stores the samples in its time-series database, and you query them with PromQL. Think of it as a health inspector who visits every restaurant every 15 seconds and writes down temperature, hygiene score, and customer count.
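To make the /metrics endpoint less abstract, here is a minimal Python sketch of what a scrape target returns. It renders samples in the Prometheus text exposition format (one `name{labels} value` line per series). The function name `render_metrics` and the sample metrics are made up for illustration, and the sketch skips label escaping and the optional `# HELP` / `# TYPE` lines that real client libraries emit.

```python
def render_metrics(samples):
    """Render samples in the Prometheus text exposition format.

    `samples` is a list of (metric_name, labels_dict, value) tuples.
    """
    lines = []
    for name, labels, value in samples:
        if labels:
            # Labels are rendered as key="value" pairs inside braces.
            label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metrics for an imaginary web service.
body = render_metrics([
    ("http_requests_total", {"method": "GET", "status": "200"}, 1027),
    ("process_open_fds", {}, 12),
])
print(body)
```

In practice you would normally use an official client library rather than hand-rendering this text; the sketch only shows the shape of the data Prometheus scrapes.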
Installation
TERMINAL
# Docker (quickest)
docker run -d --name prometheus -p 9090:9090 \
  -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

# Access: http://your-server:9090

# Key exporters you need:
# Node Exporter      → CPU, RAM, disk metrics (install on every server)
# kube-state-metrics → Kubernetes pod/node/deploy metrics
# cAdvisor           → Container-level metrics
Key Exporters
Exporter | Metrics It Provides | Port
Node Exporter | CPU, RAM, disk, network for Linux servers | 9100
kube-state-metrics | Pod status, deployments, node conditions | 8080
cAdvisor | Container CPU, memory, network | 8080
MySQL Exporter | Queries, connections, replication | 9104
Nginx Exporter | Requests, connections, response codes | 9113
03

Prometheus Configuration

prometheus.yml Explained

YAML
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s      # How often to collect metrics
  evaluation_interval: 15s  # How often to evaluate alert rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "alert_rules.yml"  # Alert rules file

scrape_configs:
  # Prometheus monitors itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Monitor Linux servers via Node Exporter
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "web1:9100"
          - "web2:9100"
          - "db1:9100"
    # Prometheus visits each target every 15 seconds
    # and collects CPU, RAM, disk, network metrics

  # Monitor Kubernetes
  - job_name: "kube-state-metrics"
    static_configs:
      - targets: ["kube-state-metrics:8080"]
04

PromQL Basics

Query Language for Metrics

PromQL is how you ASK Prometheus questions about your metrics. Think of it as SQL for time-series data. "What is the current CPU usage?" "How many requests per second?" "Show me memory trend over 24 hours."
PromQL Concept | What It Does | Example
Instant Vector | Current value of a metric | node_cpu_seconds_total
Range Vector | Values over a time window | node_cpu_seconds_total[5m]
rate() | Per-second rate of change | rate(http_requests_total[5m])
sum() | Add values together | sum(rate(http_requests_total[5m]))
by (label) | Group results by label | sum(rate(...)) by (instance)
count() | Count number of series | count(node_cpu_seconds_total)
avg() | Average of values | avg(rate(node_cpu_seconds_total[5m]))
Common PromQL Patterns
PROMQL
# HTTP requests per second, one result per series
# (wrap in sum(...) for a single total)
rate(http_requests_total[5m])

# HTTP error rate: percentage of requests that returned 5xx
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Response latency, 95th percentile (a percentile, not an average)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Number of running pods
# (sum, not count: the metric is 1 for the pod's current phase and 0 otherwise)
sum(kube_pod_status_phase{phase="Running"})
rate() vs irate()

rate() calculates the average per-second rate over the entire range (smooth, good for alerts). irate() uses only the last two data points (spiky, good for graphs). Use rate() for alerts, irate() for dashboards.
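The difference is easier to see with numbers. The following Python sketch is a simplified model, not Prometheus's actual implementation: it ignores counter resets and the extrapolation Prometheus applies at window edges. The function names and sample values are invented for illustration.

```python
def simple_rate(samples):
    """Average per-second increase across the whole window."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)


# (timestamp_seconds, counter_value) pairs scraped every 15 seconds.
# Traffic is steady at 1 request/sec, then spikes in the last interval.
window = [(0, 100), (15, 115), (30, 130), (45, 145), (60, 460)]

print(simple_rate(window))   # 6.0  (spike averaged over the full minute)
print(simple_irate(window))  # 21.0 (reacts only to the final interval)
```

The averaged value moves slowly, which is why it suits alert thresholds; the instantaneous value jumps with every spike, which is why it suits zoomed-in graphs.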

05

CPU Monitoring Queries

Complete CPU PromQL with Explanations

These are the exact PromQL queries used in production to monitor CPU usage across all nodes. Each query is explained line by line so you understand what every part does.
1. Total CPU Cores Per Node
PROMQL
count(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}) by (instance)

# What this does:
# node_cpu_seconds_total → metric from Node Exporter
# {mode="idle"}          → filter: keep only the idle series (there is exactly one per core)
# {instance=~".*:9100"}  → regex match: all instances on port 9100
# count(...)             → count the matching series = number of cores
# by (instance)          → group by server
#
# Example output:
# {instance="web1:9100"} → 4 (web1 has 4 CPU cores)
# {instance="web2:9100"} → 8 (web2 has 8 CPU cores)
1a. Used CPU in Cores Per Node
PROMQL
sum(rate(node_cpu_seconds_total{mode!="idle", instance=~".*:9100"}[5m])) by (instance)

# What this does:
# mode!="idle"  → all CPU modes EXCEPT idle (user, system, iowait, etc.)
# rate(...[5m]) → per-second rate over last 5 minutes
# sum(...)      → add all non-idle CPU modes together
# by (instance) → per server
#
# Example: {instance="web1:9100"} → 2.3 (using 2.3 out of 4 cores)
1b. Available CPU in Cores Per Node
PROMQL
sum(rate(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}[5m])) by (instance)

# Same as above but mode="idle" → shows how much CPU is FREE
# Example: {instance="web1:9100"} → 1.7 (1.7 cores available)
1c. CPU Usage in Percentage Per Node
PROMQL
100 - (
  sum(rate(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}[5m])) by (instance)
  /
  count(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}) by (instance)
  * 100
)

# How it works step by step:
# 1. sum(rate(idle[5m])) → idle cores (e.g., 1.7)
# 2. count(idle)         → total cores (e.g., 4)
# 3. idle / total * 100  → idle percentage (42.5%)
# 4. 100 - idle%         → USED percentage (57.5%)
#
# Example: {instance="web1:9100"} → 57.5 (57.5% CPU used)
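The four steps above can be checked by hand. This small Python sketch repeats the same arithmetic with the example numbers from the comments (1.7 idle cores out of 4); the function name is illustrative and nothing here talks to Prometheus.

```python
def cpu_used_percent(idle_cores, total_cores):
    """Mirror the PromQL expression: 100 - (idle / total * 100)."""
    idle_percent = idle_cores / total_cores * 100
    return 100 - idle_percent


idle_cores = 1.7   # what sum(rate(...{mode="idle"}[5m])) would return
total_cores = 4    # what count(...{mode="idle"}) would return

used_percent = round(cpu_used_percent(idle_cores, total_cores), 1)
used_cores = round(total_cores - idle_cores, 1)
print(used_percent, used_cores)
```

The used-core figure (total minus idle) matches the 2.3 cores reported by the non-idle query in 1a, which is a handy consistency check between the two approaches.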
1d. Available CPU in Percentage Per Node
PROMQL
(
  sum(rate(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}[5m])) by (instance)
  /
  count(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}) by (instance)
  * 100
)

# Same calculation without the "100 -" → gives available percentage
# Example: {instance="web1:9100"} → 42.5 (42.5% CPU available)
06

RAM Monitoring Queries

Complete Memory PromQL with Explanations

2. Total RAM in GB Per Node
PROMQL
sum(node_memory_MemTotal_bytes{instance!~"default/node-exporter-.*"}) by (instance) / 1024 / 1024 / 1024

# node_memory_MemTotal_bytes → total RAM in bytes
# Divide by 1024 three times → bytes → KB → MB → GB
# Example: {instance="web1:9100"} → 15.6 (15.6 GB total RAM)
2a. Available RAM in GB Per Node
PROMQL
sum(node_memory_MemAvailable_bytes{instance!~"default/node-exporter-.*"}) by (instance) / 1024 / 1024 / 1024

# MemAvailable = memory that can be used without swapping
# Example: {instance="web1:9100"} → 8.2 (8.2 GB available)
2b. Used RAM in GB Per Node
PROMQL
(
  sum(node_memory_MemTotal_bytes{instance!~"default/node-exporter-.*"}) by (instance)
  -
  sum(node_memory_MemAvailable_bytes{instance!~"default/node-exporter-.*"}) by (instance)
) / 1024 / 1024 / 1024

# Total minus Available = Used
# Example: 15.6 - 8.2 = 7.4 GB used
2c. Available RAM Percentage Per Node
PROMQL
(
  sum(node_memory_MemAvailable_bytes{instance!~"default/node-exporter-.*"}) by (instance)
  /
  sum(node_memory_MemTotal_bytes{instance!~"default/node-exporter-.*"}) by (instance)
) * 100

# Available / Total * 100 = Available %
# Example: 8.2 / 15.6 * 100 = 52.6% available
2d. Used RAM Percentage Per Node
PROMQL
(
  1 - (
    sum(node_memory_MemAvailable_bytes{instance!~"default/node-exporter-.*"}) by (instance)
    /
    sum(node_memory_MemTotal_bytes{instance!~"default/node-exporter-.*"}) by (instance)
  )
) * 100

# 1 minus (Available/Total) = Used fraction → times 100 = Used %
# Example: (1 - 0.526) * 100 = 47.4% RAM used
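All five memory queries derive from the same two raw values, MemTotal and MemAvailable in bytes. The Python sketch below reproduces the arithmetic on round, made-up numbers (16 GiB total, 4 GiB available) so the relationships are easy to verify; the function and key names are illustrative only.

```python
GIB = 1024 ** 3  # dividing by 1024 three times is the same as dividing by 1024**3


def memory_summary(total_bytes, available_bytes):
    """Compute the same figures as queries 2 through 2d from raw byte counts."""
    used_bytes = total_bytes - available_bytes
    return {
        "total_gb": total_bytes / GIB,
        "available_gb": available_bytes / GIB,
        "used_gb": used_bytes / GIB,
        "available_pct": available_bytes / total_bytes * 100,
        "used_pct": (1 - available_bytes / total_bytes) * 100,
    }


# Hypothetical node: 16 GiB installed, 4 GiB currently available.
summary = memory_summary(total_bytes=16 * GIB, available_bytes=4 * GIB)
print(summary)
```

Note that used and available percentages always add up to 100, so a dashboard usually needs only one of them.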
07

Kubernetes Monitoring Queries

Pod, Node & Cluster Health

These queries monitor your Kubernetes cluster: pod status, crashes, scheduling issues, and node health. Essential for any production K8s environment.
Total Pods by Phase
PROMQL
sum by (phase) (kube_pod_status_phase)

# Shows count of pods in each phase:
# Running: 45
# Succeeded: 12
# Pending: 2
# Failed: 1
# Unknown: 0
CrashLoopBackOff Pods: CRITICAL ALERT
PROMQL
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0

# Shows pods stuck in CrashLoopBackOff
# This means the container keeps crashing and K8s keeps restarting it
# ALWAYS set an alert on this: it means something is seriously wrong
# Common causes: wrong config, missing secrets, OOM killed, bad image
Pending Pods: Scheduling Issues
PROMQL
kube_pod_status_phase{phase="Pending"} == 1

# Pods stuck in Pending = K8s cannot schedule them
# Common causes:
# - Not enough CPU/RAM on any node
# - Node selector/affinity mismatch
# - PVC not bound (storage not available)
# - Taints preventing scheduling
Node Health Status
PROMQL
kube_node_status_condition{condition="Ready", status="true"}

# Shows which nodes are Ready (healthy)
# Value 1 = node is Ready
# Value 0 = node is NOT Ready (serious problem!)
# Alert if any node shows 0: workloads may be evicted
More Useful K8s Queries
PROMQL
# Pods NOT running (failed, pending, unknown)
kube_pod_status_phase{phase!="Running", phase!="Succeeded"} == 1

# Container restart count (high restarts = unstable app)
sum(kube_pod_container_status_restarts_total) by (pod, namespace) > 5

# Deployment replicas not matching desired
kube_deployment_status_replicas_available != kube_deployment_spec_replicas

# OOMKilled containers (ran out of memory)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
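Queries like these can also be run outside the web UI through the Prometheus HTTP API (`GET /api/v1/query`). The Python sketch below uses only the standard library. The helper names, the base URL `http://prometheus:9090`, and the sample response are assumptions for illustration; `run_query` is defined but not called, since it needs a reachable server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    params = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{params}"


def parse_instant_vector(payload):
    """Turn an instant-vector response into (labels, float_value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    return [(item["metric"], float(item["value"][1]))
            for item in payload["data"]["result"]]


def run_query(base_url, promql):
    """Send the query to a live server (requires network access)."""
    with urlopen(build_query_url(base_url, promql), timeout=10) as response:
        return parse_instant_vector(json.load(response))


# Hand-written sample in the shape the API returns for an instant vector.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "api-7d9f", "namespace": "prod"},
             "value": [1700000000, "7"]},
        ],
    },
}
print(parse_instant_vector(sample))
```

Sample values arrive as strings in the JSON, which is why the parser converts them with `float()`.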
08

Grafana Setup

Beautiful Dashboards

Grafana connects to Prometheus (and Loki) and turns raw metrics into visual dashboards. It does NOT collect data itself; it only DISPLAYS data from other sources.
TERMINAL
# Install Grafana with Docker
docker run -d --name grafana -p 3000:3000 grafana/grafana

# Access: http://your-server:3000
# Default login: admin / admin (change immediately!)

# Add Prometheus as Data Source:
# 1. Settings → Data Sources → Add → Prometheus
# 2. URL: http://prometheus:9090
# 3. Click Save & Test
Import Pre-built Dashboards
✓ Go to Dashboards → Import → Enter Dashboard ID
✓ Dashboard 1860: Node Exporter Full (CPU, RAM, disk, network per server)
✓ Dashboard 315: Kubernetes Cluster Monitoring
✓ Dashboard 13770: Kubernetes Pod Monitoring
✓ Dashboard 3662: Prometheus 2.0 Overview
✓ These are FREE community dashboards, production-ready in 30 seconds
09

Building Custom Dashboards

Create Your Own Panels

Pre-built dashboards are great, but real DevOps engineers build CUSTOM dashboards tailored to their applications and SLAs.
✓ Create a new dashboard → Add Panel
✓ Select data source: Prometheus
✓ Enter PromQL query (from chapters 5-7)
✓ Choose visualization: Time Series, Gauge, Stat, Bar, Table
✓ Set thresholds: green < 60%, yellow < 80%, red >= 80%
✓ Add variables for dynamic filtering (namespace, instance, pod)
✓ Save and share with team
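The threshold step in the list above amounts to a simple mapping from a percentage to a color band. As a rough Python sketch of that rule (the function name is made up; Grafana applies this logic itself once the thresholds are configured on the panel):

```python
def threshold_color(percent):
    """Map a usage percentage to the color bands from the list above."""
    if percent < 60:
        return "green"
    if percent < 80:
        return "yellow"
    return "red"


for value in (42.5, 60, 79.9, 80, 95):
    print(value, threshold_color(value))
```

The boundaries matter: exactly 60% is already yellow and exactly 80% is already red, matching the `<` and `>=` comparisons in the list.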
10

Alertmanager

Get Notified Before Users Complain

Alertmanager receives alerts from Prometheus and routes them to Slack, PagerDuty, email, or Teams. You define alert RULES in Prometheus and ROUTING in Alertmanager.
Alert Rules (in Prometheus)
YAML
# /etc/prometheus/alert_rules.yml
groups:
  - name: critical_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (sum(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) / count(node_cpu_seconds_total{mode="idle"}) by (instance) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is above 85% for 5 minutes"

      - alert: PodCrashLooping
        expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is CrashLoopBackOff"
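The `for:` field is what keeps a brief spike from paging anyone: the expression must stay true for the whole duration before the alert fires. The Python sketch below is a simplified model of that behavior, assuming evenly spaced rule evaluations; the function name and the evaluation counts are illustrative, not Prometheus internals.

```python
def alert_states(condition_by_eval, for_evals):
    """Return the alert state after each rule evaluation.

    condition_by_eval: one boolean per evaluation (is the expr true?).
    for_evals: how many evaluation intervals the `for` duration spans.
    """
    states = []
    true_streak = 0
    for is_true in condition_by_eval:
        true_streak = true_streak + 1 if is_true else 0
        if true_streak == 0:
            states.append("inactive")
        elif true_streak > for_evals:
            # The condition has now held for the full `for` duration.
            states.append("firing")
        else:
            states.append("pending")
    return states


# A `for` duration spanning 2 evaluation intervals.
print(alert_states([True, True, True, False, True], for_evals=2))
```

A single false evaluation resets the streak, so the final `True` in the example starts over at "pending" rather than resuming "firing".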
Alertmanager Config (routing)
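YAML
# /etc/alertmanager/alertmanager.yml
route:
  receiver: slack-critical
  routes:
    - match:
        severity: critical
      receiver: slack-critical
    - match:
        severity: warning
      receiver: slack-warnings

receivers:
  - name: slack-critical
    slack_configs:
      - channel: "#alerts-critical"
        api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        title: "CRITICAL: {{ .GroupLabels.alertname }}"
        text: "{{ .CommonAnnotations.description }}"

  # The warning route references slack-warnings, so that receiver must be
  # defined or Alertmanager rejects the config. The channel name is a placeholder.
  - name: slack-warnings
    slack_configs:
      - channel: "#alerts-warning"
        api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        title: "WARNING: {{ .GroupLabels.alertname }}"
        text: "{{ .CommonAnnotations.description }}"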
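To see how a routing tree like this picks a receiver, here is a simplified Python model: child routes are checked in order, the first one whose labels all match wins, and anything unmatched falls back to the top-level receiver. It deliberately ignores grouping, `continue`, and regex matchers, and the function name is made up for illustration.

```python
def pick_receiver(route, alert_labels):
    """Return the receiver name for an alert; the first matching child wins."""
    for child in route.get("routes", []):
        matchers = child.get("match", {})
        if all(alert_labels.get(key) == value for key, value in matchers.items()):
            return pick_receiver(child, alert_labels)
    return route["receiver"]


# Same shape as the routing config above.
route = {
    "receiver": "slack-critical",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "slack-critical"},
        {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
    ],
}

print(pick_receiver(route, {"alertname": "HighCPUUsage", "severity": "warning"}))
print(pick_receiver(route, {"alertname": "PodCrashLooping", "severity": "critical"}))
print(pick_receiver(route, {"alertname": "Misc", "severity": "info"}))
```

The last call shows why the top-level receiver matters: alerts with an unexpected severity still land somewhere instead of being dropped.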
11

Loki: Log Aggregation

Centralized Logging

Loki is like Prometheus but for LOGS. Instead of metrics (numbers), Loki stores log lines (text). You query logs in Grafana using LogQL, which is similar to PromQL. Think of it as the "grep for your entire infrastructure": search all logs from all servers in one place.
TERMINAL
# Install Loki with Docker
docker run -d --name loki -p 3100:3100 grafana/loki

# Add Loki as Grafana Data Source:
# Settings → Data Sources → Add → Loki
# URL: http://loki:3100
LogQL: Query Language for Logs
LOGQL
# Show all logs from nginx
{job="nginx"}

# Filter logs containing "error"
{job="nginx"} |= "error"

# Filter logs NOT containing "health"
{job="nginx"} != "health"

# Regex filter
{job="nginx"} |~ "status=(500|502|503)"

# Count errors per minute
count_over_time({job="nginx"} |= "error" [1m])
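The line filters behave much like chained grep calls. This Python sketch imitates `|=`, `!=`, and `|~` on a handful of invented log lines so you can predict what each query returns. It is only an analogy: Loki evaluates filters server-side and uses Go's RE2 regex syntax, which differs from Python's `re` in some advanced features.

```python
import re


def apply_line_filters(lines, contains=None, excludes=None, pattern=None):
    """Keep lines that pass every filter that was supplied."""
    kept = []
    for line in lines:
        if contains is not None and contains not in line:
            continue  # mirrors |= "text"
        if excludes is not None and excludes in line:
            continue  # mirrors != "text"
        if pattern is not None and not re.search(pattern, line):
            continue  # mirrors |~ "regex"
        kept.append(line)
    return kept


# Hypothetical nginx log lines, for illustration only.
logs = [
    "GET /health status=200",
    "GET /api/users status=500 error: upstream timed out",
    "GET /api/orders status=200",
    "POST /api/pay status=502 error: bad gateway",
]

errors = apply_line_filters(logs, contains="error")
no_health = apply_line_filters(logs, excludes="health")
server_errors = apply_line_filters(logs, pattern=r"status=(500|502|503)")
print(len(errors), len(no_health), len(server_errors))  # 2 3 2
```

Taking `len()` of a filtered list is the local equivalent of `count_over_time` for a single window.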
12

Promtail: Log Collection

Ship Logs to Loki

Promtail runs on every server, reads log files, and sends them to Loki. Like a postman who picks up letters (logs) from every house (server) and delivers them to the post office (Loki).
YAML
# /etc/promtail/config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml  # Remember where we stopped reading

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: system
          __path__: /var/log/syslog

  - job_name: nginx
    static_configs:
      - targets: [localhost]
        labels:
          job: nginx
          __path__: /var/log/nginx/*.log

  - job_name: myapp
    static_configs:
      - targets: [localhost]
        labels:
          job: myapp
          __path__: /opt/myapp/logs/*.log
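The `clients.url` above points at Loki's push endpoint, which accepts JSON of the form `{"streams": [{"stream": {labels}, "values": [[timestamp_ns, line], ...]}]}`. The Python sketch below builds such a body to show what Promtail is doing on your behalf. The helper name is hypothetical, and real Promtail also batches, retries, and tracks file positions, none of which is modeled here.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a JSON-serializable body for POST /loki/api/v1/push."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Each entry is [unix-epoch nanoseconds as a string, log line].
    values = [[str(now_ns + offset), line] for offset, line in enumerate(lines)]
    return {"streams": [{"stream": labels, "values": values}]}


payload = build_push_payload(
    labels={"job": "nginx"},
    lines=["GET / status=200", "GET /api status=500"],
    now_ns=1_700_000_000_000_000_000,
)
print(json.dumps(payload, indent=2))
```

The `stream` labels are what you later select with `{job="nginx"}` in LogQL, which is why Promtail attaches a `job` label to every scrape config.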
13

Full Observability Stack

Docker Compose Setup

Run the complete monitoring stack with one command. This docker-compose.yml gives you Prometheus + Grafana + Alertmanager + Loki + Promtail, plus a Node Exporter to scrape: the entire observability platform.
DOCKER-COMPOSE
# docker-compose.yml: Full monitoring stack
version: "3"
services:
  prometheus:
    image: prom/prometheus
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml

  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin123  # demo value only; use a real secret outside a lab

  alertmanager:
    image: prom/alertmanager
    ports: ["9093:9093"]
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

  loki:
    image: grafana/loki
    ports: ["3100:3100"]

  promtail:
    image: grafana/promtail
    volumes:
      - /var/log:/var/log
      - ./promtail.yml:/etc/promtail/config.yml

  node-exporter:
    image: prom/node-exporter
    ports: ["9100:9100"]

# Start everything: docker-compose up -d
# Grafana: http://localhost:3000 (admin/admin123)
# Prometheus: http://localhost:9090
14

Best Practices

Production Monitoring Patterns

✓ Monitor the 4 Golden Signals: latency, traffic, errors, saturation
✓ Set alerts on symptoms (high error rate) not causes (high CPU)
✓ Use rate() over 5m for alerts, 1m for dashboards
✓ Dashboard per team: platform dashboard, application dashboard, business dashboard
✓ Retention: 15 days local Prometheus, long-term in Thanos/Cortex
✓ Label cardinality: avoid high-cardinality labels (user IDs, request IDs)
✓ Alert fatigue: only alert on actionable issues. If you ignore an alert, delete it
✓ Loki for logs, Prometheus for metrics: do NOT put logs in Prometheus
✓ Always monitor your monitoring: if Prometheus is down, who alerts you?
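The cardinality advice in the list above is easier to appreciate with arithmetic: in the worst case a metric produces one time series per combination of label values, so the counts multiply. A small Python sketch with made-up label counts:

```python
from math import prod


def worst_case_series(label_value_counts):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical label sets for a single metric such as http_requests_total.
safe = {"method": 5, "status": 10, "instance": 20}
risky = {**safe, "user_id": 100_000}

print(worst_case_series(safe))   # 1000
print(worst_case_series(risky))  # 100000000
```

Adding one unbounded label turns a thousand series into a hundred million in the worst case, which is why identifiers like user or request IDs belong in logs rather than metric labels.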
15

Interview Questions

Monitoring & Observability Q&A

Prometheus vs CloudWatch?
Prometheus: open-source, pull-based, PromQL, self-hosted. CloudWatch: AWS-native, push-based, with a less expressive query model. Prometheus gives you more flexible querying; CloudWatch needs almost no setup for AWS services.
What is PromQL?
Query language for Prometheus metrics. Like SQL for time-series data. Key functions: rate(), sum(), count(), histogram_quantile(). Used in Grafana panels and alert rules.
Pull vs Push monitoring?
Pull (Prometheus): the server scrapes targets on a schedule, for example every 15s. Push (CloudWatch, Datadog): services send metrics to a collector. Pull works well with service discovery; push suits short-lived jobs that may exit before they are scraped.
What are the 4 Golden Signals?
Latency (response time), Traffic (requests/sec), Errors (error rate), Saturation (resource usage). Monitoring these four covers most common production issues.
Grafana vs Kibana?
Grafana: multi-source (Prometheus, Loki, CloudWatch), best for metrics dashboards. Kibana: built for Elasticsearch, best for log analysis. Many teams use Grafana for metrics together with Loki for logs.
Loki vs ELK Stack?
Loki: lightweight, indexes only labels (not full text), cheap storage. ELK (Elasticsearch): indexes everything, powerful search, expensive. Loki is usually much cheaper to run because its index stays small.
What is Alertmanager?
Receives alerts from Prometheus, deduplicates, groups, and routes to Slack/PagerDuty/email. Supports silencing, inhibition, and routing by severity.
How to monitor K8s?
kube-state-metrics for pod/deploy/node status. Node Exporter for server metrics. cAdvisor for container metrics. PromQL queries for CrashLoopBackOff, Pending pods, OOMKilled.