Prometheus, Grafana, Loki, Promtail: from installation to production dashboards, with real PromQL queries for CPU, RAM, and Kubernetes monitoring.
15 Chapters | 30+ PromQL | 100% Free
01
Introduction to Monitoring
Why You Need Observability
Monitoring answers one question: "Is my system healthy RIGHT NOW?" Without monitoring, you only find out something is broken when users complain. With monitoring, you detect and fix issues BEFORE users notice. The modern observability stack: Prometheus (metrics), Grafana (dashboards), Loki (logs), Promtail (log collection), Alertmanager (alerts).
Metrics
Numbers over time: CPU 73%, memory 4.2 GB, requests 1500/sec. Prometheus collects these.
Logs
Text output from apps: error messages, request logs. Loki + Promtail collect these.
Dashboards
Visual graphs and charts. Grafana turns raw metrics into beautiful, actionable dashboards.
Alerts
Automatic notifications when something goes wrong. Alertmanager sends to Slack, PagerDuty, email.
Tool | What It Does | Think of It As
Prometheus | Collects and stores metrics | The data collector: scrapes numbers from your servers every 15 seconds
Grafana | Visualizes metrics | The TV screen: shows beautiful graphs and dashboards
Alertmanager | Sends alerts | The alarm system: pages you when CPU > 90%
Loki | Stores logs | The filing cabinet: stores all your application logs
Promtail | Collects logs | The mail carrier: sends logs from servers to Loki
02
Prometheus
The Metrics Engine
Prometheus PULLS metrics from your services every 15 seconds. Your service exposes a /metrics endpoint with numbers, Prometheus scrapes it, stores it in its time-series database, and you query it with PromQL. Think of it as a health inspector who visits every restaurant every 15 seconds and writes down temperature, hygiene score, and customer count.
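To make the /metrics endpoint less abstract, here is a minimal Python sketch of what a scrape has to read. The sample payload and the `parse_metrics` helper are illustrative assumptions, not Prometheus code, and the parser is deliberately simplified: it ignores optional timestamps and would break on label values that contain commas or spaces.

```python
def parse_metrics(text):
    """Parse simple Prometheus text-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        series, _, value = line.rpartition(" ")
        name, labels = series, {}
        if "{" in series:
            name, _, rest = series.partition("{")
            for pair in rest.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload, in the shape a service might expose on /metrics
EXAMPLE = """
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",status="200"} 1027
http_requests_total{method="post",status="500"} 3
process_resident_memory_bytes 4.2e+07
"""

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels, value)
```

Each scrape stores one timestamped sample per series; PromQL then works on those stored samples.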
Installation
TERMINAL
# Docker (quickest)
docker run -d --name prometheus -p 9090:9090 \
  -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
# Access: http://your-server:9090
# Key exporters you need:
# Node Exporter: CPU, RAM, disk metrics (install on every server)
# kube-state-metrics: Kubernetes pod/node/deployment metrics
# cAdvisor: container-level metrics
Key Exporters
Exporter | Metrics It Provides | Port
Node Exporter | CPU, RAM, disk, network for Linux servers | 9100
kube-state-metrics | Pod status, deployments, node conditions | 8080
cAdvisor | Container CPU, memory, network | 8080
MySQL Exporter | Queries, connections, replication | 9104
Nginx Exporter | Requests, connections, response codes | 9113
03
Prometheus Configuration
prometheus.yml Explained
YAML
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s      # How often to collect metrics
  evaluation_interval: 15s  # How often to evaluate alert rules
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
rule_files:
  - "alert_rules.yml"  # Alert rules file
scrape_configs:
  # Prometheus monitors itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  # Monitor Linux servers via Node Exporter
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "web1:9100"
          - "web2:9100"
          - "db1:9100"
    # Prometheus visits each target every 15 seconds
    # and collects CPU, RAM, disk, network metrics
  # Monitor Kubernetes
  - job_name: "kube-state-metrics"
    static_configs:
      - targets: ["kube-state-metrics:8080"]
04
PromQL Basics
Query Language for Metrics
PromQL is how you ASK Prometheus questions about your metrics. Think of it as SQL for time-series data. "What is the current CPU usage?" "How many requests per second?" "Show me memory trend over 24 hours."
PromQL Concept | What It Does | Example
Instant Vector | Current value of a metric | node_cpu_seconds_total
Range Vector | Values over a time window | node_cpu_seconds_total[5m]
rate() | Per-second rate of change | rate(http_requests_total[5m])
sum() | Add values together | sum(rate(http_requests_total[5m]))
by (label) | Group results by label | sum(rate(...)) by (instance)
count() | Count number of series | count(node_cpu_seconds_total)
avg() | Average of values | avg(rate(node_cpu_seconds_total[5m]))
Common PromQL Patterns
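PROMQL
# HTTP requests per second, per series (wrap in sum() for the total)
rate(http_requests_total[5m])
# HTTP error rate: percentage of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100
# 95th percentile response latency (aggregate buckets by le before taking the quantile)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Number of running pods
# (the metric is 1 for a pod's current phase and 0 otherwise, so use sum, not count)
sum(kube_pod_status_phase{phase="Running"})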
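It can help to see roughly how a quantile is estimated from histogram buckets. The Python sketch below mirrors the linear-interpolation idea behind histogram_quantile() under simplifying assumptions: the bucket values are made up, buckets are cumulative and end with a +Inf bucket, and the histogram holds at least one observation. It is an illustration, not the Prometheus implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets [(upper_bound, cumulative_count), ...]."""
    buckets = sorted(buckets)
    total = buckets[-1][1]          # the +Inf bucket counts every observation
    rank = q * total                # position of the requested quantile
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break                   # first bucket that contains the rank
    if math.isinf(upper):
        return buckets[-2][0]       # fall back to the highest finite bound
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Assume observations are spread evenly inside the bucket
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Hypothetical latency buckets in seconds: le=0.1, 0.5, 1.0, +Inf
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, BUCKETS))
```

With these buckets the 95th percentile lands inside the 0.5 to 1.0 second bucket, at about 0.81 seconds. The result is only as precise as the bucket boundaries, which is why bucket layout matters for latency alerts.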
rate() vs irate()
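rate() calculates the average per-second rate over the entire range, so the result is smooth and well suited to alerts and slow-moving counters. irate() uses only the last two data points in the range, so it reacts quickly but is spiky. Use rate() for alerts and most panels; reserve irate() for graphing volatile, fast-moving counters.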
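The difference is easy to see with a few numbers. This Python sketch uses made-up (timestamp, counter value) samples and two simplified helpers; unlike the real functions, they ignore counter resets and range-boundary extrapolation.

```python
def simple_rate(samples):
    """Average per-second increase between the first and last sample in the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter scraped every 15 seconds, with a burst in the last interval
window = [(0, 100), (15, 115), (30, 130), (45, 145), (60, 400)]

print(simple_rate(window))   # 5.0  (burst averaged over the whole minute)
print(simple_irate(window))  # 17.0 (only the final, bursty interval)
```

The same burst appears as 5 req/s to rate() and 17 req/s to irate(), which is why irate() makes for jumpy alerts.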
05
CPU Monitoring Queries
Complete CPU PromQL with Explanations
These are the exact PromQL queries used in production to monitor CPU usage across all nodes. Each query is explained line by line so you understand what every part does.
1. Total CPU Cores Per Node
PROMQL
count(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}) by (instance)
# What this does:
# node_cpu_seconds_total: metric from Node Exporter
# {mode="idle"}: keep only the idle mode (there is exactly one idle series per core)
# {instance=~".*:9100"}: regex match, all instances on port 9100
# count(...): count the matching series = number of cores
# by (instance): group by server
#
# Example output:
# {instance="web1:9100"} -> 4 (web1 has 4 CPU cores)
# {instance="web2:9100"} -> 8 (web2 has 8 CPU cores)
1a. Used CPU in Cores Per Node
PROMQL
sum(rate(node_cpu_seconds_total{mode!="idle", instance=~".*:9100"}[5m])) by (instance)
# What this does:
# mode!="idle": all CPU modes EXCEPT idle (user, system, iowait, etc.)
# rate(...[5m]): per-second rate over the last 5 minutes
# sum(...): add all non-idle CPU modes together
# by (instance): per server
#
# Example: {instance="web1:9100"} -> 2.3 (using 2.3 out of 4 cores)
1b. Available CPU in Cores Per Node
PROMQL
sum(rate(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}[5m])) by (instance)
# Same as above but with mode="idle": shows how much CPU is FREE
# Example: {instance="web1:9100"} -> 1.7 (1.7 cores available)
1c. CPU Usage in Percentage Per Node
PROMQL
100 - (
  sum(rate(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}[5m])) by (instance)
  /
  count(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}) by (instance)
  * 100
)
# How it works step by step:
# 1. sum(rate(idle[5m])) -> idle cores (e.g., 1.7)
# 2. count(idle) -> total cores (e.g., 4)
# 3. idle / total * 100 -> idle percentage (42.5%)
# 4. 100 - idle% -> USED percentage (57.5%)
#
# Example: {instance="web1:9100"} -> 57.5 (57.5% CPU used)
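The four steps can be checked with plain arithmetic. This Python sketch assumes hypothetical per-core idle rates for web1 (each is the fraction of a second the core spent idle, per second) and repeats the query's calculation.

```python
def cpu_used_percent(idle_rate_per_core):
    """Mirror the query: 100 - (sum of idle rates / number of cores * 100)."""
    total_cores = len(idle_rate_per_core)  # count(...) by (instance)
    idle_cores = sum(idle_rate_per_core)   # sum(rate(...)) by (instance)
    return 100 - (idle_cores / total_cores * 100)


# Hypothetical values of rate(node_cpu_seconds_total{mode="idle"}[5m]) for 4 cores
web1 = [0.5, 0.4, 0.3, 0.5]  # 1.7 idle cores in total

print(round(cpu_used_percent(web1), 1))  # 57.5
```

A fully idle 4-core host would give idle rates near 1.0 on every core and therefore about 0% used.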
1d. Available CPU in Percentage Per Node
PROMQL
(
  sum(rate(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}[5m])) by (instance)
  /
  count(node_cpu_seconds_total{mode="idle", instance=~".*:9100"}) by (instance)
  * 100
)
# Same calculation without the "100 -": gives the available percentage
# Example: {instance="web1:9100"} -> 42.5 (42.5% CPU available)
06
RAM Monitoring Queries
Complete Memory PromQL with Explanations
2. Total RAM in GB Per Node
PROMQL
sum(node_memory_MemTotal_bytes{instance!~"default/node-exporter-.*"}) by (instance)
  / 1024 / 1024 / 1024
# node_memory_MemTotal_bytes: total RAM in bytes
# Divide by 1024 three times: bytes -> KB -> MB -> GB (strictly GiB)
# Example: {instance="web1:9100"} -> 15.6 (15.6 GB total RAM)
2a. Available RAM in GB Per Node
PROMQL
sum(node_memory_MemAvailable_bytes{instance!~"default/node-exporter-.*"}) by (instance)
  / 1024 / 1024 / 1024
# MemAvailable = memory that can be used without swapping
# Example: {instance="web1:9100"} -> 8.2 (8.2 GB available)
2b. Used RAM in GB Per Node
PROMQL
(
  sum(node_memory_MemTotal_bytes{instance!~"default/node-exporter-.*"}) by (instance)
  -
  sum(node_memory_MemAvailable_bytes{instance!~"default/node-exporter-.*"}) by (instance)
) / 1024 / 1024 / 1024
# Total minus Available = Used
# Example: 15.6 - 8.2 = 7.4 GB used
2c. Available RAM Percentage Per Node
PROMQL
(
  sum(node_memory_MemAvailable_bytes{instance!~"default/node-exporter-.*"}) by (instance)
  /
  sum(node_memory_MemTotal_bytes{instance!~"default/node-exporter-.*"}) by (instance)
) * 100
# Available / Total * 100 = Available %
# Example: 8.2 / 15.6 * 100 = 52.6% available
2d. Used RAM Percentage Per Node
PROMQL
(
  1 - (
    sum(node_memory_MemAvailable_bytes{instance!~"default/node-exporter-.*"}) by (instance)
    /
    sum(node_memory_MemTotal_bytes{instance!~"default/node-exporter-.*"}) by (instance)
  )
) * 100
# 1 minus (Available/Total) = Used fraction; times 100 = Used %
# Example: (1 - 0.526) * 100 = 47.4% RAM used
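All five memory queries derive from the same two gauges. This Python sketch reproduces the arithmetic using the example figures above; the `ram_summary` helper and its key names are illustrative only.

```python
GIB = 1024 ** 3  # the queries divide by 1024 three times


def ram_summary(mem_total_bytes, mem_available_bytes):
    """Derive the 2 / 2a / 2b / 2c / 2d values from MemTotal and MemAvailable."""
    available_fraction = mem_available_bytes / mem_total_bytes
    return {
        "total_gb": mem_total_bytes / GIB,
        "available_gb": mem_available_bytes / GIB,
        "used_gb": (mem_total_bytes - mem_available_bytes) / GIB,
        "available_pct": available_fraction * 100,
        "used_pct": (1 - available_fraction) * 100,
    }


# Example host from this chapter: 15.6 GB total, 8.2 GB available
summary = ram_summary(15.6 * GIB, 8.2 * GIB)
for key, value in summary.items():
    print(key, round(value, 1))
```

The used and available percentages always add up to 100, so a dashboard normally needs only one of them.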
07
Kubernetes Monitoring Queries
Pod, Node & Cluster Health
These queries monitor your Kubernetes cluster: pod status, crashes, scheduling issues, and node health. Essential for any production K8s environment.
Total Pods by Phase
PROMQL
sum by (phase) (kube_pod_status_phase)
# Shows the count of pods in each phase:
# Running: 45
# Succeeded: 12
# Pending: 2
# Failed: 1
# Unknown: 0
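The reason sum works here is that kube_pod_status_phase exposes one series per pod and phase, with value 1 for the pod's current phase and 0 for the others. The Python sketch below imitates `sum by (phase)` on a few hypothetical samples; the pod names are made up.

```python
from collections import defaultdict

# Hypothetical kube_pod_status_phase samples: (labels, value)
samples = [
    ({"pod": "api-1", "phase": "Running"}, 1),
    ({"pod": "api-1", "phase": "Pending"}, 0),
    ({"pod": "job-1", "phase": "Running"}, 0),
    ({"pod": "job-1", "phase": "Pending"}, 1),
]


def sum_by(samples, label):
    """Group samples by one label and add up their values, like sum by (label)."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)


def count_by(samples, label):
    """Group samples by one label and count the series, like count by (label)."""
    counts = defaultdict(int)
    for labels, _ in samples:
        counts[labels[label]] += 1
    return dict(counts)


print(sum_by(samples, "phase"))    # one pod Running, one Pending
print(count_by(samples, "phase"))  # counts every series, including the zeros
```

Counting series reports two pods in every phase, which is why sum rather than count is the aggregation to use on 0/1 metrics like this one.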
CrashLoopBackOff Pods: CRITICAL ALERT
PROMQL
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
# Shows pods stuck in CrashLoopBackOff
# This means the container keeps crashing and K8s keeps restarting it
# ALWAYS set an alert on this: it means something is seriously wrong
# Common causes: wrong config, missing secrets, OOM killed, bad image
Pending Pods: Scheduling Issues
PROMQL
kube_pod_status_phase{phase="Pending"} == 1
# Pods stuck in Pending = K8s cannot schedule them
# Common causes:
# - Not enough CPU/RAM on any node
# - Node selector/affinity mismatch
# - PVC not bound (storage not available)
# - Taints preventing scheduling
Node Health Status
PROMQL
kube_node_status_condition{condition="Ready", status="true"}
# Shows which nodes are Ready (healthy)
# Value 1 = node is Ready
# Value 0 = node is NOT Ready (serious problem!)
# Alert if any node shows 0: workloads may be evicted
More Useful K8s Queries
PROMQL
# Pods NOT running (failed, pending, unknown)
kube_pod_status_phase{phase!="Running", phase!="Succeeded"} == 1
# Container restart count (high restarts = unstable app)
# This is the lifetime total; wrap the metric in increase(...[1h]) to catch recent restarts only
sum(kube_pod_container_status_restarts_total) by (pod, namespace) > 5
# Deployment replicas not matching desired
kube_deployment_status_replicas_available != kube_deployment_spec_replicas
# OOMKilled containers (ran out of memory)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
08
Grafana Setup
Beautiful Dashboards
Grafana connects to Prometheus (and Loki) and turns raw metrics into visual dashboards. It does NOT collect data itself; it only DISPLAYS data from other sources.
TERMINAL
# Install Grafana with Docker
docker run -d --name grafana -p 3000:3000 grafana/grafana
# Access: http://your-server:3000
# Default login: admin / admin (change immediately!)
# Add Prometheus as Data Source:
# 1. Settings -> Data Sources -> Add -> Prometheus
# 2. URL: http://prometheus:9090
# 3. Click Save & Test
Import Pre-built Dashboards
- Go to Dashboards -> Import -> Enter Dashboard ID
- Dashboard 1860: Node Exporter Full (CPU, RAM, disk, network per server)
- These are FREE community dashboards, usable within seconds of import
09
Building Custom Dashboards
Create Your Own Panels
Pre-built dashboards are great, but real DevOps engineers build CUSTOM dashboards tailored to their applications and SLAs.
- Create a new dashboard -> Add Panel
- Select data source: Prometheus
- Enter PromQL query (from chapters 5-7)
- Choose visualization: Time Series, Gauge, Stat, Bar, Table
- Set thresholds: green < 60%, yellow < 80%, red >= 80%
- Add variables for dynamic filtering (namespace, instance, pod)
- Save and share with team
10
Alertmanager
Get Notified Before Users Complain
Alertmanager receives alerts from Prometheus and routes them to Slack, PagerDuty, email, or Teams. You define alert RULES in Prometheus and ROUTING in Alertmanager.
Alert Rules (in Prometheus)
YAML
# /etc/prometheus/alert_rules.yml
groups:
  - name: critical_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (sum(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) / count(node_cpu_seconds_total{mode="idle"}) by (instance) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is above 85% for 5 minutes"
      - alert: PodCrashLooping
        expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is CrashLoopBackOff"
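The `for:` field is what keeps a short spike from paging anyone: the expression must stay true for the whole duration before the alert fires. The Python sketch below is a simplified model of that pending-to-firing behaviour; the evaluation timestamps and the `alert_states` helper are illustrative assumptions, not Prometheus internals.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    evaluations: list of (timestamp_seconds, expr_is_true) pairs in time order.
    """
    states = []
    active_since = None
    for ts, is_true in evaluations:
        if not is_true:
            active_since = None          # condition cleared: the timer resets
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts            # first evaluation where the condition holds
        held_long_enough = ts - active_since >= for_seconds
        states.append("firing" if held_long_enough else "pending")
    return states


# Hypothetical evaluations one minute apart, with a rule using for: 2m
evals = [(0, True), (60, True), (120, True), (180, False), (240, True)]
print(alert_states(evals, for_seconds=120))
# ['pending', 'pending', 'firing', 'inactive', 'pending']
```

Only alerts that reach the firing state are sent on to Alertmanager for routing.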
11
Loki
Loki is like Prometheus but for LOGS. Instead of metrics (numbers), Loki stores log lines (text). You query logs in Grafana using LogQL, which is similar to PromQL. Think of it as the "grep for your entire infrastructure": search all logs from all servers in one place.
TERMINAL
# Install Loki with Docker
docker run -d --name loki -p 3100:3100 grafana/loki
# Add Loki as Grafana Data Source:
# Settings -> Data Sources -> Add -> Loki
# URL: http://loki:3100
LogQL: Query Language for Logs
LOGQL
# Show all logs from nginx
{job="nginx"}
# Filter logs containing "error"
{job="nginx"} |= "error"
# Filter logs NOT containing "health"
{job="nginx"} != "health"
# Regex filter
{job="nginx"} |~ "status=(500|502|503)"
# Count errors per minute
count_over_time({job="nginx"} |= "error" [1m])
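For readers new to LogQL, the line filters behave much like chained grep calls. The Python sketch below imitates `|=`, `!=`, `|~` and `count_over_time` on a handful of made-up nginx log lines; it is an analogy for the semantics, not how Loki executes queries, and the exact window boundaries are an assumption.

```python
import re

# Hypothetical (timestamp_seconds, line) entries for {job="nginx"}
LOGS = [
    (0, 'GET /health status=200'),
    (10, 'GET /api status=500 error="upstream timeout"'),
    (70, 'GET /api status=502 error="bad gateway"'),
    (80, 'GET /api status=200'),
]


def contains(entries, text):       # {job="nginx"} |= "text"
    return [(ts, line) for ts, line in entries if text in line]


def not_contains(entries, text):   # {job="nginx"} != "text"
    return [(ts, line) for ts, line in entries if text not in line]


def matches(entries, pattern):     # {job="nginx"} |~ "pattern"
    rx = re.compile(pattern)
    return [(ts, line) for ts, line in entries if rx.search(line)]


def count_over_time(entries, now, window_seconds):
    """Count the entries whose timestamp falls in the window ending at `now`."""
    return sum(1 for ts, _ in entries if now - window_seconds < ts <= now)


errors = contains(LOGS, "error")
print(len(errors))                                          # 2
print(len(not_contains(LOGS, "health")))                    # 3
print(len(matches(LOGS, "status=(500|502|503)")))           # 2
print(count_over_time(errors, now=90, window_seconds=60))   # 1
```

The label selector in braces picks the log streams first; the filters then run over the lines in those streams, which is why narrow label selectors keep queries fast.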
12
Promtail: Log Collection
Ship Logs to Loki
Promtail runs on every server, reads log files, and sends them to Loki. Like a postman who picks up letters (logs) from every house (server) and delivers them to the post office (Loki).
Run the complete monitoring stack with one command. This docker-compose.yml gives you Prometheus + Grafana + Alertmanager + Loki + Promtail: the entire observability platform.
- Alert fatigue: only alert on actionable issues. If you ignore an alert, delete it
- Loki for logs, Prometheus for metrics: do NOT put logs in Prometheus
- Always monitor your monitoring: if Prometheus is down, who alerts you?
15
Interview Questions
Monitoring & Observability Q&A
Prometheus vs CloudWatch?
Prometheus: open-source, pull-based, queried with PromQL, self-hosted. CloudWatch: AWS-native, push-based, with a less expressive query model. Prometheus offers more flexible querying; CloudWatch needs almost no setup for AWS services.
What is PromQL?
Query language for Prometheus metrics. Like SQL for time-series data. Key functions: rate(), sum(), count(), histogram_quantile(). Used in Grafana panels and alert rules.
Pull vs Push monitoring?
Pull (Prometheus): the server scrapes targets on a fixed interval (15s in this guide). Push (CloudWatch, Datadog): services send metrics to a collector. Pull works well with service discovery; push suits short-lived jobs that may exit before a scrape.
What are the 4 Golden Signals?
Latency (response time), Traffic (requests/sec), Errors (error rate), Saturation (resource usage). Monitoring these four covers most common failure modes.
Grafana vs Kibana?
Grafana: multi-source (Prometheus, Loki, CloudWatch), best for metrics dashboards. Kibana: Elasticsearch only, best for log analysis. Many teams use Grafana for metrics and pair it with Loki for logs.
Loki vs ELK Stack?
Loki: lightweight, indexes only labels (not full text), cheap storage. ELK (Elasticsearch): indexes everything, powerful search, expensive. Loki is usually much cheaper to run when label-based log search is enough.
What is Alertmanager?
Receives alerts from Prometheus, deduplicates, groups, and routes to Slack/PagerDuty/email. Supports silencing, inhibition, and routing by severity.
How to monitor K8s?
kube-state-metrics for pod/deploy/node status. Node Exporter for server metrics. cAdvisor for container metrics. PromQL queries for CrashLoopBackOff, Pending pods, OOMKilled.