Production-Grade Monitoring & Alerts

In modern DevOps, observability plays a critical role in keeping applications healthy, reliable, and scalable. Monitoring and alerting systems help detect issues early and allow teams to respond before they impact users.

What is Monitoring in DevOps?

Monitoring in DevOps means continuously collecting and analyzing system metrics to understand application performance and health in real time. Commonly tracked metrics include:

CPU usage
Memory usage
Request rate
Error rates

Goal: Know what is happening inside your system in real time and detect issues quickly.

Metrics vs Logs vs Traces

Understanding metrics, logs, and traces is essential for effective monitoring and observability in DevOps.

Metrics

Metrics are numerical values that show the performance and health of your system over time.

Numeric data (CPU, memory, request count)
Used for monitoring dashboards and alerting

Example:

CPU usage = 75%

Requests per second = 120

Logs

Logs are detailed records of events that happen inside your application or system.

Capture errors, warnings, and system activities
Mainly used for debugging issues

Example:

“User login failed due to invalid password”

“Database connection timeout”

Traces

Traces track the complete journey of a request across multiple services.

Show how a request flows through different components
Very important in microservices architecture

Example:

User Request → API Gateway → Auth Service → Order Service → Database

Metrics vs Logs vs Traces

Feature	Metrics	Logs	Traces
Type of Data	Numeric data (CPU, memory, requests)	Event-based records	Request flow data
Purpose	Monitoring & alerting	Debugging issues	Tracking request journey
Usage	Dashboards, alerts	Error analysis	Microservices tracing
Example	CPU = 75%	“User login failed”	Request → API → Service → DB

Monitoring & Observability Tools

Modern DevOps systems rely on observability tools to understand what is happening inside applications running in production. These tools help you detect issues, analyze performance, and improve system reliability.

Prometheus

Prometheus is one of the most widely used monitoring systems for collecting and storing time-series metrics.

It scrapes metrics from applications and services (pull model).
Stores data as time series: each series is identified by a metric name and labels, and holds timestamped values.
Uses PromQL (Prometheus Query Language) to analyze data.

Example Use Case

You can monitor:

CPU usage of a pod
Memory consumption
Request rate per second

Example metric:

http_requests_total

Example PromQL query:

rate(http_requests_total[5m])

Grafana

Grafana is a powerful visualization tool used to create real-time dashboards from monitoring data.

Connects to Prometheus, Elasticsearch, and other data sources.
Converts raw metrics into charts, graphs, and dashboards.
Provides real-time visualization.

Example Use Case

A DevOps dashboard showing:

CPU usage over time
Memory usage trends
API request traffic
Error rates

This helps teams quickly detect system issues visually.

ELK Stack

The ELK Stack (Elasticsearch, Logstash, Kibana) is used for collecting, processing, and analyzing logs from applications and infrastructure.

Components of ELK Stack

Elasticsearch

Stores and indexes logs
Fast search engine for large datasets

Logstash

Collects logs from different sources
Processes and transforms log data

Kibana

Visualizes logs in dashboards
Helps in searching and filtering logs

Filebeat / Fluentd (Log Shippers)

Lightweight agents installed on servers or containers
Collect application logs and forward them
Reduce load on Logstash
Improve performance and scalability

How ELK Works

Application Logs → Fluentd/Filebeat → Logstash → Elasticsearch → Kibana

Example Use Case

You can analyze:

Application errors (500 status codes)
Security logs (failed login attempts)
System crashes and exceptions

Example log:

ERROR 500: Internal Server Error at /api/user

Setting Up Prometheus (Basic Example)

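Prometheus is commonly used to collect metrics from Kubernetes workloads. The minimal ConfigMap below defines a single scrape job whose target, localhost:9090, is Prometheus's own metrics endpoint; cluster workloads are added as further scrape jobs.

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s   # how often targets are scraped
    scrape_configs:
      # Prometheus scraping its own /metrics endpoint
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']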

Deploy Prometheus (simplified):

kubectl apply -f prometheus.yaml
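
A static target only covers a fixed address. To scrape pods as they come and go, Prometheus offers Kubernetes service discovery. The following is a minimal sketch, not a complete setup: it assumes pods opt in through a `prometheus.io/scrape: "true"` annotation (a common convention, not something Kubernetes enforces) and that the Prometheus service account has RBAC permission to list pods. The job name `kubernetes-pods` is illustrative.

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'          # illustrative job name
    kubernetes_sd_configs:
      - role: pod                        # discover every pod in the cluster
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true" (assumed convention)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Copy namespace and pod name onto every scraped series
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

This block would sit under `prometheus.yml` in the ConfigMap, next to any static jobs.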

Creating Dashboards in Grafana

Grafana is used to visualize metrics collected by Prometheus in the form of dashboards.

Steps to Create Dashboard

Connect Prometheus as Data Source
- Add Prometheus URL in Grafana settings
Create a Dashboard
- Click “Create Dashboard”
- Add a new panel
Add Panels
- Select metrics and display them as graphs

Example metrics:

CPU usage
Memory consumption
Pod restarts
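
The data source step can also be done declaratively instead of through the UI. A minimal sketch of a Grafana provisioning file follows. It assumes Prometheus is reachable inside the cluster at `http://prometheus:9090` (a hypothetical service name) and that the file is placed in Grafana's `provisioning/datasources` directory.

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend proxies the queries
    url: http://prometheus:9090    # assumed in-cluster address
    isDefault: true
```

Keeping this file in version control makes the dashboard setup reproducible across environments.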

Logging Setup Using ELK Stack

ELK Stack is used to collect, store, and analyze logs from applications in Kubernetes.

Log Flow

Application Logs → Fluentd/Filebeat → Logstash → Elasticsearch → Kibana

Kubernetes Logging Agents

In Kubernetes, logs are usually collected using:

Fluentd
Filebeat

These agents send logs to the ELK pipeline.

What Each Component Does

Logstash → Processes logs
Elasticsearch → Stores and indexes logs
Kibana → Visualizes logs in dashboards
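
To illustrate the shipper side of this flow, here is a rough sketch of a Filebeat configuration that reads container logs on a node and forwards them to Logstash. Several things are assumptions: the `logstash:5044` address is a hypothetical service name (5044 is the conventional Beats port), and the `container` input type is the one documented for Filebeat 7.x. Newer releases steer toward the `filestream` input, so check the input block against the version you run.

```yaml
filebeat.inputs:
  - type: container                  # Filebeat 7.x input; verify for your version
    paths:
      - /var/log/containers/*.log    # where the kubelet exposes container logs

output.logstash:
  hosts: ["logstash:5044"]           # assumed Logstash address
```

Filebeat is typically run as a DaemonSet so that one agent runs on every node.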

Alerting Basics

Monitoring without alerting is of limited use in production: a problem stays unnoticed until someone happens to look at a dashboard.

Common Alerts

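CPU Alert

- alert: HighCPUUsage
  # Percentage of non-idle CPU time per instance (node_exporter metric)
  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 1m

Memory Alert

- alert: HighMemoryUsage
  # Percentage of memory in use (node_exporter metrics)
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
  for: 1m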

Pod Failure Alert

- alert: PodDown
  expr: kube_pod_status_phase{phase="Failed"} > 0
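
The rule fragments above have to live inside a rule group in a rules file before Prometheus can load them. A minimal sketch of that wrapper, using the pod failure alert, is shown here. The group name, the `severity` label, and the summary text are illustrative choices, and the `for: 5m` duration is an assumption added to avoid firing on short-lived failures.

```yaml
groups:
  - name: example-alerts               # illustrative group name
    rules:
      - alert: PodDown
        expr: kube_pod_status_phase{phase="Failed"} > 0
        for: 5m                        # condition must hold this long before firing
        labels:
          severity: critical           # assumed label, useful for routing
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is in the Failed phase"
```

Labels such as `severity` travel with the alert, so Alertmanager can route on them.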

Alertmanager Configuration

Alertmanager receives alerts from Prometheus, then groups, deduplicates, and routes them to receivers such as email or Slack.

route:
  receiver: email-alert

receivers:
- name: email-alert
  email_configs:
  - to: your-email@example.com
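
Alertmanager only receives alerts if Prometheus is told where it is and which rule files to evaluate. The following sketch shows the relevant part of `prometheus.yml`. The rules path and the `alertmanager:9093` address are assumptions (9093 is Alertmanager's default port).

```yaml
rule_files:
  - /etc/prometheus/rules/*.yml        # assumed location of the alert rule files

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # assumed Alertmanager address
```

Note that email delivery also needs SMTP settings (for example a smarthost and a sender address) in Alertmanager's own configuration.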

Integrating Alerts

Email Alerts

Used to send notifications directly to your inbox
Helps in quick incident awareness

Slack Alerts

Alertmanager can also send alerts to Slack channels.

receivers:
- name: slack-alert
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/XXX'
    channel: '#alerts'
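
The email and Slack receivers can be combined in one routing tree. The sketch below sends alerts labelled `severity="critical"` to Slack and everything else to email. The `severity` label is an assumption about how the alert rules are labelled, and the webhook URL is the same placeholder used above.

```yaml
route:
  receiver: email-alert                # default receiver
  group_by: ['alertname']              # batch alerts that share a name
  routes:
    - matchers:
        - severity="critical"          # assumed label set by the alert rules
      receiver: slack-alert

receivers:
  - name: email-alert
    email_configs:
      - to: your-email@example.com
  - name: slack-alert
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'
        channel: '#alerts'
```

Routing by severity keeps noisy, low-priority alerts out of the channel people watch during incidents.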

Health Checks & Uptime Monitoring

Kubernetes supports built-in health checks to ensure your application is running properly.

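Liveness Probe → Checks if the app is alive (the kubelet restarts the container if the probe fails)
Readiness Probe → Checks if the app is ready to serve traffic (the pod is removed from Service endpoints while the probe fails)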

Example Liveness Probe

livenessProbe:
  httpGet:
    path: /
    port: 80
  initialDelaySeconds: 5
  periodSeconds: 10
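
A readiness probe is declared the same way. This sketch assumes the application exposes a `/ready` endpoint on port 80, which is a hypothetical path; use whatever endpoint reports that dependencies are available.

```yaml
readinessProbe:
  httpGet:
    path: /ready            # hypothetical readiness endpoint
    port: 80
  initialDelaySeconds: 5    # wait before the first check
  periodSeconds: 10         # check every 10 seconds
  failureThreshold: 3       # mark the pod unready after 3 consecutive failures
```

Unlike a failed liveness probe, a failed readiness probe does not restart the container; it only stops traffic from being routed to the pod.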

Distributed Tracing Basics

Jaeger

Tracks requests across multiple services
Helps identify slow services and bottlenecks

OpenTelemetry

Vendor-neutral standard (APIs, SDKs, and a Collector) for generating and exporting traces, logs, and metrics
Works with many observability platforms
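
The two tools are often used together: applications send traces to an OpenTelemetry Collector, which forwards them to Jaeger. Here is a rough sketch of such a Collector configuration. It assumes a Jaeger instance that accepts OTLP over gRPC at `jaeger:4317` (a hypothetical address) and disables TLS for simplicity, which is only reasonable inside a trusted network. Component names can change between Collector releases, so verify them against the version in use.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317     # applications send OTLP traces here

exporters:
  otlp:
    endpoint: jaeger:4317          # assumed Jaeger OTLP endpoint
    tls:
      insecure: true               # no TLS; assumption for an in-cluster demo

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```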

SLI (Service Level Indicator)

An SLI is a measurement of how the system actually performs, taken from monitoring data over a defined time window. It shows what is really happening in production.

Example:

Uptime = 99%
API response time = 250ms
Error rate = 1%

These measured values are called SLIs.

SLO (Service Level Objective)

SLO means the target or goal set for system performance. It defines what level of performance you want to maintain.

Example:

Uptime target = 99.9%
Response time should be less than 300ms
Error rate should be less than 0.5%

When a measured SLI falls short of its SLO target, alerts are typically configured to fire.
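
As a sketch of how the error-rate objective above could be checked in Prometheus, the rule group below first records the error ratio (the SLI) and then alerts when it exceeds 0.5% (the SLO). It assumes `http_requests_total` carries a `status` label with the HTTP status code, which depends on how the application is instrumented. The rule names and the 10 minute `for` duration are illustrative.

```yaml
groups:
  - name: slo-example
    rules:
      # SLI: fraction of requests answered with a 5xx status over the last 5 minutes
      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
      # SLO check: fire when the error ratio stays above 0.5%
      - alert: ErrorRateAboveSLO
        expr: job:http_request_errors:ratio_rate5m > 0.005
        for: 10m
        labels:
          severity: critical
```

Rules within a group are evaluated in order, so the alert can reference the recorded series defined just above it.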

SLA (Service Level Agreement)

SLA means a formal agreement between a company and its customers. It defines the promised level of service and consequences if it is not met.

Example:

If uptime is below 99.5%, the customer may receive a refund or credit
It is part of a legal or business contract

Difference Between SLI, SLO, SLA

Feature	SLI (Service Level Indicator)	SLO (Service Level Objective)	SLA (Service Level Agreement)
Meaning	Actual measured performance of a system	Target performance goal	Formal promise to customers
Purpose	Shows real system behavior	Defines expected performance level	Defines service commitment and penalty
Nature	Metric / Data	Goal / Target	Legal / Contract
Based on	Monitoring data	Business requirement	Customer agreement
Example	99% uptime, 250ms response time	99.9% uptime target	Refund if uptime < 99.5%
Usage	Used in monitoring tools	Used in system design	Used in contracts

Best Practices

Detect issues quickly using alerts
Find the root cause using logs and traces
Fix the issue (restart, rollback, or scale)
Inform the team about the incident
Write a postmortem after resolution

Conclusion

Production-grade monitoring is essential for running reliable Kubernetes systems. It helps you continuously track system health, detect issues early, and respond quickly to failures.

A complete monitoring setup includes:

Prometheus for collecting metrics
Grafana for visual dashboards
ELK Stack for log analysis
Alertmanager for alerts and notifications

This combination ensures full visibility into your applications and infrastructure, making your system stable, scalable, and production-ready.