DR. ATABAK KH

Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient cloud platforms.

Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator

Most teams scale on CPU averages. That's easy, and often wrong. Align autoscaling with your p95 latency SLO instead.

Why CPU avg wastes money

  • CPU is not user experience; tail latency comes from queuing and saturation
  • CPU averages don't surface request queuing or tail latency
  • Bursts cause thrash: you scale too late during spikes and stay scaled up too long
  • You pay more while still missing p95 targets

The p95 approach (6 steps)

1) Define the SLO (e.g., 95% of requests < 250 ms)
2) Add instrumentation for p95 latency (per service/endpoint)
3) Configure autoscaling on custom metrics (p95 and/or request queue depth)
4) Add hysteresis and cool-down to avoid flapping
5) Protect queues with lag thresholds and exponential backoff (see the sketch below)
6) Set cost guardrails (budgets, quotas, anomaly detection)
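
A minimal sketch for step 5 (plain Python; the handler and queue wiring are placeholders), showing exponential backoff with jitter before a message is retried:

import random
import time

def process_with_backoff(handle_message, message, max_retries=5,
                         base_delay=0.5, max_delay=30.0):
    """Retry a failing message handler with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return handle_message(message)
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up; let the message go to a dead-letter queue
            # 0.5s, 1s, 2s, ... capped at max_delay, plus jitter to avoid synchronized retries
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay / 2))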

What changes in practice

  • Scale before queue explosion; scale down when tails calm
  • Fewer retry storms and less error amplification
  • Better cost/perf ratio with the same infrastructure

Quick wins (1-2 weeks)

  • Export p95 via OTel/Datadog; create a custom metric
  • Tune HPA policies to p95 and queue lag (not CPU)
  • Right-size instances; fix one noisy endpoint
  • Add daily budget alerts + lifecycle rules for logs (lifecycle-rule sketch below)
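
For the log-retention part of the last quick win, a minimal sketch assuming logs are exported to a Cloud Storage bucket and the google-cloud-storage client (the bucket name and retention period are placeholders):

from google.cloud import storage

# Delete exported log objects after 30 days to cap storage spend
client = storage.Client()
bucket = client.get_bucket("my-log-export-bucket")  # placeholder bucket name
bucket.add_lifecycle_delete_rule(age=30)            # age in days
bucket.patch()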

Example: Kubernetes HPA (custom metrics)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  behavior:
    scaleDown: { stabilizationWindowSeconds: 300 }
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_p95_latency_ms
      target:
        type: AverageValue
        averageValue: "250"

Example: Queue scaling with KEDA (Pub/Sub/Rabbit/Kafka)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-queue
spec:
  scaleTargetRef:
    kind: Deployment
    name: worker
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: queue_lag
      query: sum(queue_lag)  # the backlog metric you export (required by the prometheus scaler)
      threshold: "1000"

What to measure

  • p95 during burst and steady windows
  • Queue lag vs. consumer throughput
  • Spend / 1000 requests (cost-to-serve; worked example below)
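
Cost-to-serve is simple arithmetic; a quick worked example (the spend figure matches the "Before" case below, the request volume is hypothetical):

monthly_spend_eur = 2400         # see the "Before" example below
monthly_requests = 12_000_000    # hypothetical traffic volume
cost_per_1k = monthly_spend_eur / monthly_requests * 1000
print(f"€{cost_per_1k:.2f} per 1000 requests")  # €0.20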

Result: typically a 20-30% cost reduction and fewer incidents, with no code rewrite required.


Real-World Example: Before and After

Before (CPU-based scaling):

  • Average CPU: 45%
  • p95 latency: 450ms (SLO: 250ms)
  • Instances: 10-25 (thrashing)
  • Monthly cost: €2,400
  • Incidents: 3 per month (latency spikes)

After (p95-based scaling):

  • Average CPU: 60% (higher utilization)
  • p95 latency: 180ms (within SLO)
  • Instances: 5-12 (stable)
  • Monthly cost: €1,680 (30% reduction)
  • Incidents: 0 per month

Key insight: CPU doesn’t reflect user experience. p95 latency does.


Implementation Guide

Step 1: Instrument p95 Latency

For Kubernetes (Prometheus):

# Service with Prometheus scrape annotations (deploy alongside your app Deployment)
apiVersion: v1
kind: Service
metadata:
  name: api
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
spec:
  ports:
  - port: 80
    targetPort: 8080
In your application code (Go example):

package main

import (
  "log"
  "net/http"
  "strconv"
  "time"

  "github.com/prometheus/client_golang/prometheus"
  "github.com/prometheus/client_golang/prometheus/promauto"
  "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
  httpRequestDuration = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
      Name:    "http_request_duration_seconds",
      Help:    "HTTP request duration in seconds",
      Buckets: []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0},
    },
    []string{"method", "endpoint", "status"},
  )
)

func handler(w http.ResponseWriter, r *http.Request) {
  start := time.Now()
  statusCode := http.StatusOK // capture the real status via a ResponseWriter wrapper
  // ... your handler logic ...
  duration := time.Since(start).Seconds()
  httpRequestDuration.WithLabelValues(
    r.Method,
    r.URL.Path,
    strconv.Itoa(statusCode),
  ).Observe(duration)
}

func main() {
  // Serve /metrics on :9090 to match the prometheus.io/port annotation above
  go func() {
    mux := http.NewServeMux()
    mux.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9090", mux))
  }()
  http.HandleFunc("/", handler)
  log.Fatal(http.ListenAndServe(":8080", nil)) // app traffic (Service targetPort)
}

For Cloud Run / Cloud Functions:

  • Use OpenTelemetry or Cloud Monitoring client libraries (see the sketch after this list)
  • Export custom metrics to Cloud Monitoring
  • Create custom metrics from request logs
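
A minimal sketch (assuming the opentelemetry-sdk and opentelemetry-exporter-gcp-monitoring packages; the meter and attribute names are placeholders): record request durations as an OpenTelemetry histogram and let Cloud Monitoring compute the 95th percentile from the exported distribution.

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
# Assumes the opentelemetry-exporter-gcp-monitoring package is installed
from opentelemetry.exporter.cloud_monitoring import CloudMonitoringMetricsExporter

# Push metrics to Cloud Monitoring once a minute
reader = PeriodicExportingMetricReader(
    CloudMonitoringMetricsExporter(), export_interval_millis=60_000
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("api")
request_duration = meter.create_histogram(
    "http_request_duration_ms", unit="ms", description="HTTP request duration"
)

def record_request(duration_ms, endpoint, status):
    # Call from your request middleware after each response
    request_duration.record(duration_ms, {"endpoint": endpoint, "status": str(status)})

The same histogram works with any OTLP backend if you swap the exporter.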

Step 2: Create Custom Metrics

Prometheus recording rule:

groups:
- name: slo
  interval: 30s
  rules:
  - record: http_p95_latency_ms
    expr: |
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
      ) * 1000

Cloud Monitoring custom metric:

# Python example for Cloud Run
from google.api import metric_pb2 as ga_metric
from google.cloud import monitoring_v3
import time

PROJECT_ID = "my-project"  # placeholder
client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{PROJECT_ID}"

# Create custom metric descriptor (one-time setup)
descriptor = ga_metric.MetricDescriptor()
descriptor.type = "custom.googleapis.com/http_p95_latency_ms"
descriptor.metric_kind = ga_metric.MetricDescriptor.MetricKind.GAUGE
descriptor.value_type = ga_metric.MetricDescriptor.ValueType.DOUBLE
descriptor.description = "p95 latency in milliseconds"

client.create_metric_descriptor(
    name=project_name, metric_descriptor=descriptor
)

# Write a data point (p95_latency_ms comes from your request histogram)
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/http_p95_latency_ms"
series.resource.type = "cloud_run_revision"
series.resource.labels["service_name"] = "api"
series.resource.labels["revision_name"] = "api-001"

interval = monitoring_v3.TimeInterval({"end_time": {"seconds": int(time.time())}})
point = monitoring_v3.Point(
    {"interval": interval, "value": {"double_value": p95_latency_ms}}
)
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])

Step 3: Configure HPA with Custom Metrics

Kubernetes HPA with Prometheus adapter:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min cooldown
      policies:
      - type: Percent
        value: 50  # Scale down max 50% at a time
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
      - type: Percent
        value: 100  # Can double replicas
        periodSeconds: 30
      - type: Pods
        value: 4  # Or add 4 pods
        periodSeconds: 30
      selectPolicy: Max  # Use the more aggressive policy
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_p95_latency_ms
      target:
        type: AverageValue
        averageValue: "250"  # Target: 250ms p95
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80  # CPU as a safety net; HPA scales on whichever metric needs more replicas

Troubleshooting HPA:

# Check HPA status
kubectl get hpa api-hpa

# Describe HPA to see scaling decisions
kubectl describe hpa api-hpa

# Check if metrics are available
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_p95_latency_ms"

Step 4: Add Queue-Based Scaling (Optional)

For services with queues (Pub/Sub, RabbitMQ, Kafka), add queue lag metrics:

KEDA ScaledObject for Pub/Sub:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-queue
spec:
  scaleTargetRef:
    kind: Deployment
    name: worker
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: pubsub_subscription_num_undelivered_messages
      threshold: "1000"  # Scale when > 1000 messages
      query: |
        sum(pubsub_subscription_num_undelivered_messages{subscription="worker-queue"})
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: http_p95_latency_ms
      threshold: "250"
      query: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
        ) * 1000

Cloud Run / Cloud Functions Alternative

If you’re not on Kubernetes, use Cloud Run with custom metrics:

1. Export p95 latency as custom metric:

# In your Cloud Run service
from google.cloud import monitoring_v3
import time

PROJECT_ID = "my-project"  # placeholder

def write_p95_metric(p95_latency_ms):
    client = monitoring_v3.MetricServiceClient()
    project_name = f"projects/{PROJECT_ID}"

    series = monitoring_v3.TimeSeries()
    series.metric.type = "custom.googleapis.com/http_p95_latency_ms"
    series.resource.type = "cloud_run_revision"
    series.resource.labels["service_name"] = "api"

    interval = monitoring_v3.TimeInterval({"end_time": {"seconds": int(time.time())}})
    point = monitoring_v3.Point(
        {"interval": interval, "value": {"double_value": p95_latency_ms}}
    )
    series.points = [point]

    client.create_time_series(name=project_name, time_series=[series])

2. Create alerting policy:

# Cloud Monitoring alerting policy
displayName: "High p95 Latency"
conditions:
- displayName: "p95 latency > 250ms"
  conditionThreshold:
    filter: |
      resource.type = "cloud_run_revision" AND
      metric.type = "custom.googleapis.com/http_p95_latency_ms"
    comparison: COMPARISON_GT
    thresholdValue: 250
    duration: 300s  # 5 minutes
notificationChannels:
- projects/PROJECT_ID/notificationChannels/CHANNEL_ID

3. Adjust Cloud Run settings:

# Increase min instances when latency is high
gcloud run services update api \
  --min-instances=3 \
  --concurrency=40 \
  --max-instances=50 \
  --cpu-boost

4. Automate the reaction to latency alerts:

Route the alerting policy from step 2 to a Pub/Sub notification channel and trigger a Cloud Function that raises min-instances while p95 is above the SLO (sketch below); a Cloud Scheduler job can pre-scale for predictable daily peaks in the same way.
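
A minimal sketch of such a function (assumes the functions-framework and google-cloud-run packages, a Pub/Sub notification channel on the alert, and placeholder project/region/service names):

import functions_framework
from google.cloud import run_v2

# Placeholder: full Cloud Run service resource name
SERVICE = "projects/my-project/locations/europe-west1/services/api"

@functions_framework.cloud_event
def bump_min_instances(cloud_event):
    """Raise min-instances while a p95-latency alert is firing."""
    client = run_v2.ServicesClient()
    service = client.get_service(name=SERVICE)
    service.template.scaling.min_instance_count = 3  # keep warm capacity during the burst
    client.update_service(service=service).result()  # wait for the rollout to finish

A second alert (or a scheduled job) can lower min-instances again once p95 recovers.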

Monitoring and Alerting

Key Metrics Dashboard

Grafana dashboard queries:

# p95 latency over time
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) * 1000

# Instance count
count(kube_pod_info{pod=~"api-.*"})

# Cost per 1000 requests (if you have cost metrics)
sum(rate(cloud_cost_euros[1h])) / sum(rate(http_requests_total[1h])) * 1000

# Queue lag (if applicable)
sum(pubsub_subscription_num_undelivered_messages)

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / 
sum(rate(http_requests_total[5m]))

Alerting Rules

groups:
- name: autoscaling
  rules:
  - alert: HighP95Latency
    expr: |
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
      ) * 1000 > 250
    for: 5m
    annotations:
      summary: "p95 latency above SLO threshold"
      description: "p95 latency is ms, target is 250ms"
  
  - alert: ScalingThrashing
    expr: |
      changes(kube_horizontalpodautoscaler_status_current_replicas[15m]) > 8
    for: 10m
    annotations:
      summary: "HPA is thrashing (scaling up/down rapidly)"
      description: "Replica count changing too frequently"
  
  - alert: QueueLagHigh
    expr: |
      sum(pubsub_subscription_num_undelivered_messages) > 10000
    for: 5m
    annotations:
      summary: "Queue lag is high"
      description: " messages undelivered"

Common Pitfalls and Solutions

Pitfall 1: Metric Delay

Problem: p95 latency metric has 30-60 second delay, causing late scaling.

Solution:

  • Use shorter aggregation windows (1-2 minutes)
  • Combine with queue lag metrics (real-time)
  • Add predictive scaling based on traffic patterns

Pitfall 2: Noisy Metrics

Problem: p95 spikes from occasional slow requests, causing unnecessary scaling.

Solution:

  • Use longer aggregation windows (5 minutes)
  • Add minimum duration before scaling (stabilization window)
  • Filter out known slow endpoints (health checks, admin APIs)

Pitfall 3: Cold Start Latency

Problem: New instances have high latency during startup, triggering more scaling.

Solution:

  • Set minimum instances to avoid cold starts
  • Use startup CPU boost (Cloud Run)
  • Exclude startup period from latency calculations

Pitfall 4: Downstream Dependency Issues

Problem: High latency from database/API, not your service.

Solution:

  • Monitor downstream dependencies separately
  • Scale based on queue depth, not just latency
  • Implement circuit breakers to fail fast (minimal sketch below)
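
A minimal circuit-breaker sketch in plain Python (thresholds are arbitrary; production services would typically use a library or a service-mesh policy):

import time

class CircuitBreaker:
    """Fail fast when a downstream dependency keeps failing."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None = circuit closed

    def call(self, fn, *args, **kwargs):
        # While open, reject immediately instead of queuing behind a slow dependency
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: downstream unavailable")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

Wrap calls to the database or external APIs in breaker.call(...) so downstream outages surface as fast errors instead of inflated p95.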

Cost Optimization Tips

  1. Right-size instances: Use smaller instances with higher concurrency
  2. Set appropriate min/max: Don’t over-provision, but avoid cold starts
  3. Use spot/preemptible instances: For non-critical workloads
  4. Implement cost guardrails: Budget alerts, quota limits
  5. Monitor cost-to-serve: Track €/1000 requests over time

Expected savings:

  • 20-30% cost reduction from better utilization
  • 50-70% fewer incidents from proactive scaling
  • Better user experience from consistent latency

Next Steps

  1. Week 1: Instrument p95 latency, create custom metrics
  2. Week 2: Configure HPA with p95-based scaling
  3. Week 3: Add queue-based scaling (if applicable)
  4. Week 4: Tune thresholds, add alerts, measure impact

Cloud Run/Gateway note: If you’re not on K8s, tune concurrency, min/max instances, and trigger scale via custom metrics + alerts.

Want help implementing this?

