DR. ATABAK KH

Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient Cloud platforms.

Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator

Most teams scale on CPU averages. That’s easy, and often wrong. Align autoscaling with your p95 latency SLO instead.

Why CPU averages waste money

  • CPU utilization is not user experience; tail latency comes from queuing and saturation
  • Average CPU doesn’t reflect request queuing or tail latency
  • Burst patterns create thrash: you scale up too late during bursts and stay scaled up too long
  • You pay more while still missing p95 targets

The p95 approach (6 steps)

1) Define the SLO (e.g., 95% of requests < 250 ms)
2) Add instrumentation for p95 latency (per service/endpoint)
3) Configure autoscaling on custom metrics (p95 and/or request queue depth)
4) Add hysteresis and cool-down windows to avoid flapping
5) Protect queues with lag thresholds and exponential backoff
6) Set cost guardrails (budgets, quotas, anomaly detection)

What changes in practice

  • Scale before queue explosion; scale down when tails calm
  • Fewer retry storms and less error amplification
  • Better cost/perf ratio with the same infrastructure

Quick wins (1-2 weeks)

  • Export p95 via OTel/Datadog; create a custom metric
  • Tune HPA policies to p95 and queue lag (not CPU)
  • Right-size instances; fix one noisy endpoint
  • Add daily budget alerts + lifecycle rules for logs
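
The last quick win rarely needs more than a few lines. Here is a minimal sketch for the log-lifecycle half, assuming logs are exported to a GCS bucket and the google-cloud-storage client is available (the bucket name is a placeholder); budget alerts themselves are quickest to set up in the Billing console or via the Budgets API.

# Sketch: delete exported log objects after 30 days via a GCS lifecycle rule.
# "my-log-bucket" is a placeholder for the bucket your logs are exported to.
from google.cloud import storage

def add_log_retention(bucket_name: str = "my-log-bucket", days: int = 30) -> None:
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    bucket.add_lifecycle_delete_rule(age=days)  # append a delete rule to the lifecycle config
    bucket.patch()  # persist the updated lifecycle configuration

if __name__ == "__main__":
    add_log_retention()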

Example: Kubernetes HPA (custom metrics)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  behavior:
    scaleDown: { stabilizationWindowSeconds: 300 }
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_p95_latency_ms
      target:
        type: AverageValue
        averageValue: "250"

Example: Queue scaling with KEDA (Pub/Sub/Rabbit/Kafka)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-queue
spec:
  scaleTargetRef:
    kind: Deployment
    name: worker
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: queue_lag
      threshold: "1000"
      query: sum(queue_lag)  # PromQL query for the lag metric (required by the prometheus scaler)

What to measure

  • p95 during burst and steady windows
  • Queue lag vs. consumer throughput
  • Spend / 1000 requests (cost-to-serve)

Result: Typically a 20-30% cost reduction and fewer incidents, with no code rewrite required.


Real-World Example: Before and After

Before (CPU-based scaling):

  • Average CPU: 45%
  • p95 latency: 450ms (SLO: 250ms)
  • Instances: 10-25 (thrashing)
  • Monthly cost: €2,400
  • Incidents: 3 per month (latency spikes)

After (p95-based scaling):

  • Average CPU: 60% (higher utilization)
  • p95 latency: 180ms (within SLO)
  • Instances: 5-12 (stable)
  • Monthly cost: €1,680 (30% reduction)
  • Incidents: 0 per month

Key insight: CPU doesn’t reflect user experience. p95 latency does.


Implementation Guide

Step 1: Instrument p95 Latency

For Kubernetes (Prometheus):

# Service exposing the app, annotated for Prometheus scraping
apiVersion: v1
kind: Service
metadata:
  name: api
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
spec:
  ports:
  - port: 80
    targetPort: 8080
---
# In your application code (Go example)
import (
  "net/http"
  "strconv"
  "time"

  "github.com/prometheus/client_golang/prometheus"
  "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
  httpRequestDuration = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
      Name: "http_request_duration_seconds",
      Help: "HTTP request duration in seconds",
      Buckets: []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0},
    },
    []string{"method", "endpoint", "status"},
  )
)

func handler(w http.ResponseWriter, r *http.Request) {
  start := time.Now()
  statusCode := http.StatusOK // set this from your handler's actual response status
  // ... your handler logic ...
  duration := time.Since(start).Seconds()
  httpRequestDuration.WithLabelValues(
    r.Method,
    r.URL.Path, // consider a route template instead of the raw path to limit label cardinality
    strconv.Itoa(statusCode),
  ).Observe(duration)
}

For Cloud Run / Cloud Functions:

  • Use OpenTelemetry or Cloud Monitoring client libraries (see the sketch after this list)
  • Export custom metrics to Cloud Monitoring
  • Create custom metrics from request logs
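
A minimal sketch of the OpenTelemetry route above (illustrative, not a complete setup): record request duration in a histogram and let a configured exporter ship it. The console exporter below is a stand-in for the Cloud Monitoring or OTLP exporter you would deploy with.

# Sketch: record per-request latency with an OpenTelemetry histogram.
# The console exporter is a stand-in; swap in the Cloud Monitoring or OTLP
# exporter package you actually use in production.
import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=60_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("api")

# Histogram of request durations; p95 is derived from the bucketed data downstream.
request_latency_ms = meter.create_histogram(
    "http_server_duration", unit="ms", description="HTTP request duration"
)

def handle_request(path: str, method: str) -> None:
    start = time.monotonic()
    # ... your handler logic ...
    duration_ms = (time.monotonic() - start) * 1000
    request_latency_ms.record(duration_ms, {"endpoint": path, "method": method})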

Step 2: Create Custom Metrics

Prometheus recording rule:

groups:
- name: slo
  interval: 30s
  rules:
  - record: http_p95_latency_ms
    expr: |
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
      ) * 1000

Cloud Monitoring custom metric:

# Python example for Cloud Run
from google.cloud import monitoring_v3
import time

client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{PROJECT_ID}"  # PROJECT_ID: your GCP project ID

# Create custom metric descriptor
descriptor = monitoring_v3.MetricDescriptor()
descriptor.type = "custom.googleapis.com/http_p95_latency_ms"
descriptor.metric_kind = monitoring_v3.MetricDescriptor.MetricKind.GAUGE
descriptor.value_type = monitoring_v3.MetricDescriptor.ValueType.DOUBLE
descriptor.description = "p95 latency in milliseconds"

client.create_metric_descriptor(
    name=project_name, metric_descriptor=descriptor
)

# Write a data point (p95_latency_ms is computed from your own measurements)
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/http_p95_latency_ms"
series.resource.type = "cloud_run_revision"
series.resource.labels["service_name"] = "api"
series.resource.labels["revision_name"] = "api-001"

interval = monitoring_v3.TimeInterval({"end_time": {"seconds": int(time.time())}})
point = monitoring_v3.Point(
    {"interval": interval, "value": {"double_value": p95_latency_ms}}
)
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])

Step 3: Configure HPA with Custom Metrics

Kubernetes HPA with Prometheus adapter:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min cooldown
      policies:
      - type: Percent
        value: 50  # Scale down max 50% at a time
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
      - type: Percent
        value: 100  # Can double replicas
        periodSeconds: 30
      - type: Pods
        value: 4  # Or add 4 pods
        periodSeconds: 30
      selectPolicy: Max  # Use the more aggressive policy
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_p95_latency_ms
      target:
        type: AverageValue
        averageValue: "250"  # Target: 250ms p95
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80  # Fallback to CPU if latency metric unavailable

Troubleshooting HPA:

# Check HPA status
kubectl get hpa api-hpa

# Describe HPA to see scaling decisions
kubectl describe hpa api-hpa

# Check if metrics are available
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_p95_latency_ms"

Step 4: Add Queue-Based Scaling (Optional)

For services with queues (Pub/Sub, RabbitMQ, Kafka), add queue lag metrics:

KEDA ScaledObject for Pub/Sub:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-queue
spec:
  scaleTargetRef:
    kind: Deployment
    name: worker
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: pubsub_subscription_num_undelivered_messages
      threshold: "1000"  # Scale when > 1000 messages
      query: |
        sum(pubsub_subscription_num_undelivered_messages{subscription="worker-queue"})
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: http_p95_latency_ms
      threshold: "250"
      query: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
        ) * 1000

Cloud Run / Cloud Functions Alternative

If you’re not on Kubernetes, use Cloud Run with custom metrics:

1. Export p95 latency as custom metric:

# In your Cloud Run service
from google.cloud import monitoring_v3
import time

def write_p95_metric(p95_latency_ms):
    client = monitoring_v3.MetricServiceClient()
    project_name = f"projects/{PROJECT_ID}"  # PROJECT_ID: your GCP project ID

    series = monitoring_v3.TimeSeries()
    series.metric.type = "custom.googleapis.com/http_p95_latency_ms"
    series.resource.type = "cloud_run_revision"
    series.resource.labels["service_name"] = "api"

    interval = monitoring_v3.TimeInterval({"end_time": {"seconds": int(time.time())}})
    point = monitoring_v3.Point(
        {"interval": interval, "value": {"double_value": p95_latency_ms}}
    )
    series.points = [point]

    client.create_time_series(name=project_name, time_series=[series])

2. Create alerting policy:

# Cloud Monitoring alerting policy
displayName: "High p95 Latency"
conditions:
- displayName: "p95 latency > 250ms"
  conditionThreshold:
    filter: |
      resource.type = "cloud_run_revision"
      metric.type = "custom.googleapis.com/http_p95_latency_ms"
    comparison: COMPARISON_GT
    thresholdValue: 250
    duration: 300s  # 5 minutes
notificationChannels:
- projects/PROJECT_ID/notificationChannels/CHANNEL_ID

3. Adjust Cloud Run settings:

# Increase min instances when latency is high
gcloud run services update api \
  --min-instances=3 \
  --concurrency=40 \
  --max-instances=50 \
  --cpu-boost

4. Automate scaling adjustments based on metrics:

# Create a Cloud Function that adjusts Cloud Run min-instances,
# triggered by Cloud Monitoring alerts (delivered via Pub/Sub) or on a schedule with Cloud Scheduler
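
A rough sketch of that glue, assuming the google-cloud-run client library and a Monitoring alert wired to a Pub/Sub notification channel; PROJECT_ID, REGION, and SERVICE are placeholders, and the replica floors are arbitrary.

# Sketch: Pub/Sub-triggered Cloud Function that raises min-instances for a
# Cloud Run service while a latency incident is open.
import base64
import json

from google.cloud import run_v2

PROJECT_ID = "my-project"   # placeholder
REGION = "europe-west1"     # placeholder
SERVICE = "api"             # placeholder

def on_latency_alert(event, context):
    """Entry point for a Pub/Sub-triggered Cloud Function."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    state = payload.get("incident", {}).get("state", "")

    client = run_v2.ServicesClient()
    name = f"projects/{PROJECT_ID}/locations/{REGION}/services/{SERVICE}"
    service = client.get_service(name=name)

    # Raise the floor while the incident is open, relax it when it closes.
    service.template.scaling.min_instance_count = 3 if state == "open" else 1
    operation = client.update_service(service=service)
    operation.result()  # wait for the new revision to roll out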

Monitoring and Alerting

Key Metrics Dashboard

Grafana dashboard queries:

# p95 latency over time
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) * 1000

# Instance count
count(kube_pod_info{pod=~"api-.*"})

# Cost per 1000 requests (if you have cost metrics)
sum(rate(cloud_cost_euros[1h])) / sum(rate(http_requests_total[1h])) * 1000

# Queue lag (if applicable)
sum(pubsub_subscription_num_undelivered_messages)

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / 
sum(rate(http_requests_total[5m]))

Alerting Rules

groups:
- name: autoscaling
  rules:
  - alert: HighP95Latency
    expr: |
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
      ) * 1000 > 250
    for: 5m
    annotations:
      summary: "p95 latency above SLO threshold"
      description: "p95 latency is ms, target is 250ms"
  
  - alert: ScalingThrashing
    expr: |
      changes(kube_horizontalpodautoscaler_status_current_replicas[10m]) > 4
    for: 10m
    annotations:
      summary: "HPA is thrashing (scaling up/down rapidly)"
      description: "Replica count changing too frequently"
  
  - alert: QueueLagHigh
    expr: |
      sum(pubsub_subscription_num_undelivered_messages) > 10000
    for: 5m
    annotations:
      summary: "Queue lag is high"
      description: " messages undelivered"

Common Pitfalls and Solutions

Pitfall 1: Metric Delay

Problem: p95 latency metric has 30-60 second delay, causing late scaling.

Solution:

  • Use shorter aggregation windows (1-2 minutes)
  • Combine with queue lag metrics (real-time)
  • Add predictive scaling based on traffic patterns

Pitfall 2: Noisy Metrics

Problem: p95 spikes from occasional slow requests, causing unnecessary scaling.

Solution:

  • Use longer aggregation windows (5 minutes)
  • Add minimum duration before scaling (stabilization window)
  • Filter out known slow endpoints (health checks, admin APIs)
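
For the last point, the filter can live in the instrumentation layer. A small sketch (EXCLUDED_PATHS and the histogram argument are illustrative, not part of the earlier examples): skip recording requests that are expected to be slow or irrelevant to the user-facing SLO.

# Sketch: skip health checks and admin endpoints when recording SLO latency,
# so occasional slow non-user requests don't distort p95.
EXCLUDED_PATHS = {"/healthz", "/readyz", "/metrics", "/admin"}

def record_latency(path: str, duration_ms: float, histogram) -> None:
    if path in EXCLUDED_PATHS or path.startswith("/admin/"):
        return  # not part of the user-facing SLO
    histogram.record(duration_ms, {"endpoint": path})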

Pitfall 3: Cold Start Latency

Problem: New instances have high latency during startup, triggering more scaling.

Solution:

  • Set minimum instances to avoid cold starts
  • Use startup CPU boost (Cloud Run)
  • Exclude startup period from latency calculations

Pitfall 4: Downstream Dependency Issues

Problem: High latency from database/API, not your service.

Solution:

  • Monitor downstream dependencies separately
  • Scale based on queue depth, not just latency
  • Implement circuit breakers to fail fast
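
A minimal circuit-breaker sketch (illustrative only; in production you would usually reach for a library or a service-mesh policy): fail fast once the downstream has been erroring, then allow a probe call after a cool-off period.

# Minimal circuit breaker sketch: fail fast when a downstream dependency keeps
# erroring, then let one probe call through after a cool-off. Thresholds are
# illustrative.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: failing fast")
            # Cool-off elapsed: half-open, let one probe request through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.opened_at = None
            return result

Wrapping the downstream call (e.g., breaker.call(db_query, ...)) keeps a dependency's latency spikes from turning into request pile-ups and autoscaling churn in your own service.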

Cost Optimization Tips

  1. Right-size instances: Use smaller instances with higher concurrency
  2. Set appropriate min/max: Don’t over-provision, but avoid cold starts
  3. Use spot/preemptible instances: For non-critical workloads
  4. Implement cost guardrails: Budget alerts, quota limits
  5. Monitor cost-to-serve: Track €/1000 requests over time

Expected savings:

  • 20-30% cost reduction from better utilization
  • 50-70% fewer incidents from proactive scaling
  • Better user experience from consistent latency

Next Steps

  1. Week 1: Instrument p95 latency, create custom metrics
  2. Week 2: Configure HPA with p95-based scaling
  3. Week 3: Add queue-based scaling (if applicable)
  4. Week 4: Tune thresholds, add alerts, measure impact

Cloud Run/Gateway note: If you’re not on K8s, tune concurrency, min/max instances, and trigger scale via custom metrics + alerts.

Want help implementing this?

