DR. ATABAK KH
Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient cloud platforms.
Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator
Most teams scale on CPU averages. That's easy, and often wrong. Align autoscaling with your p95 latency SLO instead.
1) Define the SLO (e.g., 95% of requests < 250 ms)
2) Add instrumentation for p95 latency (per service/endpoint)
3) Configure autoscaling on custom metrics (p95 and/or request queue depth)
4) Add hysteresis and cool-downs to avoid flapping
5) Protect queues with lag thresholds and exponential backoff
6) Set cost guardrails (budgets, quotas, anomaly detection)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  behavior:
    scaleDown: { stabilizationWindowSeconds: 300 }
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_p95_latency_ms
        target:
          type: AverageValue
          averageValue: "250"
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-queue
spec:
  scaleTargetRef:
    kind: Deployment
    name: worker
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: queue_lag
        threshold: "1000"
        query: sum(queue_lag)  # PromQL for the trigger (required by KEDA's prometheus scaler)
Result: Typically a 20-30% cost reduction with fewer incidents, and no code rewrite required.
Before (CPU-based scaling) vs. after (p95-based scaling): the key insight is that CPU doesn't reflect user experience; p95 latency does.
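One quick way to see the gap, assuming the http_request_duration_seconds histogram instrumented below, is to compare mean latency with p95 in Prometheus:

# mean latency (what average-based thinking optimizes for)
sum(rate(http_request_duration_seconds_sum[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))

# p95 latency (what users actually feel)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)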
For Kubernetes (Prometheus):
# Add to your deployment
apiVersion: v1
kind: Service
metadata:
  name: api
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
spec:
  ports:
    - port: 80
      targetPort: 8080
---
// In your application code (Go example)
package main

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Request-duration histogram; the buckets bracket the 250 ms SLO target.
    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0},
        },
        []string{"method", "endpoint", "status"},
    )
)

func handler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    statusCode := http.StatusOK // capture the real status from your response logic

    // ... your handler logic ...

    duration := time.Since(start).Seconds()
    httpRequestDuration.WithLabelValues(
        r.Method,
        r.URL.Path, // prefer a route template over the raw path to limit label cardinality
        strconv.Itoa(statusCode),
    ).Observe(duration)
}
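If you run the Prometheus Operator instead of annotation-based scrape configs, the prometheus.io/* annotations above are usually ignored; a ServiceMonitor is the equivalent. A minimal sketch, assuming the Service carries an app: api label and names its metrics port "metrics" (both assumptions, adjust to your manifests):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
spec:
  selector:
    matchLabels:
      app: api           # assumed Service label
  endpoints:
    - port: metrics      # assumed named port serving /metrics
      path: /metrics
      interval: 15s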
Prometheus recording rule:
groups:
  - name: slo
    interval: 30s
    rules:
      - record: http_p95_latency_ms
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) * 1000
For Cloud Run / Cloud Functions:
Cloud Monitoring custom metric:
# Python example for Cloud Run (google-cloud-monitoring >= 2.x)
from google.cloud import monitoring_v3
import time

client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{PROJECT_ID}"  # PROJECT_ID comes from your config/env

# Create custom metric descriptor
descriptor = monitoring_v3.MetricDescriptor()
descriptor.type = "custom.googleapis.com/http_p95_latency_ms"
descriptor.metric_kind = monitoring_v3.MetricDescriptor.MetricKind.GAUGE
descriptor.value_type = monitoring_v3.MetricDescriptor.ValueType.DOUBLE
descriptor.description = "p95 latency in milliseconds"
client.create_metric_descriptor(
    name=project_name, metric_descriptor=descriptor
)

# Write metric
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/http_p95_latency_ms"
series.resource.type = "cloud_run_revision"
series.resource.labels["service_name"] = "api"
series.resource.labels["revision_name"] = "api-001"

interval = monitoring_v3.TimeInterval({"end_time": {"seconds": int(time.time())}})
point = monitoring_v3.Point(
    {"interval": interval, "value": {"double_value": p95_latency_ms}}
)
series.points = [point]
client.create_time_series(name=project_name, time_series=[series])
Kubernetes HPA with Prometheus adapter:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min cooldown
      policies:
        - type: Percent
          value: 50                    # Scale down max 50% at a time
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0    # Scale up immediately
      policies:
        - type: Percent
          value: 100                   # Can double replicas
          periodSeconds: 30
        - type: Pods
          value: 4                     # Or add 4 pods
          periodSeconds: 30
      selectPolicy: Max                # Use the more aggressive policy
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_p95_latency_ms
        target:
          type: AverageValue
          averageValue: "250"          # Target: 250ms p95
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80       # Fallback to CPU if latency metric unavailable
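For the HPA above to see http_p95_latency_ms, the Prometheus adapter needs a rule that exposes it as a per-pod custom metric. A minimal sketch, assuming the prometheus-adapter Helm chart (rules.custom) and that the histogram carries namespace and pod labels:

rules:
  custom:
    - seriesQuery: 'http_request_duration_seconds_bucket{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: { resource: "namespace" }
          pod: { resource: "pod" }
      name:
        matches: ".*"
        as: "http_p95_latency_ms"
      metricsQuery: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket{<<.LabelMatchers>>}[5m])) by (le, <<.GroupBy>>)
        ) * 1000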
Troubleshooting HPA:
# Check HPA status
kubectl get hpa api-hpa
# Describe HPA to see scaling decisions
kubectl describe hpa api-hpa
# Check if metrics are available
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_p95_latency_ms"
For services with queues (Pub/Sub, RabbitMQ, Kafka), add queue lag metrics:
KEDA ScaledObject for Pub/Sub:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-queue
spec:
  scaleTargetRef:
    kind: Deployment
    name: worker
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: pubsub_subscription_num_undelivered_messages
        threshold: "1000"  # Scale when > 1000 messages
        query: |
          sum(pubsub_subscription_num_undelivered_messages{subscription="worker-queue"})
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_p95_latency_ms
        threshold: "250"
        query: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) * 1000
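Queue triggers cover the lag-threshold half of step 5 in the checklist; the exponential-backoff half lives in the worker itself. A minimal, illustrative Python helper (the function name and defaults are mine, not from any particular library):

import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=0.5, max_delay=30.0):
    """Call fn(), retrying on failure with exponential backoff plus jitter.

    Keeps a freshly scaled-out worker fleet from hammering a struggling
    downstream dependency.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up; let the message be redelivered or dead-lettered
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter avoids a thundering herd

# usage (with your own callable): call_with_backoff(lambda: publish(result))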
If you’re not on Kubernetes, use Cloud Run with custom metrics:
1. Export p95 latency as custom metric:
# In your Cloud Run service
from google.cloud import monitoring_v3
import time

def write_p95_metric(p95_latency_ms):
    client = monitoring_v3.MetricServiceClient()
    project_name = f"projects/{PROJECT_ID}"  # PROJECT_ID from your config/env

    series = monitoring_v3.TimeSeries()
    series.metric.type = "custom.googleapis.com/http_p95_latency_ms"
    series.resource.type = "cloud_run_revision"
    series.resource.labels["service_name"] = "api"

    interval = monitoring_v3.TimeInterval({"end_time": {"seconds": int(time.time())}})
    point = monitoring_v3.Point(
        {"interval": interval, "value": {"double_value": p95_latency_ms}}
    )
    series.points = [point]
    client.create_time_series(name=project_name, time_series=[series])
2. Create alerting policy:
# Cloud Monitoring alerting policy
displayName: "High p95 Latency"
conditions:
  - displayName: "p95 latency > 250ms"
    conditionThreshold:
      filter: |
        resource.type = "cloud_run_revision" AND
        metric.type = "custom.googleapis.com/http_p95_latency_ms"
      comparison: COMPARISON_GT
      thresholdValue: 250
      duration: 300s  # 5 minutes
notificationChannels:
  - projects/PROJECT_ID/notificationChannels/CHANNEL_ID
3. Adjust Cloud Run settings:
# Increase min instances when latency is high
gcloud run services update api \
--min-instances=3 \
--concurrency=40 \
--max-instances=50 \
--cpu-boost
4. Automate scaling changes based on metrics:
# Create a Cloud Function that adjusts Cloud Run scaling based on the custom metric
# Trigger it from Cloud Monitoring alerts (or on a Cloud Scheduler cadence), as sketched below
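A minimal sketch of that function, assuming the google-cloud-run (run_v2) client and an HTTP-triggered function wired to a Cloud Monitoring webhook notification channel; the project, region, service name, and instance counts are placeholders:

import functions_framework
from google.cloud import run_v2

PROJECT_ID = "my-project"   # placeholder
REGION = "europe-west1"     # placeholder
SERVICE = "api"             # placeholder

@functions_framework.http
def scale_on_alert(request):
    """Raise Cloud Run min instances while a latency alert is open, relax it when it closes."""
    incident = (request.get_json(silent=True) or {}).get("incident", {})
    firing = incident.get("state") == "open"

    client = run_v2.ServicesClient()
    name = f"projects/{PROJECT_ID}/locations/{REGION}/services/{SERVICE}"
    service = client.get_service(name=name)

    # Bump the floor while the alert fires, drop it back afterwards.
    service.template.scaling.min_instance_count = 5 if firing else 1
    client.update_service(service=service).result()  # wait for the new revision to roll out

    return ("ok", 200)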
Grafana dashboard queries:
# p95 latency over time
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) * 1000

# Instance count
count(kube_pod_info{pod=~"api-.*"})

# Cost per 1000 requests (if you have cost metrics)
sum(rate(cloud_cost_euros[1h])) / sum(rate(http_requests_total[1h])) * 1000

# Queue lag (if applicable)
sum(pubsub_subscription_num_undelivered_messages)

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
Prometheus alerting rules:
groups:
  - name: autoscaling
    rules:
      - alert: HighP95Latency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) * 1000 > 250
        for: 5m
        annotations:
          summary: "p95 latency above SLO threshold"
          description: "p95 latency is {{ $value }}ms, target is 250ms"
      - alert: ScalingThrashing
        expr: |
          changes(kube_horizontalpodautoscaler_status_current_replicas[10m]) > 5
        for: 10m
        annotations:
          summary: "HPA is thrashing (scaling up/down rapidly)"
          description: "Replica count changing too frequently"
      - alert: QueueLagHigh
        expr: |
          sum(pubsub_subscription_num_undelivered_messages) > 10000
        for: 5m
        annotations:
          summary: "Queue lag is high"
          description: "{{ $value }} messages undelivered"
Problem: The p95 latency metric lags by 30-60 seconds, causing late scaling.
Solution: Pair the latency metric with a faster leading signal (request rate or queue depth), shorten scrape and evaluation intervals where practical, and keep scale-up stabilization at 0 so the HPA reacts as soon as the metric moves.
Problem: p95 spikes from occasional slow requests, causing unnecessary scaling.
Solution: Compute p95 over a longer rate window, require the breach to persist before acting (the alert's "for:" clause and HPA stabilization), and keep known-slow, non-user-facing endpoints out of the SLO metric.
Problem: New instances have high latency during startup, triggering more scaling.
Solution: Add startup and readiness probes so pods only receive traffic once warm, and keep a long scale-down stabilization window so the brief startup spike doesn't cause another scaling round (see the probe sketch after this list).
Problem: High latency comes from a database or downstream API, not your service.
Solution: More replicas won't help and may add load on the dependency; use timeouts, caching, backoff, and circuit breaking, and alert on dependency latency separately so you scale (or fix) the right component.
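For the cold-start case, a minimal probe sketch for the api Deployment's container spec (the endpoint paths and thresholds are illustrative, not from the original setup):

startupProbe:
  httpGet:
    path: /healthz        # illustrative health endpoint
    port: 8080
  periodSeconds: 2
  failureThreshold: 30    # allow up to ~60s to start before restarting
readinessProbe:
  httpGet:
    path: /ready          # illustrative readiness endpoint
    port: 8080
  periodSeconds: 5
  failureThreshold: 3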
Expected savings: typically 20-30% lower compute spend (as noted above), plus fewer latency-related incidents and pages.
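Those savings hold up best with the cost guardrails from step 6 of the checklist in place; one example is a billing budget with alert thresholds. A sketch using gcloud (the amounts and IDs are placeholders; check the flags against your gcloud version):

# Alert at 90% and 100% of a monthly budget
gcloud billing budgets create \
  --billing-account=BILLING_ACCOUNT_ID \
  --display-name="platform-autoscaling-budget" \
  --budget-amount=1000.00EUR \
  --threshold-rule=percent=0.9 \
  --threshold-rule=percent=1.0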
Cloud Run/Gateway note: If you're not on K8s, tune concurrency and min/max instances, and trigger scaling via custom metrics and alerts.
Want help implementing this?
This is a personal blog. The views, thoughts, and opinions expressed here are my own and do not represent, reflect, or constitute the views, policies, or positions of any employer, university, client, or organization I am associated with or have been associated with.