DR. ATABAK KH
Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient cloud platforms.
Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator
Most teams scale on CPU averages. That's easy, and often wrong. Align autoscaling with your p95 latency SLO instead.
1) Define the SLO (e.g., 95% of requests < 250 ms)
2) Add instrumentation for p95 latency (per service/endpoint)
3) Configure autoscaling on custom metrics (p95 and/or request queue depth)
4) Add hysteresis and cool-downs to avoid flapping
5) Protect queues with lag thresholds and exponential backoff
6) Set cost guardrails (budgets, quotas, anomaly detection)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  behavior:
    scaleDown: { stabilizationWindowSeconds: 300 }
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_p95_latency_ms
        target:
          type: AverageValue
          averageValue: "250"
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-queue
spec:
  scaleTargetRef:
    kind: Deployment
    name: worker
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: queue_lag
        threshold: "1000"
        query: sum(queue_lag)
Result: typically a 20-30% cost reduction and fewer incidents, with no code rewrite required.
[Before/after comparison: CPU-based scaling vs. p95-based scaling]
Key insight: CPU doesn’t reflect user experience. p95 latency does.
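A tiny synthetic example (the numbers are invented purely for illustration) makes the point: the average stays comfortable while the p95 shows a tenth of users far outside the SLO.

# Synthetic illustration: average latency looks fine while p95 breaches the 250 ms SLO.
import statistics

# 90 fast requests and 10 slow ones (milliseconds) -- made-up numbers.
latencies_ms = [80] * 90 + [900] * 10

mean_ms = statistics.mean(latencies_ms)
p95_ms = sorted(latencies_ms)[int(0.95 * len(latencies_ms)) - 1]

print(f"mean latency: {mean_ms:.0f} ms")  # 162 ms -- would not trigger average/CPU-based scaling
print(f"p95 latency:  {p95_ms} ms")       # 900 ms -- far outside the 250 ms SLO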
For Kubernetes (Prometheus):
# Service with Prometheus scrape annotations
apiVersion: v1
kind: Service
metadata:
  name: api
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
spec:
  ports:
    - port: 80
      targetPort: 8080
---
// In your application code (Go example)
package main

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Request duration histogram; buckets chosen around the 250 ms SLO.
    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0},
        },
        []string{"method", "endpoint", "status"},
    )
)

func handler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    statusCode := http.StatusOK // in production, capture the real status via a wrapping ResponseWriter

    // ... your handler logic ...

    duration := time.Since(start).Seconds()
    httpRequestDuration.WithLabelValues(
        r.Method,
        r.URL.Path,
        strconv.Itoa(statusCode),
    ).Observe(duration)
}
For Cloud Run / Cloud Functions, export the same measurement as a Cloud Monitoring custom metric (shown below).
On Kubernetes, a Prometheus recording rule turns the raw histogram into the p95 gauge the autoscaler consumes:
groups:
  - name: slo
    interval: 30s
    rules:
      - record: http_p95_latency_ms
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) * 1000
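To sanity-check that the rule is producing data before wiring it into the autoscaler, you can query the Prometheus HTTP API directly (a minimal sketch; the in-cluster server address is an assumption):

# Minimal check that the recorded p95 series exists (assumes Prometheus is reachable at this address).
import requests

resp = requests.get(
    "http://prometheus:9090/api/v1/query",
    params={"query": "http_p95_latency_ms"},
    timeout=5,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    labels, (_, value) = series["metric"], series["value"]
    print(f'{labels.get("service", "unknown")}: p95 = {float(value):.1f} ms')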
Cloud Monitoring custom metric:
# Python example for Cloud Run (google-cloud-monitoring v2 client)
from google.cloud import monitoring_v3
import time

PROJECT_ID = "your-project-id"  # placeholder
p95_latency_ms = 180.0          # computed from your own latency samples

client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{PROJECT_ID}"

# Create custom metric descriptor (one-time setup)
descriptor = monitoring_v3.MetricDescriptor()
descriptor.type = "custom.googleapis.com/http_p95_latency_ms"
descriptor.metric_kind = monitoring_v3.MetricDescriptor.MetricKind.GAUGE
descriptor.value_type = monitoring_v3.MetricDescriptor.ValueType.DOUBLE
descriptor.description = "p95 latency in milliseconds"
client.create_metric_descriptor(
    name=project_name, metric_descriptor=descriptor
)

# Write a data point
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/http_p95_latency_ms"
series.resource.type = "cloud_run_revision"
series.resource.labels["service_name"] = "api"
series.resource.labels["revision_name"] = "api-001"
now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now - int(now)) * 1e9)}}
)
point = monitoring_v3.Point(
    {"interval": interval, "value": {"double_value": p95_latency_ms}}
)
series.points = [point]
client.create_time_series(name=project_name, time_series=[series])
Kubernetes HPA with Prometheus adapter:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min cooldown
      policies:
        - type: Percent
          value: 50                    # Scale down max 50% at a time
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0    # Scale up immediately
      policies:
        - type: Percent
          value: 100                   # Can double replicas
          periodSeconds: 30
        - type: Pods
          value: 4                     # Or add 4 pods
          periodSeconds: 30
      selectPolicy: Max                # Use the more aggressive policy
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_p95_latency_ms
        target:
          type: AverageValue
          averageValue: "250"          # Target: 250ms p95
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80       # Fallback to CPU if latency metric unavailable
Troubleshooting HPA:
# Check HPA status
kubectl get hpa api-hpa
# Describe HPA to see scaling decisions
kubectl describe hpa api-hpa
# Check if metrics are available
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_p95_latency_ms"
For services with queues (Pub/Sub, RabbitMQ, Kafka), add queue lag metrics:
KEDA ScaledObject for Pub/Sub:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-queue
spec:
  scaleTargetRef:
    kind: Deployment
    name: worker
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: pubsub_subscription_num_undelivered_messages
        threshold: "1000"  # Scale when > 1000 messages
        query: |
          sum(pubsub_subscription_num_undelivered_messages{subscription="worker-queue"})
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_p95_latency_ms
        threshold: "250"
        query: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) * 1000
If you’re not on Kubernetes, use Cloud Run with custom metrics:
1. Export p95 latency as custom metric:
# In your Cloud Run service
from google.cloud import monitoring_v3
import time

PROJECT_ID = "your-project-id"  # placeholder

def write_p95_metric(p95_latency_ms):
    client = monitoring_v3.MetricServiceClient()
    project_name = f"projects/{PROJECT_ID}"
    series = monitoring_v3.TimeSeries()
    series.metric.type = "custom.googleapis.com/http_p95_latency_ms"
    series.resource.type = "cloud_run_revision"
    series.resource.labels["service_name"] = "api"
    now = time.time()
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": int(now), "nanos": int((now - int(now)) * 1e9)}}
    )
    point = monitoring_v3.Point(
        {"interval": interval, "value": {"double_value": p95_latency_ms}}
    )
    series.points = [point]
    client.create_time_series(name=project_name, time_series=[series])
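write_p95_metric expects a precomputed p95 value. One way to produce it (a sketch, not necessarily how you would do it in production) is to keep a rolling window of recent request latencies in process and periodically compute the 95th percentile:

# Hypothetical helper: compute p95 from an in-process rolling window of latencies.
import math
import threading
from collections import deque

_latencies_ms = deque(maxlen=5000)  # most recent request latencies, in milliseconds
_lock = threading.Lock()

def record_latency(latency_ms: float) -> None:
    """Call this at the end of every request handler."""
    with _lock:
        _latencies_ms.append(latency_ms)

def current_p95_ms() -> float:
    """Return the 95th percentile of the recorded latencies (0.0 if empty)."""
    with _lock:
        samples = sorted(_latencies_ms)
    if not samples:
        return 0.0
    idx = min(len(samples) - 1, math.ceil(0.95 * len(samples)) - 1)
    return samples[idx]

# Periodically (e.g., every 60 s) push the value to Cloud Monitoring:
# write_p95_metric(current_p95_ms())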
2. Create alerting policy:
# Cloud Monitoring alerting policy
displayName: "High p95 Latency"
conditions:
  - displayName: "p95 latency > 250ms"
    conditionThreshold:
      filter: |
        resource.type = "cloud_run_revision"
        metric.type = "custom.googleapis.com/http_p95_latency_ms"
      comparison: COMPARISON_GT
      thresholdValue: 250
      duration: 300s  # 5 minutes
notificationChannels:
  - projects/PROJECT_ID/notificationChannels/CHANNEL_ID
3. Adjust Cloud Run settings:
# Increase min instances when latency is high
gcloud run services update api \
  --min-instances=3 \
  --concurrency=40 \
  --max-instances=50 \
  --cpu-boost
4. Use Cloud Scheduler to scale based on metrics:
# Create a Cloud Function that reads the custom metric and adjusts Cloud Run settings
# Trigger it on a schedule via Cloud Scheduler or from Cloud Monitoring alerts (see the sketch below)
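A minimal sketch of such a function is shown below, assuming the google-cloud-monitoring and google-cloud-run client libraries; the project, region, service name, and scaling target are placeholders, and error handling and operation polling are omitted:

# Hypothetical scaler: read the custom p95 metric and raise Cloud Run min instances when it breaches the SLO.
import time
from google.cloud import monitoring_v3, run_v2

PROJECT_ID = "your-project-id"   # placeholder
REGION = "europe-west1"          # placeholder
SERVICE = "api"                  # placeholder
SLO_MS = 250

def scale_on_latency(request=None):
    # 1) Read the last 5 minutes of the custom p95 metric.
    metric_client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 300}}
    )
    results = metric_client.list_time_series(
        request={
            "name": f"projects/{PROJECT_ID}",
            "filter": 'metric.type = "custom.googleapis.com/http_p95_latency_ms"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    points = [p.value.double_value for ts in results for p in ts.points]
    if not points:
        return "no data"
    latest_p95 = points[0]  # the API returns points newest first

    # 2) If the SLO is breached, raise min instances on the Cloud Run service.
    if latest_p95 > SLO_MS:
        run_client = run_v2.ServicesClient()
        name = f"projects/{PROJECT_ID}/locations/{REGION}/services/{SERVICE}"
        service = run_client.get_service(name=name)
        service.template.scaling.min_instance_count = 5  # placeholder target
        run_client.update_service(service=service)       # long-running operation
        return f"p95={latest_p95:.0f}ms > {SLO_MS}ms, scaled up"
    return f"p95={latest_p95:.0f}ms within SLO"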
Grafana dashboard queries:
# p95 latency over time
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) * 1000
# Instance count
count(kube_pod_info{pod=~"api-.*"})
# Cost per 1000 requests (if you have cost metrics)
sum(rate(cloud_cost_euros[1h])) / sum(rate(http_requests_total[1h])) * 1000
# Queue lag (if applicable)
sum(pubsub_subscription_num_undelivered_messages)
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
Prometheus alerting rules:
groups:
  - name: autoscaling
    rules:
      - alert: HighP95Latency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) * 1000 > 250
        for: 5m
        annotations:
          summary: "p95 latency above SLO threshold"
          description: "p95 latency is {{ $value }} ms, target is 250ms"
      - alert: ScalingThrashing
        expr: |
          rate(kube_horizontalpodautoscaler_status_current_replicas[5m]) > 0.5
        for: 10m
        annotations:
          summary: "HPA is thrashing (scaling up/down rapidly)"
          description: "Replica count changing too frequently"
      - alert: QueueLagHigh
        expr: |
          sum(pubsub_subscription_num_undelivered_messages) > 10000
        for: 5m
        annotations:
          summary: "Queue lag is high"
          description: "{{ $value }} messages undelivered"
Problem: p95 latency metric has 30-60 second delay, causing late scaling.
Solution: shorten the scrape and recording-rule intervals, and pair the latency trigger with a faster-moving signal such as request concurrency or queue depth so scale-up starts before the p95 series catches up.
Problem: p95 spikes from occasional slow requests, causing unnecessary scaling.
Solution: compute p95 over a longer window (e.g., 5 minutes), require the breach to persist before scaling (for: / stabilization windows), and exclude known-slow routes such as health checks or batch endpoints from the SLO metric.
Problem: New instances have high latency during startup, triggering more scaling.
Solution: use readiness/startup probes (or Cloud Run min instances and CPU boost) so traffic only reaches warmed-up instances, and keep a scale-up stabilization or per-period pod limit so slow startups don't cascade into more scaling.
Problem: High latency from database/API, not your service.
Solution: track downstream latency separately (per-dependency histograms) and gate scaling on it: adding replicas won't fix a saturated database, so rely on connection pools, caching, circuit breakers, and backoff instead (see the sketch below).
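For that last point, one way to keep downstream latency from masquerading as your own (a sketch using the Python prometheus_client; the metric and dependency names are illustrative) is to record dependency calls in their own histogram:

# Separate histogram for downstream dependencies, so their latency doesn't masquerade as yours.
from prometheus_client import Histogram

DEPENDENCY_LATENCY = Histogram(
    "dependency_request_duration_seconds",
    "Latency of calls to downstream dependencies",
    ["dependency"],
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def query_orders(db):
    # .time() records the elapsed time into the labelled histogram.
    with DEPENDENCY_LATENCY.labels("postgres").time():
        return db.execute("SELECT * FROM orders WHERE status = 'open'")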
Expected savings: typically 20-30% of compute spend, in line with the results noted above.
Cloud Run/Gateway note: If you’re not on K8s, tune concurrency, min/max instances, and trigger scale via custom metrics + alerts.
Want help implementing this?