DR. ATABAK KH
Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient cloud platforms.
Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator
Problem: set min instances too high and you overpay; set them too low and cold starts show up as tail latency. The goal is to balance the two.
Cloud Run’s serverless model is powerful, but misconfigured concurrency and instance settings can lead to either excessive costs or poor user experience. This guide walks through practical tuning strategies based on real production workloads.
Cloud Run scales instances based on request concurrency: the number of simultaneous requests each container instance handles. Unlike traditional autoscaling driven by CPU/memory metrics, Cloud Run adds instances when existing ones approach their concurrency limit.
The key insight: concurrency is your primary cost lever. Higher concurrency = fewer instances = lower cost, but only if your application can handle it without degrading latency.
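To see why, a back-of-envelope comparison (illustrative numbers):
Peak traffic: 1000 req/s
At concurrency 20: 1000 / 20 = 50 instances
At concurrency 80: 1000 / 80 = 13 instances (roughly 4x fewer)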
Concurrency determines how many requests each instance handles simultaneously. This is your biggest cost optimization opportunity.
Starting points: keep the default of 80 for I/O-bound services that spend most of their time waiting on databases or external APIs; drop to single digits (1-8) for CPU-bound work such as ML inference.
Why it matters: set concurrency too high and requests queue inside each container, inflating tail latency; set it too low and you pay for instances that sit mostly idle.
Real-world example:
# High-traffic API (I/O-bound, database queries)
gcloud run deploy api \
--concurrency=80 \
--cpu=2 \
--memory=2Gi
# ML inference service (CPU-bound, model predictions)
gcloud run deploy ml-service \
--concurrency=4 \
--cpu=4 \
--memory=8Gi
Tuning strategy: start conservative, load test, and raise concurrency step by step until p95 latency starts to degrade, then back off one step (see the week-by-week rollout plan below).
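A quick way to exercise each concurrency step is an HTTP load generator; here is a sketch using hey (the service URL is hypothetical, and any load tool works):
# Sustain 100 concurrent requests for 60 seconds against the service
hey -z 60s -c 100 https://api-xyz-ew.a.run.app/endpoint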
Min instances keep containers warm, eliminating cold start latency for critical paths.
When to use: latency-sensitive, user-facing paths where cold starts are unacceptable, and services with predictable traffic such as business-hours APIs.
Cost trade-off: warm instances are billed even while idle, so every unit of min instances adds a fixed baseline cost around the clock unless you adjust it on a schedule.
Practical guidance: start with 1-3 for critical services, keep 0 for dev/staging, and vary the value by time of day when traffic is predictable, as in the example below.
Example: Business hours only
# During business hours (9 AM - 6 PM)
gcloud run services update api \
--min-instances=3 \
--region=europe-west3
# After hours (via Cloud Scheduler)
gcloud run services update api \
--min-instances=0 \
--region=europe-west3
Monitoring cold starts: track the run.googleapis.com/container/startup_latencies metric to see how often cold starts occur and how long they take.
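For a rough spot check (the metric above is the authoritative source; the URL is hypothetical), time a request after the service has been idle long enough to scale to zero:
# Total request time; a cold start shows up as an outlier
curl -s -o /dev/null -w 'total: %{time_total}s\n' https://api-xyz-ew.a.run.app/healthz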
Max instances caps your maximum spend and protects downstream dependencies.
Why it matters: without a cap, a traffic spike (or a runaway retry loop) can scale you to hundreds of instances, blowing the budget and overwhelming databases and other downstream dependencies.
Setting max instances: size it from peak traffic divided by per-instance concurrency, plus headroom, as in the calculation below.
Example calculation:
Peak traffic: 1000 req/s
Concurrency: 40 req/instance
Required instances: 1000 / 40 = 25 instances
Max instances: 25 * 1.5 = 37.5, round up to 40
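If you want to script the sizing, a minimal bash sketch of the same arithmetic (integer math, rounding up):
# Illustrative values from the calculation above
PEAK_RPS=1000
CONCURRENCY=40
REQUIRED=$(( (PEAK_RPS + CONCURRENCY - 1) / CONCURRENCY ))  # ceiling division -> 25
MAX=$(( REQUIRED * 3 / 2 ))                                 # 1.5x headroom -> 37
echo "required=$REQUIRED max-instances~=$MAX (round up to a convenient value)"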
gcloud run deploy api \
--max-instances=40 \
--concurrency=40
What happens when max is reached: Cloud Run briefly queues incoming requests; if the queue fills, excess requests are rejected with HTTP 429. Watch run.googleapis.com/request_count to detect saturation.

Startup CPU boost allocates extra CPU during container startup, reducing cold start time.
When to enable: services with heavy startup work (JVM warm-up, large framework imports, model loading) and any service that scales from zero on a user-facing path.
Cost impact: the extra CPU is only allocated while the container starts, so the added cost is typically small compared to the latency win.
Enable it:
gcloud run deploy api \
--cpu-boost \
--cpu=2 \
--memory=2Gi
Expected improvement: faster cold starts, with the biggest gains for runtimes that do significant initialization work (for example, JVM services or Python apps with heavy imports).
Here’s a production-ready configuration for a typical API service:
gcloud run deploy api \
--image=gcr.io/my-project/api:latest \
--concurrency=40 \
--min-instances=2 \
--max-instances=50 \
--cpu=2 \
--memory=2Gi \
--cpu-boost \
--timeout=300 \
--region=europe-west3 \
--allow-unauthenticated
Configuration rationale: concurrency 40 balances cost and latency for an I/O-bound API; min 2 keeps the critical path warm; max 50 caps spend while leaving headroom for spikes; CPU boost trims the cold starts that still occur; the 300-second timeout accommodates slow requests without letting them hang indefinitely.
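To confirm what was actually applied, you can inspect the service after deploying (same name and region as above):
# Dump the service's current configuration, including scaling settings
gcloud run services describe api --region=europe-west3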
Key metrics to monitor:
- run.googleapis.com/request_latencies (p50/p95/p99 request latency)
- run.googleapis.com/request_count / run.googleapis.com/container_instance_count (effective requests per instance)
- run.googleapis.com/container/startup_latencies (cold start duration)
- run.googleapis.com/container_instance_count (scaling behavior)
- run.googleapis.com/container/cpu/utilizations (CPU saturation)
- run.googleapis.com/container/memory/utilizations (memory headroom)

Week 1: Baseline
# Start conservative
gcloud run services update api \
--concurrency=20 \
--min-instances=0 \
--max-instances=20
Week 2: Optimize concurrency. Raise concurrency in steps (20 → 40 → 60 → 80), load testing at each step and watching p95 latency; keep the highest value that doesn't degrade it (see the sketch after this list).
Week 3: Add min instances. Once cold start frequency is visible in the metrics, set min instances just high enough to cover the latency-sensitive baseline.
Week 4: Fine-tune max instances. Use observed peak traffic and your final concurrency to size the cap with headroom, as in the calculation earlier.
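A minimal sketch of the Week 2 loop, assuming the service and region from earlier (values are illustrative):
# Raise concurrency one step, then load test and compare p95 before the next step
gcloud run services update api --concurrency=40 --region=europe-west3
# If p95 holds, try 60, then 80; keep the last value that didn't degrade latency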
Symptoms: p95/p99 latency climbs under load while per-instance CPU runs hot; requests are queuing inside each container.
Solution: lower concurrency so each instance handles fewer simultaneous requests.
# Reduce concurrency
gcloud run services update api \
--concurrency=20 # Down from 80
Symptoms: intermittent slow requests after idle periods, with spikes in run.googleapis.com/container/startup_latencies.
Solution: set min instances above zero for the affected service and enable startup CPU boost; a sketch follows.
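A sketch of the fix, assuming `services update` accepts the same flags as the `deploy` commands shown earlier:
# Keep two instances warm and boost CPU during startup
gcloud run services update api \
--min-instances=2 \
--cpu-boost \
--region=europe-west3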
Symptoms: clients receive HTTP 429 responses during traffic peaks; instance count is pinned at the configured maximum.
Solution: raise the max instances cap, after confirming downstream dependencies can absorb the extra load.
# Increase max instances
gcloud run services update api \
--max-instances=100 # Up from 50
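To confirm you are actually hitting the cap, look for throttled requests in the request logs (a sketch; the service name is the one used above):
# Recent requests rejected with 429 for the service
gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="api" AND httpRequest.status=429' --limit=10 --freshness=1h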
Symptoms: the bill is higher than expected even though latency has plenty of headroom; instances run at low CPU utilization.
Solution: raise concurrency so fewer instances carry the same traffic, and lower min instances outside peak hours; a sketch follows.
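A sketch of both levers (illustrative values; verify latency stays healthy after each change):
# Pack more requests per instance and drop the warm baseline
gcloud run services update api \
--concurrency=60 \
--min-instances=1 \
--region=europe-west3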
For services with predictable traffic patterns, adjust min instances by time of day:
Cloud Scheduler jobs:
# Morning ramp-up (8 AM)
gcloud scheduler jobs create http scale-up-morning \
--schedule="0 8 * * *" \
--uri="https://europe-west3-run.googleapis.com/apis/serving.knative.dev/v1/namespaces/my-project/services/api" \
--http-method=PATCH \
--oauth-service-account-email=cloud-run-scheduler@my-project.iam.gserviceaccount.com \
--message-body='{"spec":{"template":{"metadata":{"annotations":{"autoscaling.knative.dev/minScale":"5"}},"spec":{"containerConcurrency":40}}}}'
# Evening scale-down (8 PM)
gcloud scheduler jobs create http scale-down-evening \
--schedule="0 20 * * *" \
--uri="https://europe-west3-run.googleapis.com/apis/serving.knative.dev/v1/namespaces/my-project/services/api" \
--http-method=PATCH \
--oauth-service-account-email=cloud-run-scheduler@my-project.iam.gserviceaccount.com \
--message-body='{"spec":{"template":{"metadata":{"annotations":{"autoscaling.knative.dev/minScale":"1"}},"spec":{"containerConcurrency":40}}}}'
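If you would rather avoid hand-writing Admin API payloads, the same effect can come from running the gcloud command on any scheduled runner (for example, a scheduled CI job):
# Equivalent to the morning ramp-up above
gcloud run services update api --min-instances=5 --region=europe-west3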
With proper tuning, you should see lower costs and fewer tail-latency surprises, with no code rewrite required.
Want the dashboard/runbook templates?