DR. ATABAK KH

Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient cloud platforms.

Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator

Problem: Either you pay too much (min instances too high) or you get cold-start tail latency (min instances too low). Balance it.

Cloud Run’s serverless model is powerful, but misconfigured concurrency and instance settings can lead to either excessive costs or poor user experience. This guide walks through practical tuning strategies based on real production workloads.


Understanding Cloud Run’s Scaling Model

Cloud Run scales instances based on request concurrency: the number of simultaneous requests each container instance handles. Unlike traditional autoscaling driven by CPU or memory metrics, Cloud Run adds instances when:

  1. All existing instances are at capacity (concurrency limit reached)
  2. New requests arrive and begin to queue
  3. More instances are needed to absorb the queued traffic

The key insight: concurrency is your primary cost lever. Higher concurrency = fewer instances = lower cost, but only if your application can handle it without degrading latency.
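
A quick way to reason about it (a back-of-the-envelope sketch with illustrative numbers, not a measured workload): by Little's law, in-flight requests ≈ traffic * latency, and concurrency then determines the instance count.

Traffic: 500 req/s
Average request latency: 0.2 s
In-flight requests: 500 * 0.2 = 100
At concurrency 40: 100 / 40 = 2.5 → ~3 instances
At concurrency 4: 100 / 4 = 25 instances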


Settings that Matter

1. Concurrency: The Primary Cost Control

Concurrency determines how many requests each instance handles simultaneously. This is your biggest cost optimization opportunity.

Starting Points:

  • I/O-bound services (APIs, web servers): Start with 40-80 concurrent requests
  • CPU-bound services (ML inference, data processing): Start with 4-8 concurrent requests
  • Mixed workloads: Start with 20-40 and tune based on metrics

Why it matters:

  • Concurrency of 1: every request gets its own instance (expensive, but fast and isolated)
  • Concurrency of 1000 (the maximum): each instance absorbs large bursts alone (cheap, but requests may queue behind slow ones)

Real-world example:

# High-traffic API (I/O-bound, database queries)
gcloud run deploy api \
  --image=gcr.io/my-project/api:latest \
  --concurrency=80 \
  --cpu=2 \
  --memory=2Gi

# ML inference service (CPU-bound, model predictions)
gcloud run deploy ml-service \
  --image=gcr.io/my-project/ml-service:latest \
  --concurrency=4 \
  --cpu=4 \
  --memory=8Gi

Tuning strategy:

  1. Start conservative (lower concurrency)
  2. Monitor p95/p99 latency during peak traffic
  3. Gradually increase concurrency until latency degrades
  4. Back off 20% from the breaking point
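
Before the first change, record the current value as a baseline. A minimal sketch (service name and region follow this guide's other examples):

gcloud run services describe api \
  --region=europe-west3 \
  --format="value(spec.template.spec.containerConcurrency)"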

2. Min Instances: Eliminating Cold Starts

Min instances keep containers warm, eliminating cold start latency for critical paths.

When to use:

  • User-facing APIs with strict latency SLAs (< 200ms p95)
  • High-traffic endpoints that can’t tolerate cold starts
  • Business-critical services during peak hours

Cost trade-off:

  • Min instances = 0: Pay only for requests, but cold starts on first request
  • Min instances = 5: Always paying for 5 instances, but zero cold starts
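
To put rough numbers on this trade-off, estimate the always-on cost before committing. A symbolic sketch (cpu_rate and mem_rate are placeholders; check current Cloud Run pricing, which bills idle warm instances at reduced rates in some configurations):

idle cost / month ≈ min_instances * (vCPUs * cpu_rate + GiB * mem_rate) * 2,592,000 s (≈ 30 days)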

Practical guidance:

  • Critical APIs: 1-3 min instances
  • Background jobs: 0 min instances (cold starts acceptable)
  • Peak hours only: Use Cloud Scheduler to adjust min instances by time

Example: Business hours only

# During business hours (9 AM - 6 PM)
gcloud run services update api \
  --min-instances=3 \
  --region=europe-west3

# After hours (via Cloud Scheduler)
gcloud run services update api \
  --min-instances=0 \
  --region=europe-west3

Monitoring cold starts:

  • Track run.googleapis.com/container/startup_latencies metric
  • Alert if cold start p95 > 1 second
  • Use Cloud Run’s “startup CPU boost” to reduce cold start time

3. Max Instances: Cost Protection

Max instances caps your maximum spend and protects downstream dependencies.

Why it matters:

  • Cost control: Prevents runaway scaling during traffic spikes
  • Dependency protection: Protects databases, APIs from overload
  • Budget guardrails: Hard limit on concurrent instances

Setting max instances:

  • Start from the instance count your peak traffic implies: peak req/s / concurrency per instance
  • Consider downstream limits: database connection pools, API rate limits
  • Add a 50-100% buffer for traffic spikes

Example calculation:

Peak traffic: 1000 req/s
Concurrency: 40 req/instance
Required instances: 1000 / 40 = 25 instances
Max instances: 25 * 1.5 = 37.5 → round up to 40

gcloud run deploy api \
  --max-instances=40 \
  --concurrency=40

What happens when max is reached:

  • New requests return 429 Too Many Requests
  • Consider implementing exponential backoff in clients
  • Monitor run.googleapis.com/request_count to detect saturation
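
For clients, here is a hedged sketch of that backoff pattern in bash with curl (the URL is a placeholder; production clients should also cap total wait time and add jitter):

# Retry on 429 with exponential backoff: waits 2s, 4s, 8s, 16s, 32s
url="https://api-PLACEHOLDER.a.run.app/endpoint"
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  [ "$status" != "429" ] && break  # stop on success or a non-throttling error
  sleep $((2 ** attempt))
done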

4. Startup CPU Boost: Faster Cold Starts

Startup CPU boost allocates extra CPU during container startup, reducing cold start time.

When to enable:

  • Cold starts > 500ms
  • Applications with heavy initialization (imports, model loading)
  • Services with strict latency SLAs

Cost impact:

  • Only charged during startup (typically 1-5 seconds)
  • Minimal cost increase, significant latency improvement

Enable it:

gcloud run deploy api \
  --cpu-boost \
  --cpu=2 \
  --memory=2Gi

Expected improvement:

  • Without boost: 2-5 second cold starts
  • With boost: 0.5-2 second cold starts
  • Warm instances: < 100ms (no cold start)

Complete Deployment Example

Here’s a production-ready configuration for a typical API service:

gcloud run deploy api \
  --image=gcr.io/my-project/api:latest \
  --concurrency=40 \
  --min-instances=2 \
  --max-instances=50 \
  --cpu=2 \
  --memory=2Gi \
  --cpu-boost \
  --timeout=300 \
  --region=europe-west3 \
  --allow-unauthenticated

Configuration rationale:

  • Concurrency 40: I/O-bound API, handles database queries efficiently
  • Min instances 2: Eliminates cold starts for critical user-facing API
  • Max instances 50: Handles 2000 req/s peak (50 * 40 = 2000)
  • CPU boost: Reduces cold start from 3s to 1s
  • 2 CPU, 2Gi memory: Sufficient for 40 concurrent requests with headroom

Measuring and Iterating

Key Metrics to Track

  1. Request latency (p50, p95, p99)
    • Monitor: run.googleapis.com/request_latencies
    • Alert if p95 > 250ms (adjust based on your SLA)
  2. Concurrency utilization
    • Monitor: run.googleapis.com/container/max_request_concurrencies
    • Target: 70-80% of the concurrency limit (leaves headroom for spikes)
  3. Cold start frequency
    • Monitor: run.googleapis.com/container/startup_latencies
    • Track: % of requests hitting cold starts
    • Goal: < 1% of requests (with min instances)
  4. Instance count
    • Monitor: run.googleapis.com/container/instance_count
    • Correlate with traffic patterns and cost
  5. CPU and memory utilization
    • Monitor: run.googleapis.com/container/cpu/utilizations
    • Monitor: run.googleapis.com/container/memory/utilizations
    • Target: < 70% average (allows burst capacity)
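
All of these metrics can also be pulled programmatically from the Cloud Monitoring API, not just viewed in Metrics Explorer. A minimal sketch (PROJECT_ID and the time window are placeholders; the caller needs permission to list time series):

curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'filter=metric.type="run.googleapis.com/request_latencies"' \
  --data-urlencode 'interval.startTime=2025-01-01T00:00:00Z' \
  --data-urlencode 'interval.endTime=2025-01-01T01:00:00Z' \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/timeSeries"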

Iterative Tuning Process

Week 1: Baseline

# Start conservative
gcloud run services update api \
  --concurrency=20 \
  --min-instances=0 \
  --max-instances=20

Week 2: Optimize concurrency

  • Increase concurrency by 10 each day
  • Monitor latency degradation
  • Stop when p95 latency increases > 20%
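
The daily step itself is a one-liner; illustratively, moving from the Week 1 baseline of 20 to 30:

gcloud run services update api \
  --concurrency=30 \
  --region=europe-west3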

Week 3: Add min instances

  • Set min instances = 1-2 for critical paths
  • Measure cold start reduction
  • Validate cost increase is acceptable

Week 4: Fine-tune max instances

  • Analyze peak traffic patterns
  • Set max = 2x observed peak
  • Add alerts for 429 responses
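
Cloud Run request logs carry the HTTP status, so recent 429s can be listed directly while you calibrate those alerts (a sketch; adjust the window to taste):

gcloud logging read \
  'resource.type="cloud_run_revision" AND httpRequest.status=429' \
  --limit=20 \
  --freshness=1d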

Common Pitfalls and Solutions

Pitfall 1: Concurrency Too High

Symptoms:

  • High p95/p99 latency
  • Request timeouts
  • Memory pressure

Solution:

# Reduce concurrency
gcloud run services update api \
  --concurrency=20  # Down from 80

Pitfall 2: Min Instances Too High

Symptoms:

  • High idle costs
  • Paying for unused capacity

Solution:

  • Use Cloud Scheduler to adjust min instances by time
  • Set min = 0 for non-critical services
  • Monitor actual traffic patterns

Pitfall 3: Max Instances Too Low

Symptoms:

  • 429 errors during traffic spikes
  • Users experiencing failures

Solution:

# Increase max instances
gcloud run services update api \
  --max-instances=100  # Up from 50

Pitfall 4: Ignoring Cold Starts

Symptoms:

  • Sporadic high latency (first request after idle)
  • User complaints about slow responses

Solution:

  • Enable CPU boost
  • Set min instances = 1-2
  • Schedule a periodic warm-up request so an instance stays alive (sketch below)
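
A sketch of that warm-up ping as a Cloud Scheduler job (the URL path and service account are placeholders, and min instances remains the more reliable fix):

gcloud scheduler jobs create http keep-warm \
  --schedule="*/5 * * * *" \
  --uri="https://api-PLACEHOLDER.a.run.app/healthz" \
  --http-method=GET \
  --oidc-service-account-email=warmup@my-project.iam.gserviceaccount.com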

Advanced: Time-Based Scaling

For services with predictable traffic patterns, adjust min instances by time of day.

One caveat on mechanics: in the Cloud Run Admin API, min instances lives in the autoscaling.knative.dev/minScale annotation on the revision template rather than in spec.template.spec, so a Cloud Scheduler HTTP job that PATCHes the service with a static JSON body is fragile. A more robust pattern is to have Cloud Scheduler trigger a small runner (for example, a Cloud Functions function or a Cloud Build job using a service account with roles/run.admin) that executes the scaling commands:

# Morning ramp-up (8 AM, cron: 0 8 * * *)
gcloud run services update api \
  --min-instances=5 \
  --region=europe-west3

# Evening scale-down (8 PM, cron: 0 20 * * *)
gcloud run services update api \
  --min-instances=1 \
  --region=europe-west3

Cost Optimization Checklist

  • Concurrency tuned: 70-80% utilization during peak
  • Min instances optimized: Only where cold starts hurt
  • Max instances set: Based on actual peak traffic + 50% buffer
  • CPU boost enabled: For services with > 500ms cold starts
  • Time-based scaling: Reduce min instances during off-hours
  • Monitoring alerts: Set up for latency, saturation, cold starts
  • Cost tracking: Monitor spend per service, per endpoint

Expected Outcomes

With proper tuning, you should see:

  • 20-40% cost reduction (from optimized concurrency)
  • < 1% cold start rate (from min instances)
  • p95 latency < 250ms (from proper concurrency)
  • No 429 errors at normal peak traffic (from appropriate max instances)
  • Predictable costs (from max instance caps)

Outcome: Lower cost with fewer tail-latency surprises, and no code rewrite required.


Next Steps

Want the dashboard/runbook templates?
