DR. ATABAK KH

Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient cloud platforms.

Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator

Problem: Either you pay too much (min instances too high) or you get cold-start tail latency (min instances too low). Balance it.

Cloud Run’s serverless model is powerful, but misconfigured concurrency and instance settings can lead to either excessive costs or poor user experience. This guide walks through practical tuning strategies based on real production workloads.


Understanding Cloud Run’s Scaling Model

Cloud Run scales instances based on request concurrency: the number of simultaneous requests each container instance handles. Unlike traditional autoscaling driven by CPU or memory metrics, Cloud Run adds instances when:

  1. All existing instances are at capacity (concurrency limit reached)
  2. New requests arrive and begin to queue
  3. More instances are needed to absorb the queued traffic

The key insight: concurrency is your primary cost lever. Higher concurrency = fewer instances = lower cost, but only if your application can handle it without degrading latency.
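
A quick way to reason about it (a back-of-the-envelope sketch with illustrative numbers, not a measured workload): by Little's law, in-flight requests ≈ traffic * latency, and concurrency then determines the instance count.

Traffic: 500 req/s
Average request latency: 0.2 s
In-flight requests: 500 * 0.2 = 100
At concurrency 40: 100 / 40 = 2.5 → ~3 instances
At concurrency 4: 100 / 4 = 25 instances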


Settings that Matter

1. Concurrency: The Primary Cost Control

Concurrency determines how many requests each instance handles simultaneously. This is your biggest cost optimization opportunity.

Starting Points:

  • I/O-bound services (APIs, web servers): Start with 40-80 concurrent requests
  • CPU-bound services (ML inference, data processing): Start with 4-8 concurrent requests
  • Mixed workloads: Start with 20-40 and tune based on metrics

Why it matters:

  • Concurrency of 1: every request gets its own instance (expensive, but fast and isolated)
  • Concurrency of 1000 (the maximum): each instance absorbs large bursts alone (cheap, but requests may queue behind slow ones)

Real-world example:

# High-traffic API (I/O-bound, database queries)
gcloud run deploy api \
  --image=gcr.io/my-project/api:latest \
  --concurrency=80 \
  --cpu=2 \
  --memory=2Gi

# ML inference service (CPU-bound, model predictions)
gcloud run deploy ml-service \
  --image=gcr.io/my-project/ml-service:latest \
  --concurrency=4 \
  --cpu=4 \
  --memory=8Gi

Tuning strategy:

  1. Start conservative (lower concurrency)
  2. Monitor p95/p99 latency during peak traffic
  3. Gradually increase concurrency until latency degrades
  4. Back off 20% from the breaking point
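
Before the first change, record the current value as a baseline. A minimal sketch (service name and region follow this guide's other examples):

gcloud run services describe api \
  --region=europe-west3 \
  --format="value(spec.template.spec.containerConcurrency)"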

2. Min Instances: Eliminating Cold Starts

Min instances keep containers warm, eliminating cold start latency for critical paths.

When to use:

  • User-facing APIs with strict latency SLAs (< 200ms p95)
  • High-traffic endpoints that can’t tolerate cold starts
  • Business-critical services during peak hours

Cost trade-off:

  • Min instances = 0: Pay only for requests, but cold starts on first request
  • Min instances = 5: Always paying for 5 instances, but zero cold starts
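
To put rough numbers on this trade-off, estimate the always-on cost before committing. A symbolic sketch (cpu_rate and mem_rate are placeholders; check current Cloud Run pricing, which bills idle warm instances at reduced rates in some configurations):

idle cost / month ≈ min_instances * (vCPUs * cpu_rate + GiB * mem_rate) * 2,592,000 s (≈ 30 days)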

Practical guidance:

  • Critical APIs: 1-3 min instances
  • Background jobs: 0 min instances (cold starts acceptable)
  • Peak hours only: Use Cloud Scheduler to adjust min instances by time

Example: Business hours only

# During business hours (9 AM - 6 PM)
gcloud run services update api \
  --min-instances=3 \
  --region=europe-west3

# After hours (via Cloud Scheduler)
gcloud run services update api \
  --min-instances=0 \
  --region=europe-west3

Monitoring cold starts:

  • Track run.googleapis.com/container/startup_latencies metric
  • Alert if cold start p95 > 1 second
  • Use Cloud Run’s “startup CPU boost” to reduce cold start time

3. Max Instances: Cost Protection

Max instances caps your maximum spend and protects downstream dependencies.

Why it matters:

  • Cost control: Prevents runaway scaling during traffic spikes
  • Dependency protection: Protects databases, APIs from overload
  • Budget guardrails: Hard limit on concurrent instances

Setting max instances:

  • Start from the instance count your peak traffic implies: peak req/s / concurrency per instance
  • Consider downstream limits: database connection pools, API rate limits
  • Add a 50-100% buffer for traffic spikes

Example calculation:

Peak traffic: 1000 req/s
Concurrency: 40 req/instance
Required instances: 1000 / 40 = 25 instances
Max instances: 25 * 1.5 = 37.5 → round up to 40

gcloud run deploy api \
  --max-instances=40 \
  --concurrency=40

What happens when max is reached:

  • New requests return 429 Too Many Requests
  • Consider implementing exponential backoff in clients
  • Monitor run.googleapis.com/request_count to detect saturation
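
For clients, here is a hedged sketch of that backoff pattern in bash with curl (the URL is a placeholder; production clients should also cap total wait time and add jitter):

# Retry on 429 with exponential backoff: waits 2s, 4s, 8s, 16s, 32s
url="https://api-PLACEHOLDER.a.run.app/endpoint"
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  [ "$status" != "429" ] && break  # stop on success or a non-throttling error
  sleep $((2 ** attempt))
done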

4. Startup CPU Boost: Faster Cold Starts

Startup CPU boost allocates extra CPU during container startup, reducing cold start time.

When to enable:

  • Cold starts > 500ms
  • Applications with heavy initialization (imports, model loading)
  • Services with strict latency SLAs

Cost impact:

  • Only charged during startup (typically 1-5 seconds)
  • Minimal cost increase, significant latency improvement

Enable it:

gcloud run deploy api \
  --cpu-boost \
  --cpu=2 \
  --memory=2Gi

Expected improvement:

  • Without boost: 2-5 second cold starts
  • With boost: 0.5-2 second cold starts
  • Warm instances: < 100ms (no cold start)

Complete Deployment Example

Here’s a production-ready configuration for a typical API service:

gcloud run deploy api \
  --image=gcr.io/my-project/api:latest \
  --concurrency=40 \
  --min-instances=2 \
  --max-instances=50 \
  --cpu=2 \
  --memory=2Gi \
  --cpu-boost \
  --timeout=300 \
  --region=europe-west3 \
  --allow-unauthenticated

Configuration rationale:

  • Concurrency 40: I/O-bound API, handles database queries efficiently
  • Min instances 2: Eliminates cold starts for critical user-facing API
  • Max instances 50: Handles 2000 req/s peak (50 * 40 = 2000)
  • CPU boost: Reduces cold start from 3s to 1s
  • 2 CPU, 2Gi memory: Sufficient for 40 concurrent requests with headroom

Measuring and Iterating

Key Metrics to Track

  1. Request latency (p50, p95, p99)
    • Monitor: run.googleapis.com/request_latencies
    • Alert if p95 > 250ms (adjust based on your SLA)
  2. Concurrency utilization
    • Monitor: run.googleapis.com/container/max_request_concurrencies
    • Target: 70-80% of the concurrency limit (leaves headroom for spikes)
  3. Cold start frequency
    • Monitor: run.googleapis.com/container/startup_latencies
    • Track: % of requests hitting cold starts
    • Goal: < 1% of requests (with min instances)
  4. Instance count
    • Monitor: run.googleapis.com/container/instance_count
    • Correlate with traffic patterns and cost
  5. CPU and memory utilization
    • Monitor: run.googleapis.com/container/cpu/utilizations
    • Monitor: run.googleapis.com/container/memory/utilizations
    • Target: < 70% average (allows burst capacity)
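
All of these metrics can also be pulled programmatically from the Cloud Monitoring API, not just viewed in Metrics Explorer. A minimal sketch (PROJECT_ID and the time window are placeholders; the caller needs permission to list time series):

curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'filter=metric.type="run.googleapis.com/request_latencies"' \
  --data-urlencode 'interval.startTime=2025-01-01T00:00:00Z' \
  --data-urlencode 'interval.endTime=2025-01-01T01:00:00Z' \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/timeSeries"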

Iterative Tuning Process

Week 1: Baseline

# Start conservative
gcloud run services update api \
  --concurrency=20 \
  --min-instances=0 \
  --max-instances=20

Week 2: Optimize concurrency

  • Increase concurrency by 10 each day
  • Monitor latency degradation
  • Stop when p95 latency increases > 20%
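
The daily step itself is a one-liner; illustratively, moving from the Week 1 baseline of 20 to 30:

gcloud run services update api \
  --concurrency=30 \
  --region=europe-west3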

Week 3: Add min instances

  • Set min instances = 1-2 for critical paths
  • Measure cold start reduction
  • Validate cost increase is acceptable

Week 4: Fine-tune max instances

  • Analyze peak traffic patterns
  • Set max = 2x observed peak
  • Add alerts for 429 responses
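
Cloud Run request logs carry the HTTP status, so recent 429s can be listed directly while you calibrate those alerts (a sketch; adjust the window to taste):

gcloud logging read \
  'resource.type="cloud_run_revision" AND httpRequest.status=429' \
  --limit=20 \
  --freshness=1d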

Common Pitfalls and Solutions

Pitfall 1: Concurrency Too High

Symptoms:

  • High p95/p99 latency
  • Request timeouts
  • Memory pressure

Solution:

# Reduce concurrency
gcloud run services update api \
  --concurrency=20  # Down from 80

Pitfall 2: Min Instances Too High

Symptoms:

  • High idle costs
  • Paying for unused capacity

Solution:

  • Use Cloud Scheduler to adjust min instances by time
  • Set min = 0 for non-critical services
  • Monitor actual traffic patterns

Pitfall 3: Max Instances Too Low

Symptoms:

  • 429 errors during traffic spikes
  • Users experiencing failures

Solution:

# Increase max instances
gcloud run services update api \
  --max-instances=100  # Up from 50

Pitfall 4: Ignoring Cold Starts

Symptoms:

  • Sporadic high latency (first request after idle)
  • User complaints about slow responses

Solution:

  • Enable CPU boost
  • Set min instances = 1-2
  • Schedule a periodic warm-up request so an instance stays alive (sketch below)
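
A sketch of that warm-up ping as a Cloud Scheduler job (the URL path and service account are placeholders, and min instances remains the more reliable fix):

gcloud scheduler jobs create http keep-warm \
  --schedule="*/5 * * * *" \
  --uri="https://api-PLACEHOLDER.a.run.app/healthz" \
  --http-method=GET \
  --oidc-service-account-email=warmup@my-project.iam.gserviceaccount.com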

Advanced: Time-Based Scaling

For services with predictable traffic patterns, adjust min instances by time of day.

One caveat on mechanics: in the Cloud Run Admin API, min instances lives in the autoscaling.knative.dev/minScale annotation on the revision template rather than in spec.template.spec, so a Cloud Scheduler HTTP job that PATCHes the service with a static JSON body is fragile. A more robust pattern is to have Cloud Scheduler trigger a small runner (for example, a Cloud Functions function or a Cloud Build job using a service account with roles/run.admin) that executes the scaling commands:

# Morning ramp-up (8 AM, cron: 0 8 * * *)
gcloud run services update api \
  --min-instances=5 \
  --region=europe-west3

# Evening scale-down (8 PM, cron: 0 20 * * *)
gcloud run services update api \
  --min-instances=1 \
  --region=europe-west3

Cost Optimization Checklist

  • Concurrency tuned: 70-80% utilization during peak
  • Min instances optimized: Only where cold starts hurt
  • Max instances set: Based on actual peak traffic + 50% buffer
  • CPU boost enabled: For services with > 500ms cold starts
  • Time-based scaling: Reduce min instances during off-hours
  • Monitoring alerts: Set up for latency, saturation, cold starts
  • Cost tracking: Monitor spend per service, per endpoint

Expected Outcomes

With proper tuning, you should see:

  • 20-40% cost reduction (from optimized concurrency)
  • < 1% cold start rate (from min instances)
  • p95 latency < 250ms (from proper concurrency)
  • No 429 errors at normal peak traffic (from appropriate max instances)
  • Predictable costs (from max instance caps)

Outcome: Lower cost with fewer tail-latency surprises, and no code rewrite required.


Next Steps

Want the dashboard/runbook templates?
