DR. ATABAK KH

Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient cloud platforms.

Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator

Takeaway: a handful of Terraform patterns prevent surprise spend, reduce pager incidents, and make audits easy.


0) Foundations (state, versions, naming)

Remote state + locking

terraform {
  required_version = ">= 1.6.0"
  backend "gcs" {
    bucket = "tf-state-prod"
    prefix = "platform"
  }
  required_providers {
    google = { source = "hashicorp/google", version = "~> 5.30" }
  }
}

Naming & labels

locals {
  name   = "api"
  env    = "prod"
  labels = { owner = "platform", env = local.env, costcenter = "fintech" }
}

1) Default labels & cost centers (propagate everywhere)

variable "labels" {
  type    = map(string)
  default = { owner = "platform", env = "prod", costcenter = "fintech" }
}

resource "google_compute_instance" "api" {
  name   = "api-prod-1"
  # ...
  labels = var.labels
}

Make labels a module input and require it for all resources. It unlocks per-team cost allocation and better budgets.
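
A plan-time guard is cheap to add. This sketch (the required keys are the ones used throughout this post) fails `terraform plan` if a caller omits a mandatory label:

```hcl
# Sketch: reject plans whose labels miss a required key.
variable "labels" {
  type = map(string)
  validation {
    condition     = alltrue([for k in ["owner", "env", "costcenter"] : contains(keys(var.labels), k)])
    error_message = "labels must include owner, env, and costcenter."
  }
}
```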


2) Budgets & alerts (GCP) - tie to Pub/Sub / email / Slack

resource "google_billing_budget" "prod" {
  billing_account = var.billing_account
  amount {
    specified_amount {
      currency_code = "EUR"
      units         = "20000" # €20k/month
    }
  }
  threshold_rules { threshold_percent = 0.5 }
  threshold_rules { threshold_percent = 0.9 }
  all_updates_rule {
    pubsub_topic           = google_pubsub_topic.budget.id
    schema_version         = "1.0"
    monitoring_notification_channels = [google_monitoring_notification_channel.email.id]
  }
}

Route Pub/Sub to Cloud Functions/Run -> Slack. Add per-label budgets for big spenders (datasets, projects).
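
A per-label budget can look like the sketch below; the `owner = "data"` label and the amount are illustrative, and `budget_filter` scopes the budget to spend carrying that label:

```hcl
# Illustrative per-team budget scoped by label.
resource "google_billing_budget" "team_data" {
  billing_account = var.billing_account
  display_name    = "data-team-monthly"
  budget_filter {
    labels = { "owner" = "data" } # only labeled spend counts toward this budget
  }
  amount {
    specified_amount {
      currency_code = "EUR"
      units         = "5000"
    }
  }
  threshold_rules { threshold_percent = 0.8 }
}
```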


3) Storage lifecycle (BigQuery/GCS) - TTLs and partition discipline

BigQuery dataset defaults

resource "google_bigquery_dataset" "logs" {
  dataset_id                  = "logs"
  default_table_expiration_ms = 30 * 24 * 60 * 60 * 1000  # 30d
  default_partition_expiration_ms = 30 * 24 * 60 * 60 * 1000
  labels = var.labels
  # CMEK default for every table in the dataset; partition filters are enforced per table (next snippet)
  default_encryption_configuration { kms_key_name = google_kms_crypto_key.bq.id }
}

Require partition filter at table level

resource "google_bigquery_table" "logs" {
  dataset_id = google_bigquery_dataset.logs.dataset_id
  table_id   = "http_access"
  time_partitioning { type = "DAY" }
  require_partition_filter = true
  labels = var.labels
}

GCS lifecycle

resource "google_storage_bucket" "logs" {
  name     = "logs-${local.env}"
  location = "EU"
  versioning { enabled = true }
  lifecycle_rule {
    condition { age = 30 }
    action    { type = "Delete" }
  }
  labels = var.labels
}
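
One gap worth closing: with versioning enabled, an age-based Delete rule turns live objects into noncurrent versions that keep billing. A sketch of the two extra rules (bucket name is illustrative) that demote cold data and expire old versions:

```hcl
# Sketch: tier down before deleting, and clean up noncurrent versions.
resource "google_storage_bucket" "logs_tiered" {
  name     = "logs-tiered-${local.env}"
  location = "EU"
  versioning { enabled = true }
  lifecycle_rule {
    condition { age = 7 }
    action {
      type          = "SetStorageClass"
      storage_class = "COLDLINE"
    }
  }
  lifecycle_rule {
    condition { days_since_noncurrent_time = 14 } # expire superseded versions
    action { type = "Delete" }
  }
}
```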

4) Org policies & IAM boundaries - stop expensive mistakes up front

resource "google_org_policy_policy" "disable_serial_ports" {
  name   = "organizations/${var.org_id}/policies/compute.disableSerialPortAccess"
  parent = "organizations/${var.org_id}"
  spec {
    rules { enforce = "TRUE" } # google_org_policy_policy expects the string "TRUE"/"FALSE"
  
  }
}

resource "google_org_policy_policy" "vm_external_ip_denied" {
  name   = "organizations/${var.org_id}/policies/compute.vmExternalIpAccess"
  parent = "organizations/${var.org_id}"
  spec {
    rules { deny_all = "TRUE" } # private-by-default; grant exceptions per project
  }
}

Add policies for uniform bucket-level access, required CMEK, restricted resource locations, and allowed machine types to prevent oversized SKUs.
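
As one example from that list, the uniform bucket-level access constraint follows the same shape as the policies above:

```hcl
# Sketch: force uniform bucket-level access org-wide.
resource "google_org_policy_policy" "uniform_bucket_access" {
  name   = "organizations/${var.org_id}/policies/storage.uniformBucketLevelAccess"
  parent = "organizations/${var.org_id}"
  spec {
    rules { enforce = "TRUE" }
  }
}
```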


5) CMEK (customer-managed keys) - encryption defaults

resource "google_kms_key_ring" "platform" {
  name     = "platform"
  location = "europe-west3"
}
resource "google_kms_crypto_key" "bq" {
  name            = "bq-default"
  key_ring        = google_kms_key_ring.platform.id
  rotation_period = "7776000s" # 90 days
}
# Use in BigQuery/GCS/Disks as defaults (see dataset example above)
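
One step the dataset example above silently depends on: the BigQuery service agent must be able to use the key, or CMEK datasets fail to create. A sketch, where var.project_number (the numeric project ID) is an assumed input:

```hcl
# Grant the BigQuery service agent encrypt/decrypt on the CMEK key.
resource "google_kms_crypto_key_iam_member" "bq_cmek" {
  crypto_key_id = google_kms_crypto_key.bq.id
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "serviceAccount:bq-${var.project_number}@bigquery-encryption.iam.gserviceaccount.com"
}
```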

6) p95-based autoscaling (K8s) - scale on user pain, not CPU

Export http_p95_latency_ms (OTel/Datadog -> Prom -> custom metric). Drive HPA off latency/queue depth, not CPU.

HPA sketch

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 2
  maxReplicas: 20
  behavior: { scaleDown: { stabilizationWindowSeconds: 300 } }
  metrics:
  - type: Pods
    pods:
      metric: { name: http_p95_latency_ms }
      target: { type: AverageValue, averageValue: "250" } # ms SLO

KEDA for queue lag

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: worker-queue }
spec:
  scaleTargetRef: { kind: Deployment, name: worker }
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: queue_lag
      query: sum(queue_lag)   # PromQL the scaler evaluates (required)
      threshold: "1000"

You can manage HPA/KEDA objects via the Terraform kubernetes provider’s kubernetes_manifest resource if you prefer IaC ownership.
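
A minimal sketch of that ownership pattern, assuming the HPA manifest above is saved as hpa-api.yaml in the module:

```hcl
# Sketch: own a Kubernetes object through Terraform state.
resource "kubernetes_manifest" "api_hpa" {
  manifest = yamldecode(file("${path.module}/hpa-api.yaml"))
}
```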


7) Kill-switches for retry storms (operational safety)

  • Config flag to disable retries or lower concurrency.
  • Wire as a ConfigMap/Secret that Terraform can set for incident mode.
apiVersion: v1
kind: ConfigMap
metadata: { name: ops-flags }
data:
  DISABLE_RETRY: "false"
  MAX_CONCURRENCY: "8"

Your worker reads these at runtime; toggling reduces cascading failures and cost runaways during incidents.
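
To make the flags genuinely Terraform-settable, the ConfigMap itself can live in state; var.incident_mode is an assumed boolean input and the values are illustrative:

```hcl
# Sketch: flip incident mode with a single variable change and apply.
resource "kubernetes_config_map" "ops_flags" {
  metadata { name = "ops-flags" }
  data = {
    DISABLE_RETRY   = var.incident_mode ? "true" : "false"
    MAX_CONCURRENCY = var.incident_mode ? "2" : "8"
  }
}
```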


8) Scheduled savings (non-prod off-hours)

Turn down non-prod nights/weekends.

resource "google_cloud_scheduler_job" "stop_nonprod" {
  name        = "stop-nonprod-evening"
  schedule    = "0 20 * * 1-5" # 20:00 Mon-Fri
  time_zone   = "Europe/Berlin" # scheduler defaults to UTC; pick your own zone
  http_target {
    uri         = google_cloud_run_service.ops_hook.status[0].url
    http_method = "POST"
    oidc_token { service_account_email = google_service_account.ops.email }
    body        = base64encode("{\"action\":\"scale_down\"}")
  }
}
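
The matching morning scale-up is the mirror image, against the same (illustrative) ops hook:

```hcl
resource "google_cloud_scheduler_job" "start_nonprod" {
  name     = "start-nonprod-morning"
  schedule = "0 7 * * 1-5" # 07:00 Mon-Fri
  http_target {
    uri         = google_cloud_run_service.ops_hook.status[0].url
    http_method = "POST"
    oidc_token { service_account_email = google_service_account.ops.email }
    body        = base64encode("{\"action\":\"scale_up\"}")
  }
}
```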

9) Monitoring policies (SLO & cost)

Spend anomaly alert (Monitoring)

resource "google_monitoring_alert_policy" "cost_anomaly" {
  display_name = "Cost Anomaly Alert"
  combiner     = "OR" # required even with a single condition
  conditions {
    display_name = "Daily cost > baseline * 1.5"
    condition_monitoring_query_language {
      # Illustrative MQL: export billing data into Monitoring first,
      # then compare today's spend against a trailing baseline.
      query    = "fetch billing | metric 'billing.googleapis.com/daily_cost' | condition ratio > 1.5"
      duration = "1800s"
      trigger { count = 1 }
    }
  }
  notification_channels = [google_monitoring_notification_channel.email.id]
}

Pair this with SLO burn-rate alerts (availability & latency) so you’re paging on user pain and watching spend.
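
An SLO to burn against can itself be Terraform-managed. A sketch, assuming a custom service and load-balancer request metrics (filters are illustrative):

```hcl
# Sketch: 99.9% availability SLO over a 28-day rolling window.
resource "google_monitoring_slo" "api_availability" {
  service             = google_monitoring_custom_service.api.service_id
  display_name        = "api-availability-99.9"
  goal                = 0.999
  rolling_period_days = 28
  request_based_sli {
    good_total_ratio {
      good_service_filter  = "metric.type=\"loadbalancing.googleapis.com/https/request_count\" metric.labels.response_code_class=\"200\""
      total_service_filter = "metric.type=\"loadbalancing.googleapis.com/https/request_count\""
    }
  }
}
```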


10) Policy-as-code in CI (prevent bad merges)

  • terraform fmt -check / validate / plan
  • Infracost to show € delta in PRs
  • OPA/Conftest/Sentinel to enforce policy (no external IPs, TTLs required, labels required)

GitHub Actions sketch

- name: Terraform Validate
  run: terraform validate

- name: Infracost
  uses: infracost/actions/setup@v2
# ... compute & post diff as PR comment

- name: Conftest (OPA)
  run: conftest test policy/ --input terraform.plan.json
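
The terraform.plan.json that Conftest consumes has to be produced earlier in the workflow; a sketch of the missing step (step name is illustrative):

```yaml
- name: Terraform Plan (JSON)
  run: |
    terraform plan -out=tf.plan
    terraform show -json tf.plan > terraform.plan.json
```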

11) Drift detection & prevent_destroy

Detect drift regularly, and protect critical resources from accidental deletion.

resource "google_bigquery_dataset" "core" {
  dataset_id = "core"
  # ...
  lifecycle {
    prevent_destroy = true
    ignore_changes  = [labels] # if labels mutate out-of-band
  }
}
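
"Regularly" can be a scheduled CI job; with -detailed-exitcode, terraform plan exits 2 when the plan is non-empty, i.e. something drifted. A sketch (workflow name and cron are illustrative):

```yaml
# Sketch: nightly drift check; a non-zero exit fails the job and pings the team.
name: drift-detect
on:
  schedule:
    - cron: "0 5 * * *"
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform init -input=false
      - run: terraform plan -detailed-exitcode -input=false
```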

12) Environment separation (blast radius)

  • Separate projects per env (proj-nonprod, proj-prod).
  • Separate state per env/team.
  • Use service perimeters (VPC-SC) for sensitive data projects.

13) Optional: reservations & caps (BigQuery, GKE)

  • BigQuery Reservations for predictable cost; assign by label/project.
  • GKE Autopilot with max nodes per node pool; cap cluster scale to avoid budget blow-outs.
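
A reservation plus assignment can be sketched like this; slot count, location, and the project path are illustrative:

```hcl
# Sketch: fixed slot capacity for predictable query spend, assigned to one project.
resource "google_bigquery_reservation" "batch" {
  name          = "batch"
  location      = "EU"
  slot_capacity = 100
}

resource "google_bigquery_reservation_assignment" "batch_project" {
  assignee    = "projects/${var.project_id}"
  job_type    = "QUERY"
  reservation = google_bigquery_reservation.batch.id
}
```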

14) Documentation block (per module)

Each module README should state:

  • Labels required, CMEK used, Org policies assumed
  • SLO/Alert hooks (what metrics/alerts are provided)
  • Cost levers (TTL, partition filters, SKU size)

Recap outcome

  • 20–30% cost reduction via TTLs, partition filters, right-sizing, off-hours.
  • Fewer incidents with p95-driven scaling, retry kill-switches, and org policies.
  • Audit ease thanks to labels, CMEK, budgets, and policy-as-code.

Minimal checklist (print & enforce)

  • Remote state + provider versions pinned
  • Labels enforced on all resources
  • Budgets + cost anomaly alerts wired
  • BigQuery require_partition_filter + TTLs
  • Org policies: external IPs off (default), CMEK required
  • p95/queue-based autoscaling; HPA/KEDA in IaC
  • Retry kill-switches; runbook link from alerts
  • Non-prod schedules; drift detection; prevent_destroy
  • CI: validate/plan + Infracost + OPA policy checks

This is a personal blog. The views, thoughts, and opinions expressed here are my own and do not represent, reflect, or constitute the views, policies, or positions of any employer, university, client, or organization I am associated with or have been associated with.

© Copyright 2017-2025