DR. ATABAK KH

Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient cloud platforms.

Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator

Takeaway: a handful of Terraform patterns prevent surprise spend, reduce pager incidents, and make audits easy.


0) Foundations (state, versions, naming)

Remote state + locking

terraform {
  required_version = ">= 1.6.0"
  backend "gcs" {
    bucket = "tf-state-prod"
    prefix = "platform"
  }
  required_providers {
    google = { source = "hashicorp/google", version = "~> 5.30" }
  }
}

Naming & labels

locals {
  name   = "api"
  env    = "prod"
  labels = { owner = "platform", env = local.env, costcenter = "fintech" }
}

1) Default labels & cost centers (propagate everywhere)

variable "labels" {
  type    = map(string)
  default = { owner = "platform", env = "prod", costcenter = "fintech" }
}

resource "google_compute_instance" "api" {
  name   = "api-prod-1"
  # ...
  labels = var.labels
}

Make labels a module input and require it for all resources. It unlocks per-team cost allocation and better budgets.
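
A plan-time guard is cheap to add. This sketch (the required keys are the ones used throughout this post) fails `terraform plan` if a caller omits a mandatory label:

```hcl
# Sketch: reject plans whose labels miss a required key.
variable "labels" {
  type = map(string)
  validation {
    condition     = alltrue([for k in ["owner", "env", "costcenter"] : contains(keys(var.labels), k)])
    error_message = "labels must include owner, env, and costcenter."
  }
}
```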


2) Budgets & alerts (GCP) - tie to Pub/Sub / email / Slack

resource "google_billing_budget" "prod" {
  billing_account = var.billing_account
  amount {
    specified_amount {
      currency_code = "EUR"
      units         = "20000" # €20k/month
    }
  }
  threshold_rules { threshold_percent = 0.5 }
  threshold_rules { threshold_percent = 0.9 }
  all_updates_rule {
    pubsub_topic           = google_pubsub_topic.budget.id
    schema_version         = "1.0"
    monitoring_notification_channels = [google_monitoring_notification_channel.email.id]
  }
}

Route Pub/Sub to Cloud Functions/Run -> Slack. Add per-label budgets for big spenders (datasets, projects).
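
A per-label budget can look like the sketch below; the `owner = "data"` label and the amount are illustrative, and `budget_filter` scopes the budget to spend carrying that label:

```hcl
# Illustrative per-team budget scoped by label.
resource "google_billing_budget" "team_data" {
  billing_account = var.billing_account
  display_name    = "data-team-monthly"
  budget_filter {
    labels = { "owner" = "data" } # only labeled spend counts toward this budget
  }
  amount {
    specified_amount {
      currency_code = "EUR"
      units         = "5000"
    }
  }
  threshold_rules { threshold_percent = 0.8 }
}
```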


3) Storage lifecycle (BigQuery/GCS) - TTLs and partition discipline

BigQuery dataset defaults

resource "google_bigquery_dataset" "logs" {
  dataset_id                  = "logs"
  default_table_expiration_ms = 30 * 24 * 60 * 60 * 1000  # 30d
  default_partition_expiration_ms = 30 * 24 * 60 * 60 * 1000
  labels = var.labels
  # CMEK default for every table in the dataset; partition filters are enforced per table (next snippet)
  default_encryption_configuration { kms_key_name = google_kms_crypto_key.bq.id }
}

Require partition filter at table level

resource "google_bigquery_table" "logs" {
  dataset_id = google_bigquery_dataset.logs.dataset_id
  table_id   = "http_access"
  time_partitioning { type = "DAY" }
  require_partition_filter = true
  labels = var.labels
}

GCS lifecycle

resource "google_storage_bucket" "logs" {
  name     = "logs-${local.env}"
  location = "EU"
  versioning { enabled = true }
  lifecycle_rule {
    condition { age = 30 }
    action    { type = "Delete" }
  }
  labels = var.labels
}
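
One gap worth closing: with versioning enabled, an age-based Delete rule turns live objects into noncurrent versions that keep billing. A sketch of the two extra rules (bucket name is illustrative) that demote cold data and expire old versions:

```hcl
# Sketch: tier down before deleting, and clean up noncurrent versions.
resource "google_storage_bucket" "logs_tiered" {
  name     = "logs-tiered-${local.env}"
  location = "EU"
  versioning { enabled = true }
  lifecycle_rule {
    condition { age = 7 }
    action {
      type          = "SetStorageClass"
      storage_class = "COLDLINE"
    }
  }
  lifecycle_rule {
    condition { days_since_noncurrent_time = 14 } # expire superseded versions
    action { type = "Delete" }
  }
}
```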

4) Org policies & IAM boundaries - stop expensive mistakes up front

resource "google_org_policy_policy" "disable_serial_ports" {
  name   = "organizations/${var.org_id}/policies/compute.disableSerialPortAccess"
  parent = "organizations/${var.org_id}"
  spec {
    rules { enforce = "TRUE" } # google_org_policy_policy expects the string "TRUE"/"FALSE"
  
  }
}

resource "google_org_policy_policy" "vm_external_ip_denied" {
  name   = "organizations/${var.org_id}/policies/compute.vmExternalIpAccess"
  parent = "organizations/${var.org_id}"
  spec {
    rules { deny_all = "TRUE" } # private-by-default; grant exceptions per project
  }
}

Add policies for uniform bucket-level access, required CMEK, restricted resource locations, and allowed machine types to prevent oversized SKUs.
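
As one example from that list, the uniform bucket-level access constraint follows the same shape as the policies above:

```hcl
# Sketch: force uniform bucket-level access org-wide.
resource "google_org_policy_policy" "uniform_bucket_access" {
  name   = "organizations/${var.org_id}/policies/storage.uniformBucketLevelAccess"
  parent = "organizations/${var.org_id}"
  spec {
    rules { enforce = "TRUE" }
  }
}
```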


5) CMEK (customer-managed keys) - encryption defaults

resource "google_kms_key_ring" "platform" {
  name     = "platform"
  location = "europe-west3"
}
resource "google_kms_crypto_key" "bq" {
  name            = "bq-default"
  key_ring        = google_kms_key_ring.platform.id
  rotation_period = "7776000s" # 90 days
}
# Use in BigQuery/GCS/Disks as defaults (see dataset example above)
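
One step the dataset example above silently depends on: the BigQuery service agent must be able to use the key, or CMEK datasets fail to create. A sketch, where var.project_number (the numeric project ID) is an assumed input:

```hcl
# Grant the BigQuery service agent encrypt/decrypt on the CMEK key.
resource "google_kms_crypto_key_iam_member" "bq_cmek" {
  crypto_key_id = google_kms_crypto_key.bq.id
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "serviceAccount:bq-${var.project_number}@bigquery-encryption.iam.gserviceaccount.com"
}
```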

6) p95-based autoscaling (K8s) - scale on user pain, not CPU

Export http_p95_latency_ms (OTel/Datadog -> Prom -> custom metric). Drive HPA off latency/queue depth, not CPU.

HPA sketch

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 2
  maxReplicas: 20
  behavior: { scaleDown: { stabilizationWindowSeconds: 300 } }
  metrics:
  - type: Pods
    pods:
      metric: { name: http_p95_latency_ms }
      target: { type: AverageValue, averageValue: "250" } # ms SLO

KEDA for queue lag

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: worker-queue }
spec:
  scaleTargetRef: { kind: Deployment, name: worker }
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: queue_lag
      query: sum(queue_lag)   # PromQL the scaler evaluates (required)
      threshold: "1000"

You can manage HPA/KEDA objects via the Terraform kubernetes provider’s kubernetes_manifest resource if you prefer IaC ownership.
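
A minimal sketch of that ownership pattern, assuming the HPA manifest above is saved as hpa-api.yaml in the module:

```hcl
# Sketch: own a Kubernetes object through Terraform state.
resource "kubernetes_manifest" "api_hpa" {
  manifest = yamldecode(file("${path.module}/hpa-api.yaml"))
}
```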


7) Kill-switches for retry storms (operational safety)

  • Config flag to disable retries or lower concurrency.
  • Wire as a ConfigMap/Secret that Terraform can set for incident mode.
apiVersion: v1
kind: ConfigMap
metadata: { name: ops-flags }
data:
  DISABLE_RETRY: "false"
  MAX_CONCURRENCY: "8"

Your worker reads these at runtime; toggling reduces cascading failures and cost runaways during incidents.
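
To make the flags genuinely Terraform-settable, the ConfigMap itself can live in state; var.incident_mode is an assumed boolean input and the values are illustrative:

```hcl
# Sketch: flip incident mode with a single variable change and apply.
resource "kubernetes_config_map" "ops_flags" {
  metadata { name = "ops-flags" }
  data = {
    DISABLE_RETRY   = var.incident_mode ? "true" : "false"
    MAX_CONCURRENCY = var.incident_mode ? "2" : "8"
  }
}
```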


8) Scheduled savings (non-prod off-hours)

Turn down non-prod nights/weekends.

resource "google_cloud_scheduler_job" "stop_nonprod" {
  name        = "stop-nonprod-evening"
  schedule    = "0 20 * * 1-5" # 20:00 Mon-Fri
  time_zone   = "Europe/Berlin" # scheduler defaults to UTC; pick your own zone
  http_target {
    uri         = google_cloud_run_service.ops_hook.status[0].url
    http_method = "POST"
    oidc_token { service_account_email = google_service_account.ops.email }
    body        = base64encode("{\"action\":\"scale_down\"}")
  }
}
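
The matching morning scale-up is the mirror image, against the same (illustrative) ops hook:

```hcl
resource "google_cloud_scheduler_job" "start_nonprod" {
  name     = "start-nonprod-morning"
  schedule = "0 7 * * 1-5" # 07:00 Mon-Fri
  http_target {
    uri         = google_cloud_run_service.ops_hook.status[0].url
    http_method = "POST"
    oidc_token { service_account_email = google_service_account.ops.email }
    body        = base64encode("{\"action\":\"scale_up\"}")
  }
}
```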

9) Monitoring policies (SLO & cost)

Spend anomaly alert (Monitoring)

resource "google_monitoring_alert_policy" "cost_anomaly" {
  display_name = "Cost Anomaly Alert"
  combiner     = "OR" # required even with a single condition
  conditions {
    display_name = "Daily cost > baseline * 1.5"
    condition_monitoring_query_language {
      # Illustrative MQL: export billing data into Monitoring first,
      # then compare today's spend against a trailing baseline.
      query    = "fetch billing | metric 'billing.googleapis.com/daily_cost' | condition ratio > 1.5"
      duration = "1800s"
      trigger { count = 1 }
    }
  }
  notification_channels = [google_monitoring_notification_channel.email.id]
}

Pair this with SLO burn-rate alerts (availability & latency) so you’re paging on user pain and watching spend.
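
An SLO to burn against can itself be Terraform-managed. A sketch, assuming a custom service and load-balancer request metrics (filters are illustrative):

```hcl
# Sketch: 99.9% availability SLO over a 28-day rolling window.
resource "google_monitoring_slo" "api_availability" {
  service             = google_monitoring_custom_service.api.service_id
  display_name        = "api-availability-99.9"
  goal                = 0.999
  rolling_period_days = 28
  request_based_sli {
    good_total_ratio {
      good_service_filter  = "metric.type=\"loadbalancing.googleapis.com/https/request_count\" metric.labels.response_code_class=\"200\""
      total_service_filter = "metric.type=\"loadbalancing.googleapis.com/https/request_count\""
    }
  }
}
```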


10) Policy-as-code in CI (prevent bad merges)

  • terraform fmt -check / validate / plan
  • Infracost to show € delta in PRs
  • OPA/Conftest/Sentinel to enforce policy (no external IPs, TTLs required, labels required)

GitHub Actions sketch

- name: Terraform Validate
  run: terraform validate

- name: Infracost
  uses: infracost/actions/setup@v2
# ... compute & post diff as PR comment

- name: Conftest (OPA)
  run: conftest test policy/ --input terraform.plan.json
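
The terraform.plan.json that Conftest consumes has to be produced earlier in the workflow; a sketch of the missing step (step name is illustrative):

```yaml
- name: Terraform Plan (JSON)
  run: |
    terraform plan -out=tf.plan
    terraform show -json tf.plan > terraform.plan.json
```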

11) Drift detection & prevent_destroy

Detect drift regularly, and protect critical resources from accidental deletion.

resource "google_bigquery_dataset" "core" {
  dataset_id = "core"
  # ...
  lifecycle {
    prevent_destroy = true
    ignore_changes  = [labels] # if labels mutate out-of-band
  }
}
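
"Regularly" can be a scheduled CI job; with -detailed-exitcode, terraform plan exits 2 when the plan is non-empty, i.e. something drifted. A sketch (workflow name and cron are illustrative):

```yaml
# Sketch: nightly drift check; a non-zero exit fails the job and pings the team.
name: drift-detect
on:
  schedule:
    - cron: "0 5 * * *"
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform init -input=false
      - run: terraform plan -detailed-exitcode -input=false
```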

12) Environment separation (blast radius)

  • Separate projects per env (proj-nonprod, proj-prod).
  • Separate state per env/team.
  • Use service perimeters (VPC-SC) for sensitive data projects.

13) Optional: reservations & caps (BigQuery, GKE)

  • BigQuery Reservations for predictable cost; assign by label/project.
  • GKE Autopilot with max nodes per node pool; cap cluster scale to avoid budget blow-outs.
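
A reservation plus assignment can be sketched like this; slot count, location, and the project path are illustrative:

```hcl
# Sketch: fixed slot capacity for predictable query spend, assigned to one project.
resource "google_bigquery_reservation" "batch" {
  name          = "batch"
  location      = "EU"
  slot_capacity = 100
}

resource "google_bigquery_reservation_assignment" "batch_project" {
  assignee    = "projects/${var.project_id}"
  job_type    = "QUERY"
  reservation = google_bigquery_reservation.batch.id
}
```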

14) Documentation block (per module)

Each module README should state:

  • Labels required, CMEK used, Org policies assumed
  • SLO/Alert hooks (what metrics/alerts are provided)
  • Cost levers (TTL, partition filters, SKU size)

Recap outcome

  • 20–30% cost reduction via TTLs, partition filters, right-sizing, off-hours.
  • Fewer incidents with p95-driven scaling, retry kill-switches, and org policies.
  • Audit ease thanks to labels, CMEK, budgets, and policy-as-code.

Minimal checklist (print & enforce)

  • Remote state + provider versions pinned
  • Labels enforced on all resources
  • Budgets + cost anomaly alerts wired
  • BigQuery require_partition_filter + TTLs
  • Org policies: external IPs off (default), CMEK required
  • p95/queue-based autoscaling; HPA/KEDA in IaC
  • Retry kill-switches; runbook link from alerts
  • Non-prod schedules; drift detection; prevent_destroy
  • CI: validate/plan + Infracost + OPA policy checks

This is a personal blog. The views, thoughts, and opinions expressed here are my own and do not represent, reflect, or constitute the views, policies, or positions of any employer, university, client, or organization I am associated with or have been associated with.

© Copyright 2017-2025