DR. ATABAK KH
Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient cloud platforms.
Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator
Takeaway: a handful of Terraform patterns prevent surprise spend, reduce pager incidents, and make audits easy.
Remote state + locking
terraform {
  required_version = ">= 1.6.0"

  backend "gcs" {
    bucket = "tf-state-prod"
    prefix = "platform"
  }

  required_providers {
    google = { source = "hashicorp/google", version = "~> 5.30" }
  }
}
Naming & labels
locals {
  name   = "api"
  env    = "prod"
  labels = { owner = "platform", env = local.env, costcenter = "fintech" }
}

variable "labels" {
  type    = map(string)
  default = { owner = "platform", env = "prod", costcenter = "fintech" }
}

resource "google_compute_instance" "api" {
  name = "${local.name}-${local.env}-1" # "api-prod-1"
  # ...
  labels = var.labels
}
Make labels a module input and require it for all resources. It unlocks per-team cost allocation and better budgets.
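One way to actually enforce that is to drop the default from the labels variable and add a validation block. A sketch; the required key set mirrors the labels above and is otherwise an assumption:

variable "labels" {
  type = map(string)

  # Fail terraform plan if any expected label key is missing.
  validation {
    condition     = alltrue([for k in ["owner", "env", "costcenter"] : contains(keys(var.labels), k)])
    error_message = "labels must include owner, env and costcenter."
  }
}

Omitting the default forces every caller to pass labels explicitly. The budget below then watches overall monthly spend: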
resource "google_billing_budget" "prod" {
amount { specified_amount { units = "20000" } } # €20k/month
threshold_rules { threshold_percent = 0.5 }
threshold_rules { threshold_percent = 0.9 }
all_updates_rule {
pubsub_topic = google_pubsub_topic.budget.id
schema_version = "1.0"
monitoring_notification_channels = [google_monitoring_notification_channel.email.id]
}
}
Route Pub/Sub to Cloud Functions/Run -> Slack. Add per-label budgets for big spenders (datasets, projects).
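The topic and notification channel referenced in the budget aren't shown above; a minimal sketch, with a placeholder email address:

resource "google_pubsub_topic" "budget" {
  name   = "billing-budget-alerts"
  labels = var.labels
}

resource "google_monitoring_notification_channel" "email" {
  display_name = "Platform on-call"
  type         = "email"
  labels = {
    email_address = "platform-oncall@example.com" # placeholder address
  }
}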
BigQuery dataset defaults
resource "google_bigquery_dataset" "logs" {
dataset_id = "logs"
default_table_expiration_ms = 30 * 24 * 60 * 60 * 1000 # 30d
default_partition_expiration_ms = 30 * 24 * 60 * 60 * 1000
labels = var.labels
# Important: force partition filters for cost control
default_encryption_configuration { kms_key_name = google_kms_crypto_key.bq.id }
}
Require partition filter at table level
resource "google_bigquery_table" "logs" {
dataset_id = google_bigquery_dataset.logs.dataset_id
table_id = "http_access"
time_partitioning { type = "DAY" }
require_partition_filter = true
labels = var.labels
}
GCS lifecycle
resource "google_storage_bucket" "logs" {
name = "logs-${local.env}"
location = "EU"
versioning { enabled = true }
lifecycle_rule {
condition { age = 30 }
action { type = "Delete" }
}
labels = var.labels
}
resource "google_org_policy_policy" "disable_serial_ports" {
name = "organizations/${var.org_id}/policies/compute.disableSerialPortAccess"
parent = "organizations/${var.org_id}"
spec {
rules { enforce = true }
}
}
resource "google_org_policy_policy" "vm_external_ip_denied" {
name = "organizations/${var.org_id}/policies/compute.vmExternalIpAccess"
parent = "organizations/${var.org_id}"
spec { rules { deny_all = true } } # enforce private-by-default; use exceptions per project
}
Add policies for uniform bucket-level access, CMEK required, restrict egress locations, and allowed machine types to prevent oversized SKUs.
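As a sketch of one of those list constraints, here is a resource-location restriction to the EU using the predefined in:eu-locations value group (adjust to your regions):

resource "google_org_policy_policy" "eu_locations_only" {
  name   = "organizations/${var.org_id}/policies/gcp.resourceLocations"
  parent = "organizations/${var.org_id}"

  spec {
    rules {
      values {
        allowed_values = ["in:eu-locations"] # predefined value group covering EU regions
      }
    }
  }
}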
resource "google_kms_key_ring" "platform" {
name = "platform"
location = "europe-west3"
}
resource "google_kms_crypto_key" "bq" {
name = "bq-default"
key_ring = google_kms_key_ring.platform.id
rotation_period = "7776000s" # 90 days
}
# Use in BigQuery/GCS/Disks as defaults (see dataset example above)
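For BigQuery to use that key as a dataset default, its service agent needs encrypt/decrypt on the key. A sketch, assuming the key and dataset live in the same project:

data "google_project" "current" {}

resource "google_kms_crypto_key_iam_member" "bq_cmek" {
  crypto_key_id = google_kms_crypto_key.bq.id
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  # BigQuery's per-project encryption service agent
  member = "serviceAccount:bq-${data.google_project.current.number}@bigquery-encryption.iam.gserviceaccount.com"
}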
Export http_p95_latency_ms (OTel/Datadog -> Prom -> custom metric). Drive HPA off latency/queue depth, not CPU.
HPA sketch
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 2
  maxReplicas: 20
  behavior: { scaleDown: { stabilizationWindowSeconds: 300 } }
  metrics:
    - type: Pods
      pods:
        metric: { name: http_p95_latency_ms }
        target: { type: AverageValue, averageValue: "250" } # ms SLO
KEDA for queue lag
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: worker-queue }
spec:
  scaleTargetRef: { kind: Deployment, name: worker }
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: queue_lag
        query: sum(queue_lag) # PromQL for the lag metric; adjust to your metric name
        threshold: "1000"
You can manage HPA/KEDA via the kubernetes_manifest resource in Terraform's Kubernetes provider if you prefer IaC ownership.
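A sketch of that pattern; the manifest path is an assumption:

# Requires the hashicorp/kubernetes provider configured against your cluster
resource "kubernetes_manifest" "worker_scaledobject" {
  manifest = yamldecode(file("${path.module}/manifests/worker-scaledobject.yaml")) # the KEDA YAML above
}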
apiVersion: v1
kind: ConfigMap
metadata: { name: ops-flags }
data:
  DISABLE_RETRY: "false"
  MAX_CONCURRENCY: "8"
Your worker reads these at runtime; toggling reduces cascading failures and cost runaways during incidents.
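If you want the flags under the same Terraform ownership, a kubernetes_config_map sketch; the two variables are hypothetical, kept as strings so a toggle is a one-line tfvars change:

resource "kubernetes_config_map" "ops_flags" {
  metadata { name = "ops-flags" }

  data = {
    DISABLE_RETRY   = var.disable_retry   # e.g. "true" during an incident (hypothetical string variable)
    MAX_CONCURRENCY = var.max_concurrency # hypothetical string variable
  }
}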
Turn down non-prod nights/weekends.
resource "google_cloud_scheduler_job" "stop_nonprod" {
name = "stop-nonprod-evening"
schedule = "0 20 * * 1-5" # 20:00 Mon-Fri
http_target {
uri = google_cloud_run_service.ops_hook.status[0].url
http_method = "POST"
oidc_token { service_account_email = google_service_account.ops.email }
body = base64encode("{\"action\":\"scale_down\"}")
}
}
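The matching morning job isn't shown above; a sketch reusing the same ops hook (the start time and the scale_up action are assumptions):

resource "google_cloud_scheduler_job" "start_nonprod" {
  name      = "start-nonprod-morning"
  schedule  = "0 7 * * 1-5" # 07:00 Mon-Fri, assumed working hours
  time_zone = "Europe/Berlin" # assumed; match the stop job

  http_target {
    uri         = google_cloud_run_service.ops_hook.status[0].url
    http_method = "POST"
    oidc_token { service_account_email = google_service_account.ops.email }
    body = base64encode("{\"action\":\"scale_up\"}")
  }
}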
Spend anomaly alert (Monitoring)
resource "google_monitoring_alert_policy" "cost_anomaly" {
display_name = "Cost Anomaly Alert"
conditions {
display_name = "Daily cost > baseline * 1.5"
condition_monitoring_query_language {
query = "fetch billing | metric 'billing.googleapis.com/daily_cost' | condition ratio > 1.5"
duration = "1800s"
trigger { count = 1 }
}
}
notification_channels = [google_monitoring_notification_channel.email.id]
}
Pair this with SLO burn-rate alerts (availability & latency) so you’re paging on user pain and watching spend.
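A sketch of one fast-burn alert, assuming an SLO already exists and its full resource name is supplied as a hypothetical var.availability_slo_name; the 60-minute lookback and 10x threshold are common fast-burn defaults, not a prescription:

resource "google_monitoring_alert_policy" "fast_burn" {
  display_name = "Availability SLO fast burn"
  combiner     = "OR"

  conditions {
    display_name = "Burn rate > 10x over 60m"
    condition_threshold {
      # select_slo_burn_rate() reads the error-budget burn rate of an existing SLO
      filter          = "select_slo_burn_rate(\"${var.availability_slo_name}\", \"60m\")"
      comparison      = "COMPARISON_GT"
      threshold_value = 10
      duration        = "0s"
    }
  }

  notification_channels = [google_monitoring_notification_channel.email.id]
}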
Run terraform fmt -check / validate / plan on every change.
GitHub Actions sketch
- name: Terraform Validate
  run: terraform validate
- name: Infracost
  uses: infracost/actions/setup@v2
  # ... compute & post diff as PR comment
- name: Conftest (OPA)
  run: conftest test --policy policy/ terraform.plan.json
Detect drift regularly, and protect critical resources with prevent_destroy.
resource "google_bigquery_dataset" "core" {
dataset_id = "core"
# ...
lifecycle {
prevent_destroy = true
ignore_changes = [labels] # if labels mutate out-of-band
}
}
Keep environments in separate projects (proj-nonprod, proj-prod). Each module README should state where prevent_destroy applies.

This is a personal blog. The views, thoughts, and opinions expressed here are my own and do not represent, reflect, or constitute the views, policies, or positions of any employer, university, client, or organization I am associated with or have been associated with.