DR. ATABAK KH
Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient Cloud platforms.
Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator
Takeaway: a handful of Terraform patterns prevent surprise spend, reduce pager incidents, and make audits easy.
Remote state + locking
terraform {
  required_version = ">= 1.6.0"

  # GCS backend stores state remotely and handles state locking natively
  backend "gcs" {
    bucket = "tf-state-prod"
    prefix = "platform"
  }

  required_providers {
    google = { source = "hashicorp/google", version = "~> 5.30" }
  }
}
Naming & labels
locals {
  name   = "api"
  env    = "prod"
  labels = { owner = "platform", env = local.env, costcenter = "fintech" }
}

variable "labels" {
  type    = map(string)
  default = { owner = "platform", env = "prod", costcenter = "fintech" }
}

resource "google_compute_instance" "api" {
  name = "${local.name}-${local.env}-1" # e.g. "api-prod-1"
  # ...
  labels = var.labels
}
Make labels a module input and require it for all resources; it unlocks per-team cost allocation and better budgets.
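A minimal sketch of enforcing that contract inside the module, assuming the three keys shown above are the required set (drop the default so callers must pass labels explicitly):

variable "labels" {
  type = map(string)

  validation {
    # Hypothetical policy: every caller must supply these three keys
    condition     = alltrue([for k in ["owner", "env", "costcenter"] : contains(keys(var.labels), k)])
    error_message = "labels must include owner, env and costcenter."
  }
}

With labels guaranteed, a monthly budget with alert thresholds closes the loop: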
resource "google_billing_budget" "prod" {
amount { specified_amount { units = "20000" } } # €20k/month
threshold_rules { threshold_percent = 0.5 }
threshold_rules { threshold_percent = 0.9 }
all_updates_rule {
pubsub_topic = google_pubsub_topic.budget.id
schema_version = "1.0"
monitoring_notification_channels = [google_monitoring_notification_channel.email.id]
}
}
Route Pub/Sub to Cloud Functions/Run -> Slack. Add per-label budgets for big spenders (datasets, projects).
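A hedged sketch of such a per-label budget, assuming the hypothetical var.billing_account from above and a team that labels its resources owner = "data":

resource "google_billing_budget" "data_team" {
  billing_account = var.billing_account # hypothetical variable, as above

  budget_filter {
    labels = { owner = "data" } # only spend from resources labeled owner=data counts
  }

  amount {
    specified_amount {
      currency_code = "EUR"
      units         = "5000"
    }
  }

  threshold_rules { threshold_percent = 0.8 }
}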
BigQuery dataset defaults
resource "google_bigquery_dataset" "logs" {
dataset_id = "logs"
default_table_expiration_ms = 30 * 24 * 60 * 60 * 1000 # 30d
default_partition_expiration_ms = 30 * 24 * 60 * 60 * 1000
labels = var.labels
# Important: force partition filters for cost control
default_encryption_configuration { kms_key_name = google_kms_crypto_key.bq.id }
}
Require partition filter at table level
resource "google_bigquery_table" "logs" {
dataset_id = google_bigquery_dataset.logs.dataset_id
table_id = "http_access"
time_partitioning { type = "DAY" }
require_partition_filter = true
labels = var.labels
}
GCS lifecycle
resource "google_storage_bucket" "logs" {
name = "logs-${local.env}"
location = "EU"
versioning { enabled = true }
lifecycle_rule {
condition { age = 30 }
action { type = "Delete" }
}
labels = var.labels
}
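With versioning on, noncurrent object versions quietly accumulate; a minimal sketch of an extra rule for the same bucket, assuming a 7-day rollback window is enough:

# Add inside google_storage_bucket.logs: drop noncurrent versions after 7 days
lifecycle_rule {
  condition { days_since_noncurrent_time = 7 }
  action { type = "Delete" }
}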
resource "google_org_policy_policy" "disable_serial_ports" {
name = "organizations/${var.org_id}/policies/compute.disableSerialPortAccess"
parent = "organizations/${var.org_id}"
spec {
rules { enforce = true }
}
}
resource "google_org_policy_policy" "vm_external_ip_denied" {
name = "organizations/${var.org_id}/policies/compute.vmExternalIpAccess"
parent = "organizations/${var.org_id}"
spec { rules { deny_all = true } } # enforce private-by-default; use exceptions per project
}
Add policies for uniform bucket-level access, CMEK required, restrict egress locations, and allowed machine types to prevent oversized SKUs.
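As one example, a hedged sketch (reusing var.org_id from above) that enforces uniform bucket-level access across the org:

resource "google_org_policy_policy" "uniform_bucket_access" {
  name   = "organizations/${var.org_id}/policies/storage.uniformBucketLevelAccess"
  parent = "organizations/${var.org_id}"

  spec {
    rules { enforce = "TRUE" }
  }
}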
resource "google_kms_key_ring" "platform" {
name = "platform"
location = "europe-west3"
}
resource "google_kms_crypto_key" "bq" {
name = "bq-default"
key_ring = google_kms_key_ring.platform.id
rotation_period = "7776000s" # 90 days
}
# Use in BigQuery/GCS/Disks as defaults (see dataset example above)
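One step the snippet above leaves implicit: BigQuery's service agent needs encrypt/decrypt rights on the key before the CMEK default works. A minimal sketch, assuming the provider's default-service-account data source:

data "google_bigquery_default_service_account" "bq_sa" {}

resource "google_kms_crypto_key_iam_member" "bq_uses_key" {
  crypto_key_id = google_kms_crypto_key.bq.id
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "serviceAccount:${data.google_bigquery_default_service_account.bq_sa.email}"
}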
Export http_p95_latency_ms (OTel/Datadog -> Prom -> custom metric). Drive HPA off latency/queue depth, not CPU.
HPA sketch
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 2
  maxReplicas: 20
  behavior: { scaleDown: { stabilizationWindowSeconds: 300 } }
  metrics:
    - type: Pods
      pods:
        metric: { name: http_p95_latency_ms }
        target: { type: AverageValue, averageValue: "250" } # ms SLO
KEDA for queue lag
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: worker-queue }
spec:
  scaleTargetRef: { kind: Deployment, name: worker }
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: queue_lag
        query: sum(queue_lag) # PromQL returning the current backlog; adjust to your metric
        threshold: "1000"
You can manage HPA/KEDA objects via the Terraform Kubernetes provider's kubernetes_manifest resource if you prefer IaC ownership.
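A minimal sketch, assuming the Kubernetes provider is configured and the HPA manifest above is saved at a hypothetical manifests/hpa-api.yaml:

resource "kubernetes_manifest" "api_hpa" {
  # Hypothetical path; any YAML manifest (HPA, ScaledObject, ...) works here
  manifest = yamldecode(file("${path.module}/manifests/hpa-api.yaml"))
}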
apiVersion: v1
kind: ConfigMap
metadata: { name: ops-flags }
data:
  DISABLE_RETRY: "false"
  MAX_CONCURRENCY: "8"
Your worker reads these at runtime; toggling them during incidents reduces cascading failures and cost runaways.
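If you want Terraform to own these flags too, a sketch using the Kubernetes provider's ConfigMap resource:

resource "kubernetes_config_map" "ops_flags" {
  metadata { name = "ops-flags" }

  data = {
    DISABLE_RETRY   = "false"
    MAX_CONCURRENCY = "8"
  }
}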
Turn down non-prod nights/weekends.
resource "google_cloud_scheduler_job" "stop_nonprod" {
name = "stop-nonprod-evening"
schedule = "0 20 * * 1-5" # 20:00 Mon-Fri
http_target {
uri = google_cloud_run_service.ops_hook.status[0].url
http_method = "POST"
oidc_token { service_account_email = google_service_account.ops.email }
body = base64encode("{\"action\":\"scale_down\"}")
}
}
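The matching morning job is a mirror image; a sketch assuming the same hypothetical ops_hook service also understands a scale_up action:

resource "google_cloud_scheduler_job" "start_nonprod" {
  name     = "start-nonprod-morning"
  schedule = "0 7 * * 1-5" # 07:00 Mon-Fri

  http_target {
    uri         = google_cloud_run_service.ops_hook.status[0].url
    http_method = "POST"
    oidc_token { service_account_email = google_service_account.ops.email }
    body = base64encode("{\"action\":\"scale_up\"}")
  }
}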
Spend anomaly alert (Monitoring)
resource "google_monitoring_alert_policy" "cost_anomaly" {
display_name = "Cost Anomaly Alert"
conditions {
display_name = "Daily cost > baseline * 1.5"
condition_monitoring_query_language {
query = "fetch billing | metric 'billing.googleapis.com/daily_cost' | condition ratio > 1.5"
duration = "1800s"
trigger { count = 1 }
}
}
notification_channels = [google_monitoring_notification_channel.email.id]
}
Pair this with SLO burn-rate alerts (availability & latency) so you’re paging on user pain and watching spend.
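A minimal SLO sketch those burn-rate alerts can reference, assuming a hypothetical Cloud Monitoring custom service named api already exists in your config:

resource "google_monitoring_slo" "api_availability" {
  service      = google_monitoring_custom_service.api.service_id # hypothetical custom service
  display_name = "API availability 99.9% (28d rolling)"

  goal                = 0.999
  rolling_period_days = 28

  # Availability SLI computed by Cloud Monitoring from the service's request data
  basic_sli {
    availability { enabled = true }
  }
}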
Run terraform fmt -check, terraform validate, and terraform plan in CI. GitHub Actions sketch:
- name: Terraform Fmt
  run: terraform fmt -check -recursive
- name: Terraform Validate
  run: terraform validate
- name: Terraform Plan (JSON for policy checks)
  run: |
    terraform plan -out=tfplan.binary
    terraform show -json tfplan.binary > terraform.plan.json
- name: Infracost
  uses: infracost/actions/setup@v2
  # ... compute & post diff as PR comment
- name: Conftest (OPA)
  run: conftest test --policy policy/ terraform.plan.json
Detect drift regularly, and protect critical resources with prevent_destroy.
resource "google_bigquery_dataset" "core" {
dataset_id = "core"
# ...
lifecycle {
prevent_destroy = true
ignore_changes = [labels] # if labels mutate out-of-band
}
}
Keep separate projects (and state) per environment (proj-nonprod, proj-prod). Each module README should state which resources carry prevent_destroy.