DR. ATABAK KH
Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient Cloud platforms.
Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator
Context: BigQuery is fast to adopt, and just as easy to overspend on. Here’s a comprehensive checklist I use in migrations to avoid common cost pitfalls.
Migrating from Hadoop/Oracle to BigQuery can reduce costs by 50-70%, but only if you avoid these common mistakes. This guide covers the 8 most expensive pitfalls and how to fix them.
The Problem: Tables designed without understanding query patterns lead to full table scans and expensive joins.
Why it’s expensive: On-demand pricing bills for every byte a query reads, so a generic one-size-fits-all table pays for a full scan and a repeated large join on every dashboard refresh.
The Fix: Model serving tables to match WHERE/JOIN patterns; pre-compute heavy joins.
Example:
-- BAD: Generic table, every query scans everything
CREATE TABLE events (
event_id STRING,
user_id STRING,
event_type STRING,
event_ts TIMESTAMP,
data JSON
);
-- GOOD: Serving table optimized for common queries
CREATE TABLE serving.user_events_recent AS
SELECT
user_id,
event_type,
DATE(event_ts) AS event_date,
COUNT(*) AS event_count,
ARRAY_AGG(STRUCT(event_id, event_ts, data) ORDER BY event_ts DESC LIMIT 100) AS recent_events
FROM
landing.raw_events
WHERE
event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY
user_id, event_type, DATE(event_ts);
-- Pre-computed join for dashboard queries
CREATE TABLE serving.user_events_with_profile AS
SELECT
ue.*,
up.name,
up.email,
up.subscription_tier
FROM
serving.user_events_recent ue
JOIN
curated.user_profiles up
USING (user_id);
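With the join pre-computed, a typical dashboard query only touches the small serving table. A minimal sketch, reusing the table and column names from the example above:
-- Dashboard query hits the narrow serving table, not the raw events
SELECT
  subscription_tier,
  SUM(event_count) AS purchase_events_last_30_days
FROM serving.user_events_with_profile
WHERE event_type = 'purchase'
GROUP BY subscription_tier
ORDER BY purchase_events_last_30_days DESC;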
Best practices: design serving tables around the handful of dashboard and API queries that run most often, pre-compute the joins those queries repeat, and refresh serving tables on a schedule rather than rebuilding them per query.
The Problem: Without partitioning and clustering, every query scans the entire table.
Why it’s expensive: Without partition pruning, a query that needs one day of data still scans, and bills for, the table’s entire history.
The Fix: Always partition by date/time; cluster by high-cardinality columns used in WHERE/JOIN.
Example:
-- BAD: No partitioning, every query scans everything
CREATE TABLE events (
event_id STRING,
user_id STRING,
event_ts TIMESTAMP,
data JSON
);
-- GOOD: Partitioned by date, clustered by user_id and event_type
CREATE TABLE curated.fact_events (
event_id STRING,
user_id STRING,
event_type STRING,
event_ts TIMESTAMP,
data JSON
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_type
OPTIONS (
description = "Partitioned by date for time-range queries, clustered by user_id and event_type for user-specific queries"
);
Partitioning strategy: partition by the date or timestamp column used in time-range filters (or use integer-range partitioning on a key such as shard_id).
Clustering strategy: cluster by up to four high-cardinality columns that appear most often in WHERE and JOIN clauses, ordered from most to least frequently filtered.
Real-world example:
-- Events table: partitioned by date, clustered by user and type
CREATE TABLE curated.events
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_type
AS
SELECT * FROM landing.raw_events;
-- Query only scans relevant partitions
SELECT COUNT(*)
FROM curated.events
WHERE DATE(event_ts) BETWEEN '2025-01-01' AND '2025-01-07' -- Only scans 7 partitions
AND user_id = 'user-123' -- Clustering helps
AND event_type = 'purchase';
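To confirm the table is actually partitioned the way you expect (and to spot oversized partitions), you can inspect the partition metadata. A sketch against the dataset from the example; INFORMATION_SCHEMA.PARTITIONS is a standard BigQuery view:
-- Row counts and size per partition for curated.events
SELECT
  partition_id,
  total_rows,
  ROUND(total_logical_bytes / POW(1024, 3), 2) AS logical_gib
FROM curated.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'events'
ORDER BY partition_id DESC
LIMIT 14; -- last two weeks of daily partitions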
The Problem: Wildcard queries scan all matching tables/partitions, even if you only need recent data.
Why it’s expensive: A wildcard with no restricting filter reads every matching table, so cost grows with the size of the whole shard set rather than the slice you actually need.
The Fix: Restrict wildcard (date-sharded) tables with _TABLE_SUFFIX, filter partitioned tables on the partition column, and prefer parameterized queries.
Example:
-- BAD: Wildcard scans all partitions (expensive!)
SELECT *
FROM `project.dataset.events_*`
WHERE event_type = 'purchase';
-- GOOD: Restrict the wildcard to the tables you need with _TABLE_SUFFIX
SELECT *
FROM `project.dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20250101' AND '20250131'
AND event_type = 'purchase';
-- BETTER: Use partitioned table with date filter
SELECT *
FROM curated.events
WHERE DATE(event_ts) BETWEEN '2025-01-01' AND '2025-01-31'
AND event_type = 'purchase';
Best practices: require a partition filter on large tables (require_partition_filter = TRUE), restrict every wildcard query with _TABLE_SUFFIX, and use _PARTITIONTIME / _PARTITIONDATE when filtering ingestion-time partitioned tables.
Parameterized query example:
-- Drive the partition filter with script variables (or externally supplied query parameters)
DECLARE start_date DATE DEFAULT '2025-01-01';
DECLARE end_date DATE DEFAULT '2025-01-31';
SELECT *
FROM curated.events
WHERE DATE(event_ts) BETWEEN start_date AND end_date
AND event_type = 'purchase';
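To make the partition filter mandatory rather than optional, you can set require_partition_filter on the table; queries that omit the filter are then rejected instead of silently scanning everything. A sketch, assuming curated.events is partitioned as shown earlier:
-- Reject any query on curated.events that lacks a partition filter
ALTER TABLE curated.events
SET OPTIONS (require_partition_filter = TRUE);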
The Problem: Clustering on effectively unique IDs (such as UUIDs) gives no pruning benefit, and heavily skewed keys concentrate work on a few slots, making queries slow.
Why it’s expensive: You still pay for every byte scanned, while the skewed or near-unique key adds slot time without reducing the scan, so queries cost the same and finish slower.
The Fix: Re-shard by adding a salting key; avoid clustering by super-high-cardinality IDs.
Example:
-- BAD: Clustering by UUID (high cardinality, no benefit)
CREATE TABLE events
PARTITION BY DATE(event_ts)
CLUSTER BY event_id -- UUID, 1B+ unique values, no benefit
AS SELECT * FROM landing.raw_events;
-- GOOD: Add salting key for high-cardinality columns
CREATE TABLE curated.events
PARTITION BY DATE(event_ts)
CLUSTER BY shard_id, user_id -- shard_id = MOD(ABS(FARM_FINGERPRINT(event_id)), 100)
AS
SELECT
*,
MOD(ABS(FARM_FINGERPRINT(event_id)), 100) AS shard_id -- 100 shards
FROM landing.raw_events;
-- Query with shard filter for better distribution
SELECT *
FROM curated.events
WHERE DATE(event_ts) = '2025-01-15'
AND shard_id = MOD(ABS(FARM_FINGERPRINT('target-event-id')), 100)
AND event_id = 'target-event-id';
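To check which columns a table is actually partitioned and clustered by (useful after a re-shard), the column metadata view exposes both. A sketch for the example table:
-- Partitioning and clustering columns of curated.events
SELECT
  column_name,
  is_partitioning_column,
  clustering_ordinal_position
FROM curated.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'events'
  AND (is_partitioning_column = 'YES' OR clustering_ordinal_position IS NOT NULL)
ORDER BY clustering_ordinal_position;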
When to use salting: only when a skewed or near-unique key still has to drive point lookups or joins; for most tables, date partitioning plus clustering on the real filter columns is enough.
Salting strategy: pick a fixed shard count (for example, 100), use FARM_FINGERPRINT for deterministic hashing, and apply the same shard expression in queries so the filter can be pushed down.
The Problem: Using raw data as the final model leads to repeated expensive transformations in every query.
Why it’s expensive: Every report re-parses JSON and re-joins the raw tables, so the same transformation is paid for on every query instead of once per load.
The Fix: Implement a layered architecture: landing -> curated -> serving; materialize BI views where needed; document lineage.
Example architecture:
-- Layer 1: Landing (raw data, minimal transformation)
CREATE TABLE landing.raw_events (
event_id STRING,
user_id STRING,
event_ts TIMESTAMP,
raw_data JSON
)
PARTITION BY DATE(event_ts);
-- Layer 2: Curated (cleaned, validated, typed)
CREATE TABLE curated.events (
event_id STRING,
user_id STRING,
event_type STRING,
event_ts TIMESTAMP,
event_data STRUCT<
product_id STRING,
amount FLOAT64,
currency STRING
>
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_type
AS
SELECT
event_id,
user_id,
JSON_EXTRACT_SCALAR(raw_data, '$.type') AS event_type,
event_ts,
STRUCT(
JSON_EXTRACT_SCALAR(raw_data, '$.product_id') AS product_id,
CAST(JSON_EXTRACT_SCALAR(raw_data, '$.amount') AS FLOAT64) AS amount,
JSON_EXTRACT_SCALAR(raw_data, '$.currency') AS currency
) AS event_data
FROM landing.raw_events
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY);
-- Layer 3: Serving (pre-aggregated, optimized for queries)
CREATE TABLE serving.user_daily_stats (
user_id STRING,
event_date DATE,
total_events INT64,
total_revenue FLOAT64,
unique_products INT64,
last_event_ts TIMESTAMP
)
PARTITION BY event_date
CLUSTER BY user_id
AS
SELECT
user_id,
DATE(event_ts) AS event_date,
COUNT(*) AS total_events,
SUM(IF(event_type = 'purchase', event_data.amount, 0)) AS total_revenue,
COUNT(DISTINCT event_data.product_id) AS unique_products,
MAX(event_ts) AS last_event_ts
FROM curated.events
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY user_id, DATE(event_ts);
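Where a serving table needs to stay fresh without scheduled rebuilds, a materialized view over the curated layer is one option. A sketch only: materialized views support a limited set of aggregates, so COUNT(DISTINCT ...) becomes APPROX_COUNT_DISTINCT here, and the view name is illustrative.
-- BigQuery maintains this incrementally as curated.events changes
CREATE MATERIALIZED VIEW serving.mv_user_daily_stats AS
SELECT
  user_id,
  DATE(event_ts) AS event_date,
  COUNT(*) AS total_events,
  SUM(IF(event_type = 'purchase', event_data.amount, 0)) AS total_revenue,
  APPROX_COUNT_DISTINCT(event_data.product_id) AS unique_products,
  MAX(event_ts) AS last_event_ts
FROM curated.events
GROUP BY user_id, event_date;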
Best practices: run each transformation once per load rather than once per query, point BI tools only at the serving layer, and document lineage from landing through serving.
The Problem: Keeping all data forever leads to ever-increasing storage and query costs.
Why it’s expensive: Storage grows without bound, and queries without tight date filters keep scanning old partitions nobody reads.
The Fix: Implement table TTLs and partition expiration; set log retention by class (hot/warm/cold).
Example:
-- GOOD: Partition expiration for old data
CREATE TABLE curated.events (
event_id STRING,
user_id STRING,
event_type STRING,
event_ts TIMESTAMP,
data JSON
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_type
OPTIONS (
partition_expiration_days = 90, -- Auto-delete partitions older than 90 days
description = "Events table with 90-day retention"
);
-- GOOD: Table-level TTL for entire tables
CREATE TABLE staging.temp_results (
result_id STRING,
data JSON,
created_at TIMESTAMP
)
OPTIONS (
expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) -- Delete after 7 days
);
-- GOOD: Different retention by data class
-- Hot data: 30 days (frequently queried)
CREATE TABLE serving.events_recent
PARTITION BY DATE(event_ts)
OPTIONS (partition_expiration_days = 30)
AS SELECT * FROM curated.events WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY);
-- Warm data: 90 days (occasionally queried)
CREATE TABLE curated.events
PARTITION BY DATE(event_ts)
OPTIONS (partition_expiration_days = 90)
AS SELECT * FROM landing.raw_events WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY);
-- Cold data: Archive to Cloud Storage (rarely queried)
-- Use BigQuery export jobs to move old partitions to GCS
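A minimal sketch of that archival step using the EXPORT DATA statement; the bucket name and Parquet format are illustrative choices, not part of the original setup.
-- Archive one month of old events to Cloud Storage before the partitions expire
-- (gs://my-archive-bucket is an illustrative bucket name)
EXPORT DATA OPTIONS (
  uri = 'gs://my-archive-bucket/events/2024-10/*.parquet',
  format = 'PARQUET',
  overwrite = TRUE
) AS
SELECT *
FROM curated.events
WHERE DATE(event_ts) BETWEEN '2024-10-01' AND '2024-10-31';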
Retention strategy: keep hot data (roughly 30 days) in serving tables, warm data (roughly 90 days) in curated tables, and archive anything older to Cloud Storage.
Cost savings: expired partitions stop accruing storage charges entirely, and partitions left unmodified for 90 days drop automatically to long-term storage at roughly half the active-storage rate.
The Problem: Without guardrails, runaway queries can cost thousands of dollars in minutes.
Why it’s expensive: A single unbounded SELECT * over a multi-terabyte table can bill terabytes of scanning in one run, and nothing stops it by default.
The Fix: Set per-dataset quotas, use reservations for steady workloads, implement cost controls (budgets, alerts).
Example:
# GOOD: Set default table and partition expiration at the dataset level (bq CLI)
bq update \
  --default_table_expiration 3600 \
  --default_partition_expiration 86400 \
  my_dataset

# GOOD: Cap bytes billed per query (maximum_bytes_billed is a job setting, not SQL syntax)
bq query \
  --use_legacy_sql=false \
  --maximum_bytes_billed=10000000000 \
  'SELECT COUNT(*) FROM curated.events
   WHERE DATE(event_ts) BETWEEN "2025-01-01" AND "2025-01-31"'

# GOOD: Per-user daily scan quota (custom quota)
# Cloud console: IAM & Admin > Quotas > BigQuery API > query usage per day / per user per day
# e.g. limit each user to 1 TB of scanned bytes per day
Cost control strategy: set custom per-project and per-user query quotas, buy reservations (slot commitments) for steady workloads, create budget alerts, and set maximum_bytes_billed on ad-hoc queries.
Budget alert example:
# Create budget alerts at 90% and 100% of a $1,000 monthly budget
gcloud billing budgets create \
  --billing-account=YOUR_BILLING_ACCOUNT \
  --display-name="BigQuery Budget Alert" \
  --budget-amount=1000USD \
  --threshold-rule=percent=0.9 \
  --threshold-rule=percent=1.0 \
  --filter-projects=projects/PROJECT_ID
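To see who is actually spending, the jobs metadata view breaks scanned bytes down per user. A sketch over the last 7 days; the region-us qualifier is an assumption, so use your dataset's region.
-- Bytes billed per user over the last 7 days
SELECT
  user_email,
  COUNT(*) AS query_count,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT -- replace region-us with your region
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND statement_type != 'SCRIPT' -- avoid double-counting parent script jobs
GROUP BY user_email
ORDER BY tib_billed DESC;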
The Problem: Using a single CTE in multiple downstream SELECTs can cause multiple executions, multiplying costs.
Why it’s expensive: BigQuery may inline a non-recursive CTE at every place it is referenced, so an expensive aggregation can run once per reference instead of once.
The Fix: For CTEs used multiple times, materialize to a temp table to avoid repeated execution.
Example:
-- BAD: CTE used multiple times (may execute multiple times)
WITH expensive_cte AS (
SELECT
user_id,
event_type,
COUNT(*) AS event_count
FROM curated.events
WHERE DATE(event_ts) BETWEEN '2025-01-01' AND '2025-01-31'
GROUP BY user_id, event_type
)
SELECT
user_id,
SUM(event_count) AS total_events
FROM expensive_cte
GROUP BY user_id
UNION ALL
SELECT
event_type,
SUM(event_count) AS total_events
FROM expensive_cte
GROUP BY event_type;
-- GOOD: Materialize to temp table (executes once)
CREATE TEMP TABLE temp_user_events AS
SELECT
user_id,
event_type,
COUNT(*) AS event_count
FROM curated.events
WHERE DATE(event_ts) BETWEEN '2025-01-01' AND '2025-01-31'
GROUP BY user_id, event_type;
SELECT
user_id,
SUM(event_count) AS total_events
FROM temp_user_events
GROUP BY user_id
UNION ALL
SELECT
event_type,
SUM(event_count) AS total_events
FROM temp_user_events
GROUP BY event_type;
When to use temp tables: when an intermediate result is expensive to compute and referenced more than once, or when it feeds several statements in the same script.
Best practices: check the query execution graph to see whether a CTE stage ran more than once, and compare bytes billed before and after materializing.
Quick win SQL (cap scan size & prove impact)
-- maximum_bytes_billed is a job setting, not SQL syntax, so pass it when you run the query:
--   bq query --use_legacy_sql=false --maximum_bytes_billed=10000000000 \
--     --parameter=start:DATE:2025-01-01 --parameter=end:DATE:2025-01-31 \
--     'SELECT * FROM curated.fact_events WHERE DATE(event_ts) BETWEEN @start AND @end'
SELECT *
FROM curated.fact_events
WHERE DATE(event_ts) BETWEEN @start AND @end; -- the partition filter keeps the scan small; the cap makes overruns fail instead of billing
Outcome: predictable cost, faster queries, cleaner analytics. Start with a 2-week readiness check, then a 4-week blueprint.
This is a personal blog. The views, thoughts, and opinions expressed here are my own and do not represent, reflect, or constitute the views, policies, or positions of any employer, university, client, or organization I am associated with or have been associated with.