DR. ATABAK KH
Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient Cloud platforms.
Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator
Context: BigQuery is fast to adopt, and easy to overspend on. Here is the checklist I use in migrations to avoid the most common cost pitfalls.
Migrating from Hadoop/Oracle to BigQuery can reduce costs by 50-70%, but only if you avoid these common mistakes. This guide covers the 8 most expensive pitfalls and how to fix them.
The Problem: Tables designed without understanding query patterns lead to full table scans and expensive joins.
Why it’s expensive: on-demand pricing bills for every byte scanned, so a generic table forces full scans and repeats the same wide joins on every dashboard refresh.
The Fix: Model serving tables to match WHERE/JOIN patterns; pre-compute heavy joins.
Example:
-- BAD: Generic table, every query scans everything
CREATE TABLE events (
event_id STRING,
user_id STRING,
event_type STRING,
event_ts TIMESTAMP,
data JSON
);
-- GOOD: Serving table optimized for common queries
CREATE TABLE serving.user_events_recent AS
SELECT
user_id,
event_type,
DATE(event_ts) AS event_date,
COUNT(*) AS event_count,
ARRAY_AGG(STRUCT(event_id, event_ts, data) ORDER BY event_ts DESC LIMIT 100) AS recent_events
FROM
landing.raw_events
WHERE
event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY
user_id, event_type, DATE(event_ts);
-- Pre-computed join for dashboard queries
CREATE TABLE serving.user_events_with_profile AS
SELECT
ue.*,
up.name,
up.email,
up.subscription_tier
FROM
serving.user_events_recent ue
JOIN
curated.user_profiles up
USING (user_id);
Best practices: design serving tables around the WHERE/JOIN columns your dashboards actually use, pre-join reference data once instead of in every query, and refresh serving tables on a schedule rather than recomputing on read (a refresh sketch follows below).
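A minimal scheduling sketch, reusing the table names above; the daily schedule itself would be configured as a BigQuery scheduled query, which is an assumption of mine rather than part of the original example:
-- Run as a scheduled query so the heavy join is paid for once a day, not on every dashboard load
CREATE OR REPLACE TABLE serving.user_events_with_profile AS
SELECT
ue.*,
up.name,
up.email,
up.subscription_tier
FROM serving.user_events_recent AS ue
JOIN curated.user_profiles AS up
USING (user_id);
-- Dashboards then read the flat, pre-joined table instead of repeating the join
SELECT
event_date,
SUM(event_count) AS events
FROM serving.user_events_with_profile
WHERE subscription_tier = 'premium'
GROUP BY event_date;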
The Problem: Without partitioning and clustering, every query scans the entire table.
Why it’s expensive: without partition pruning, a query that needs one day of data still pays for the whole table, every single time it runs.
The Fix: Always partition by date/time; cluster by high-cardinality columns used in WHERE/JOIN.
Example:
-- BAD: No partitioning, every query scans everything
CREATE TABLE events (
event_id STRING,
user_id STRING,
event_ts TIMESTAMP,
data JSON
);
-- GOOD: Partitioned by date, clustered by user_id and event_type
CREATE TABLE curated.fact_events (
event_id STRING,
user_id STRING,
event_type STRING,
event_ts TIMESTAMP,
data JSON
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_type
OPTIONS (
description = "Partitioned by date for time-range queries, clustered by user_id and event_type for user-specific queries"
);
Partitioning strategy: partition by the date or timestamp column used in most time-range filters (or use integer-range partitioning on a column such as shard_id).
Clustering strategy: cluster by the columns that appear most often in WHERE and JOIN clauses (for example user_id, event_type), putting the most frequently filtered column first since clustering order matters.
Real-world example:
-- Events table: partitioned by date, clustered by user and type
CREATE TABLE curated.events
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_type
AS
SELECT * FROM landing.raw_events;
-- Query only scans relevant partitions
SELECT COUNT(*)
FROM curated.events
WHERE DATE(event_ts) BETWEEN '2025-01-01' AND '2025-01-07' -- Only scans 7 partitions
AND user_id = 'user-123' -- Clustering helps
AND event_type = 'purchase';
The Problem: Wildcard queries scan all matching tables/partitions, even if you only need recent data.
Why it’s expensive: a wildcard over events_* reads every matching table, so one unfiltered query can bill for years of history in a single run.
The Fix: Always filter by _PARTITIONDATE or partition column; prefer parameterized queries.
Example:
-- BAD: Wildcard scans all partitions (expensive!)
SELECT *
FROM `project.dataset.events_*`
WHERE event_type = 'purchase';
-- GOOD: Filter by partition date first
SELECT *
FROM `project.dataset.events_*`
WHERE _PARTITIONDATE BETWEEN '2025-01-01' AND '2025-01-31'
AND event_type = 'purchase';
-- BETTER: Use partitioned table with date filter
SELECT *
FROM curated.events
WHERE DATE(event_ts) BETWEEN '2025-01-01' AND '2025-01-31'
AND event_type = 'purchase';
Best practices: always filter on the partition column; use _PARTITIONDATE or _PARTITIONTIME for ingestion-time-partitioned tables; narrow date-sharded wildcards with _TABLE_SUFFIX; prefer parameterized queries so the date range is explicit.
Parameterized query example:
-- Enforce partition filter via parameter
DECLARE start_date DATE DEFAULT '2025-01-01';
DECLARE end_date DATE DEFAULT '2025-01-31';
SELECT *
FROM curated.events
WHERE DATE(event_ts) BETWEEN start_date AND end_date
AND event_type = 'purchase';
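One additional guardrail worth knowing about (not spelled out in the checklist above, so treat the table choice as an assumption): the require_partition_filter option makes BigQuery reject any query that cannot prune partitions, instead of silently scanning everything.
-- Reject queries on this table that do not filter on the partition column
ALTER TABLE curated.events
SET OPTIONS (require_partition_filter = TRUE);
-- This now fails with a "cannot query without a partition filter" error instead of scanning the whole table:
-- SELECT COUNT(*) FROM curated.events WHERE event_type = 'purchase';
-- This still runs, because the date filter prunes partitions:
SELECT COUNT(*)
FROM curated.events
WHERE DATE(event_ts) = '2025-01-15'
AND event_type = 'purchase';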
The Problem: Clustering by an essentially unique ID (such as a UUID) or by a heavily skewed column buys little pruning and concentrates data around a few hot values, so queries stay slow and expensive.
Why it’s expensive: a cluster key nobody filters on prunes nothing, so you still pay for near-full scans, while skewed keys pile most of the data into a handful of oversized clusters.
The Fix: Re-shard by adding a salting key; avoid clustering by super-high-cardinality IDs.
Example:
-- BAD: Clustering by UUID (high cardinality, no benefit)
CREATE TABLE events
PARTITION BY DATE(event_ts)
CLUSTER BY event_id -- UUID, 1B+ unique values, no benefit
AS SELECT * FROM landing.raw_events;
-- GOOD: Add salting key for high-cardinality columns
CREATE TABLE curated.events
PARTITION BY DATE(event_ts)
CLUSTER BY shard_id, user_id -- shard_id = ABS(MOD(FARM_FINGERPRINT(event_id), 100))
AS
SELECT
*,
ABS(MOD(FARM_FINGERPRINT(event_id), 100)) AS shard_id -- 100 shards (0-99)
FROM landing.raw_events;
-- Query with shard filter for better distribution
SELECT *
FROM curated.events
WHERE DATE(event_ts) = '2025-01-15'
AND shard_id = ABS(MOD(FARM_FINGERPRINT('target-event-id'), 100))
AND event_id = 'target-event-id';
When to use salting: when a single key (one tenant, one user, one device) dominates the data, or when the only natural key is an essentially unique ID.
Salting strategy: derive the shard with a deterministic hash such as FARM_FINGERPRINT, keep the shard count modest (tens to low hundreds), and recompute the same expression in queries so the filter can still prune.
The Problem: Using raw data as the final model leads to repeated expensive transformations in every query.
Why it’s expensive: every report re-parses the same JSON, re-casts the same fields, and re-runs the same joins against raw data, so the cost of the transformation is paid on every query instead of once.
The Fix: Implement a layered architecture: landing -> curated -> serving; materialize BI views where needed; document lineage.
Example architecture:
-- Layer 1: Landing (raw data, minimal transformation)
CREATE TABLE landing.raw_events (
event_id STRING,
user_id STRING,
event_ts TIMESTAMP,
raw_data JSON
)
PARTITION BY DATE(event_ts);
-- Layer 2: Curated (cleaned, validated, typed)
CREATE TABLE curated.events (
event_id STRING,
user_id STRING,
event_type STRING,
event_ts TIMESTAMP,
event_data STRUCT<
product_id STRING,
amount FLOAT64,
currency STRING
>
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_type
AS
SELECT
event_id,
user_id,
JSON_EXTRACT_SCALAR(raw_data, '$.type') AS event_type,
event_ts,
STRUCT(
JSON_EXTRACT_SCALAR(raw_data, '$.product_id') AS product_id,
CAST(JSON_EXTRACT_SCALAR(raw_data, '$.amount') AS FLOAT64) AS amount,
JSON_EXTRACT_SCALAR(raw_data, '$.currency') AS currency
) AS event_data
FROM landing.raw_events
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY);
-- Layer 3: Serving (pre-aggregated, optimized for queries)
CREATE TABLE serving.user_daily_stats (
user_id STRING,
event_date DATE,
total_events INT64,
total_revenue FLOAT64,
unique_products INT64,
last_event_ts TIMESTAMP
)
PARTITION BY event_date
CLUSTER BY user_id
AS
SELECT
user_id,
DATE(event_ts) AS event_date,
COUNT(*) AS total_events,
SUM(IF(event_type = 'purchase', event_data.amount, 0)) AS total_revenue,
COUNT(DISTINCT event_data.product_id) AS unique_products,
MAX(event_ts) AS last_event_ts
FROM curated.events
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY user_id, DATE(event_ts);
Best practices: keep landing append-only and cheap, put validation and typing in curated, keep serving small and shaped for the queries that hit it, materialize the views your BI tools hammer (a materialized-view sketch follows below), and document the lineage between layers.
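Where a BI view is queried constantly, a materialized view can keep the aggregate incrementally fresh instead of re-reading curated rows on every query. A minimal sketch, assuming the curated.events schema above (the view name and refresh interval are illustrative):
-- Materialized view over curated purchases; BigQuery maintains it incrementally
CREATE MATERIALIZED VIEW serving.mv_user_daily_revenue
OPTIONS (enable_refresh = TRUE, refresh_interval_minutes = 60)
AS
SELECT
user_id,
DATE(event_ts) AS event_date,
COUNT(*) AS purchase_events,
SUM(event_data.amount) AS revenue
FROM curated.events
WHERE event_type = 'purchase'
GROUP BY user_id, DATE(event_ts);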
The Problem: Keeping all data forever leads to ever-increasing storage and query costs.
Why it’s expensive: storage grows without bound, and queries that default to scanning all history get slower and pricier every month.
The Fix: Implement table TTLs and partition expiration; set log retention by class (hot/warm/cold).
Example:
-- GOOD: Partition expiration for old data
CREATE TABLE curated.events (
event_id STRING,
user_id STRING,
event_type STRING,
event_ts TIMESTAMP,
data JSON
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_type
OPTIONS (
partition_expiration_days = 90, -- Auto-delete partitions older than 90 days
description = "Events table with 90-day retention"
);
-- GOOD: Table-level TTL for entire tables
CREATE TABLE staging.temp_results (
result_id STRING,
data JSON,
created_at TIMESTAMP
)
OPTIONS (
expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) -- Delete after 7 days
);
-- GOOD: Different retention by data class
-- Hot data: 30 days (frequently queried)
CREATE TABLE serving.events_recent
PARTITION BY DATE(event_ts)
OPTIONS (partition_expiration_days = 30)
AS SELECT * FROM curated.events WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY);
-- Warm data: 90 days (occasionally queried)
CREATE TABLE curated.events
PARTITION BY DATE(event_ts)
OPTIONS (partition_expiration_days = 90)
AS SELECT * FROM landing.raw_events WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY);
-- Cold data: Archive to Cloud Storage (rarely queried)
-- Use BigQuery export jobs to move old partitions to GCS
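A minimal archive sketch for that last step, assuming a hypothetical bucket gs://my-archive-bucket and the 90-day retention above (export partitions shortly before they expire):
-- Export 60-90 day old partitions to Parquet on Cloud Storage before expiration removes them
EXPORT DATA OPTIONS (
uri = 'gs://my-archive-bucket/events/archive-*.parquet',
format = 'PARQUET',
overwrite = TRUE
) AS
SELECT *
FROM curated.events
WHERE DATE(event_ts) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
AND DATE_SUB(CURRENT_DATE(), INTERVAL 60 DAY);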
Retention strategy: classify data as hot (roughly 30 days, serving), warm (roughly 90 days, curated), and cold (archived to Cloud Storage), and enforce each class with partition or table expiration instead of manual cleanup jobs.
Cost savings: expired partitions stop accruing storage charges and can no longer be scanned by accident, and partitions left untouched for 90 days drop to BigQuery's cheaper long-term storage rate.
The Problem: Without guardrails, runaway queries can cost thousands of dollars in minutes.
Why it’s expensive: a single careless SELECT * over an unpartitioned, multi-terabyte table can cost more than a week of normal workload, and by default nothing stops it.
The Fix: Set per-dataset quotas, use reservations for steady workloads, implement cost controls (budgets, alerts).
Example:
-- GOOD: Dataset-level defaults (table and partition expiration, in seconds)
-- Set via the bq CLI:
bq update \
--default_table_expiration 3600 \
--default_partition_expiration 86400 \
my_dataset
-- GOOD: Per-query byte cap (maximum_bytes_billed is a job setting, not SQL syntax)
-- The job fails before running anything if it would scan more than the cap:
bq query \
--use_legacy_sql=false \
--maximum_bytes_billed=10000000000 \
'SELECT *
FROM curated.events
WHERE DATE(event_ts) BETWEEN "2025-01-01" AND "2025-01-31"'
-- GOOD: Per-user daily quota (via IAM)
-- Set in BigQuery console: IAM & Admin > Quotas
-- Limit: 1 TB per user per day
Cost control strategy: set maximum_bytes_billed on every ad-hoc job (and as a default where your tooling allows it), apply custom per-user query quotas, move steady workloads onto reservations, and attach budgets and alerts to the billing account.
Budget alert example:
# Create budget alerts at 90% and 100% of a $1,000 monthly budget
gcloud billing budgets create \
--billing-account=YOUR_BILLING_ACCOUNT \
--display-name="BigQuery Budget Alert" \
--budget-amount=1000USD \
--threshold-rule=percent=0.9 \
--threshold-rule=percent=1.0 \
--filter-projects=projects/PROJECT_ID
The Problem: Using a single CTE in multiple downstream SELECTs can cause multiple executions, multiplying costs.
Why it’s expensive: BigQuery may re-evaluate a non-recursive WITH clause each time it is referenced, so a CTE read by two downstream SELECTs can scan the underlying table twice.
The Fix: For CTEs used multiple times, materialize to a temp table to avoid repeated execution.
Example:
-- BAD: CTE used multiple times (may execute multiple times)
WITH expensive_cte AS (
SELECT
user_id,
event_type,
COUNT(*) AS event_count
FROM curated.events
WHERE DATE(event_ts) BETWEEN '2025-01-01' AND '2025-01-31'
GROUP BY user_id, event_type
)
SELECT
user_id,
SUM(event_count) AS total_events
FROM expensive_cte
GROUP BY user_id
UNION ALL
SELECT
event_type,
SUM(event_count) AS total_events
FROM expensive_cte
GROUP BY event_type;
-- GOOD: Materialize to temp table (executes once)
CREATE TEMP TABLE temp_user_events AS
SELECT
user_id,
event_type,
COUNT(*) AS event_count
FROM curated.events
WHERE DATE(event_ts) BETWEEN '2025-01-01' AND '2025-01-31'
GROUP BY user_id, event_type;
SELECT
user_id,
SUM(event_count) AS total_events
FROM temp_user_events
GROUP BY user_id
UNION ALL
SELECT
event_type,
SUM(event_count) AS total_events
FROM temp_user_events
GROUP BY event_type;
When to use temp tables: when an intermediate result is referenced more than once, feeds several statements in a script, or is expensive enough that recomputing it would dominate the bill.
Best practices: materialize once with the partition filter already applied, reuse the temp table for every downstream read, and let it expire with the session instead of persisting staging copies.
Quick win (cap scan size & prove impact)
-- maximum_bytes_billed is a job setting, so set it on the job rather than inside the SQL:
bq query \
--use_legacy_sql=false \
--maximum_bytes_billed=10000000000 \
--parameter=start:DATE:2025-01-01 \
--parameter=end:DATE:2025-01-31 \
'SELECT * FROM curated.fact_events
WHERE DATE(event_ts) BETWEEN @start AND @end'
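To prove the impact, one option (an addition of mine, not part of the original checklist) is to audit recent job costs from INFORMATION_SCHEMA; adjust the region qualifier to wherever your jobs run:
-- Yesterday's most expensive queries, by bytes billed
SELECT
user_email,
job_id,
total_bytes_billed / POW(1024, 4) AS tib_billed,
query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
AND job_type = 'QUERY'
AND state = 'DONE'
ORDER BY total_bytes_billed DESC
LIMIT 20;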
Outcome: predictable cost, faster queries, cleaner analytics. Start with a 2-week readiness check, then a 4-week blueprint.