DR. ATABAK KH

Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient cloud platforms.

Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator

Context: BigQuery is fast to adopt and easy to overspend on. Here’s the checklist I use in migrations to avoid the most common cost pitfalls.

Migrating from Hadoop/Oracle to BigQuery can reduce costs by 50-70%, but only if you avoid these common mistakes. This guide covers the 8 most expensive pitfalls and how to fix them.


Pitfall 1: No Access-Pattern Design

The Problem: Tables designed without understanding query patterns lead to full table scans and expensive joins.

Why it’s expensive:

  • Full table scans: roughly $6.25 per TiB scanned with on-demand pricing
  • Unnecessary joins: Multiplied scan costs
  • Missing pre-aggregations: Repeated expensive computations

The Fix: Model serving tables to match WHERE/JOIN patterns; pre-compute heavy joins.

Example:

-- BAD: Generic table, every query scans everything
CREATE TABLE events (
  event_id STRING,
  user_id STRING,
  event_type STRING,
  event_ts TIMESTAMP,
  data JSON
);

-- GOOD: Serving table optimized for common queries
CREATE TABLE serving.user_events_recent AS
SELECT
  user_id,
  event_type,
  DATE(event_ts) AS event_date,
  COUNT(*) AS event_count,
  ARRAY_AGG(STRUCT(event_id, event_ts, data) ORDER BY event_ts DESC LIMIT 100) AS recent_events
FROM
  landing.raw_events
WHERE
  event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY
  user_id, event_type, DATE(event_ts);

-- Pre-computed join for dashboard queries
CREATE TABLE serving.user_events_with_profile AS
SELECT
  ue.*,
  up.name,
  up.email,
  up.subscription_tier
FROM
  serving.user_events_recent ue
JOIN
  curated.user_profiles up
USING (user_id);

Best practices:

  • Analyze query logs from your old system to identify patterns (and, after migration, BigQuery’s own job history; see the sketch after this list)
  • Create serving tables for common query patterns
  • Pre-compute joins that appear in > 10% of queries
  • Document access patterns in table comments
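
Once workloads are running in BigQuery, its own job history gives you the same signal. A minimal sketch, assuming your jobs run in the US multi-region (adjust the region qualifier otherwise):

-- Find the query texts that scan the most bytes; these are the first
-- candidates for serving tables and pre-computed joins.
SELECT
  query,
  COUNT(*) AS run_count,
  ROUND(SUM(total_bytes_processed) / POW(1024, 4), 2) AS tib_scanned
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
GROUP BY query
ORDER BY tib_scanned DESC
LIMIT 20;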

Pitfall 2: Missing Partition/Cluster Strategy

The Problem: Without partitioning and clustering, every query scans the entire table.

Why it’s expensive:

  • Full table scans: roughly $6.25 per TiB scanned with on-demand pricing
  • No query pruning: Can’t skip irrelevant data
  • Slow queries: wasted slot time under capacity pricing and slower dashboards

The Fix: Always partition by date/time; cluster by high-cardinality columns used in WHERE/JOIN.

Example:

--  BAD: No partitioning, every query scans everything
CREATE TABLE events (
  event_id STRING,
  user_id STRING,
  event_ts TIMESTAMP,
  data JSON
);

--  GOOD: Partitioned by date, clustered by user_id and event_type
CREATE TABLE curated.fact_events (
  event_id STRING,
  user_id STRING,
  event_type STRING,
  event_ts TIMESTAMP,
  data JSON
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_type
OPTIONS (
  description = "Partitioned by date for time-range queries, clustered by user_id and event_type for user-specific queries"
);

Partitioning strategy:

  • Daily partitions for high-volume data (events, logs)
  • Monthly partitions for lower-volume data (aggregates, reports)
  • Integer partitioning for non-date columns (e.g., shard_id)
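
The integer option in the last bullet uses RANGE_BUCKET. A minimal sketch, with a hypothetical shard-keyed table:

-- Integer-range partitioning on a non-date key (100 buckets: 0-99)
CREATE TABLE curated.events_by_shard (
  event_id STRING,
  shard_id INT64,
  event_ts TIMESTAMP
)
PARTITION BY RANGE_BUCKET(shard_id, GENERATE_ARRAY(0, 100, 1));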

Clustering strategy:

  • Cluster by columns used in WHERE clauses (user_id, account_id)
  • Cluster by join keys (foreign keys)
  • Limit to 4 columns (more columns = diminishing returns)
  • Avoid high-cardinality columns (UUIDs, hashes) unless frequently filtered

Real-world example:

-- Events table: partitioned by date, clustered by user and type
CREATE TABLE curated.events
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_type
AS
SELECT * FROM landing.raw_events;

-- Query only scans relevant partitions
SELECT COUNT(*) 
FROM curated.events
WHERE DATE(event_ts) BETWEEN '2025-01-01' AND '2025-01-07'  -- Only scans 7 daily partitions
  AND user_id = 'user-123'                               -- Clustering helps
  AND event_type = 'purchase';

Pitfall 3: Full Scans via Wildcards

The Problem: Wildcard queries scan all matching tables/partitions, even if you only need recent data.

Why it’s expensive:

  • Scans all partitions: Even old data you don’t need
  • No partition pruning: Can’t skip irrelevant partitions
  • Multiplied costs: Each partition scanned separately

The Fix: Filter sharded (wildcard) tables by _TABLE_SUFFIX and partitioned tables by the partition column; prefer parameterized queries.

Example:

--  BAD: Wildcard scans all partitions (expensive!)
SELECT *
FROM `project.dataset.events_*`
WHERE event_type = 'purchase';

--  GOOD: Limit the shards scanned with _TABLE_SUFFIX
SELECT *
FROM `project.dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20250101' AND '20250131'
  AND event_type = 'purchase';

--  BETTER: Use partitioned table with date filter
SELECT *
FROM curated.events
WHERE DATE(event_ts) BETWEEN '2025-01-01' AND '2025-01-31'
  AND event_type = 'purchase';

Best practices:

  • Always include a partition filter (or _TABLE_SUFFIX for sharded tables) alongside your other predicates
  • Use parameterized queries to enforce partition filters
  • Set maximum_bytes_billed on queries (see Pitfall 7) to stop runaway scans
  • Use _PARTITIONTIME / _PARTITIONDATE for ingestion-time partitioned tables

Parameterized query example:

-- Enforce the partition filter via script variables (or query parameters from your client)
DECLARE start_date DATE DEFAULT '2025-01-01';
DECLARE end_date DATE DEFAULT '2025-01-31';

SELECT *
FROM curated.events
WHERE DATE(event_ts) BETWEEN start_date AND end_date
  AND event_type = 'purchase';

Pitfall 4: Hot Keys & Skew

The Problem: Clustering by high-cardinality or skewed columns causes hot partitions and slow queries.

Why it’s expensive:

  • Hot partitions: Uneven data distribution
  • Slow queries: wasted slot time under capacity pricing and slower dashboards
  • Throttling: BigQuery may throttle skewed queries

The Fix: Re-shard by adding a salting key; avoid clustering by super-high-cardinality IDs.

Example:

--  BAD: Clustering by UUID (high cardinality, no benefit)
CREATE TABLE events
PARTITION BY DATE(event_ts)
CLUSTER BY event_id  -- UUID, 1B+ unique values, no benefit
AS SELECT * FROM landing.raw_events;

--  GOOD: Add salting key for high-cardinality columns
CREATE TABLE curated.events
PARTITION BY DATE(event_ts)
CLUSTER BY shard_id, user_id  -- shard_id = ABS(MOD(FARM_FINGERPRINT(event_id), 100))
AS
SELECT
  *,
  ABS(MOD(FARM_FINGERPRINT(event_id), 100)) AS shard_id  -- 100 shards (0-99)
FROM landing.raw_events;

-- Query with shard filter for better distribution
SELECT *
FROM curated.events
WHERE DATE(event_ts) = '2025-01-15'
  AND shard_id = ABS(MOD(FARM_FINGERPRINT('target-event-id'), 100))
  AND event_id = 'target-event-id';

When to use salting:

  • High-cardinality columns (> 1M unique values)
  • Skewed distributions (80/20 rule; see the skew-check sketch after these lists)
  • Frequent point lookups by ID

Salting strategy:

  • 100-1000 shards typically sufficient
  • Use FARM_FINGERPRINT for deterministic hashing
  • Include shard in WHERE clause for best performance
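
Before committing to a salting scheme, it is worth measuring the skew. A minimal sketch against the curated.events table above:

-- Rows per key for one day; a handful of keys owning most rows signals skew
SELECT
  user_id,
  COUNT(*) AS rows_per_key
FROM curated.events
WHERE DATE(event_ts) = '2025-01-15'
GROUP BY user_id
ORDER BY rows_per_key DESC
LIMIT 20;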

Pitfall 5: Raw Dump as “Final Model”

The Problem: Using raw data as the final model leads to repeated expensive transformations in every query.

Why it’s expensive:

  • Repeated transformations: Same computation in every query
  • No optimization: Can’t pre-aggregate or pre-join
  • Complex queries: Higher complexity = higher costs

The Fix: Implement a layered architecture: landing -> curated -> serving; materialize BI views where needed; document lineage.

Example architecture:

-- Layer 1: Landing (raw data, minimal transformation)
CREATE TABLE landing.raw_events (
  event_id STRING,
  user_id STRING,
  event_ts TIMESTAMP,
  raw_data JSON
)
PARTITION BY DATE(event_ts);

-- Layer 2: Curated (cleaned, validated, typed)
CREATE TABLE curated.events (
  event_id STRING,
  user_id STRING,
  event_type STRING,
  event_ts TIMESTAMP,
  event_data STRUCT<
    product_id STRING,
    amount FLOAT64,
    currency STRING
  >
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_type
AS
SELECT
  event_id,
  user_id,
  JSON_EXTRACT_SCALAR(raw_data, '$.type') AS event_type,
  event_ts,
  STRUCT(
    JSON_EXTRACT_SCALAR(raw_data, '$.product_id') AS product_id,
    CAST(JSON_EXTRACT_SCALAR(raw_data, '$.amount') AS FLOAT64) AS amount,
    JSON_EXTRACT_SCALAR(raw_data, '$.currency') AS currency
  ) AS event_data
FROM landing.raw_events
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY);

-- Layer 3: Serving (pre-aggregated, optimized for queries)
CREATE TABLE serving.user_daily_stats (
  user_id STRING,
  event_date DATE,
  total_events INT64,
  total_revenue FLOAT64,
  unique_products INT64,
  last_event_ts TIMESTAMP
)
PARTITION BY event_date
CLUSTER BY user_id
AS
SELECT
  user_id,
  DATE(event_ts) AS event_date,
  COUNT(*) AS total_events,
  SUM(IF(event_type = 'purchase', event_data.amount, 0)) AS total_revenue,
  COUNT(DISTINCT event_data.product_id) AS unique_products,
  MAX(event_ts) AS last_event_ts
FROM curated.events
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY user_id, DATE(event_ts);

Best practices:

  • Landing layer: Raw data, minimal transformation, long retention
  • Curated layer: Cleaned, validated, typed, medium retention
  • Serving layer: Pre-aggregated, optimized for queries, short retention
  • Document lineage: Track data flow from landing -> curated -> serving
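
One lightweight way to document that lineage is on the tables themselves, via descriptions and labels. A minimal sketch (label values are illustrative):

-- Record the layer and upstream source directly in table metadata
ALTER TABLE serving.user_daily_stats
SET OPTIONS (
  description = 'Daily per-user stats. Lineage: landing.raw_events -> curated.events -> serving.user_daily_stats',
  labels = [('layer', 'serving'), ('source_table', 'curated_events')]
);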

Pitfall 6: No Retention / Lifecycle

The Problem: Keeping all data forever leads to ever-increasing storage and query costs.

Why it’s expensive:

  • Storage costs: $0.02 per GB/month (adds up over years)
  • Query costs: Scans include old data you don’t need
  • Slower queries: More data = slower scans

The Fix: Implement table TTLs and partition expiration; set retention by data class (hot/warm/cold).

Example:

--  GOOD: Partition expiration for old data
CREATE TABLE curated.events (
  event_id STRING,
  user_id STRING,
  event_type STRING,
  event_ts TIMESTAMP,
  data JSON
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_type
OPTIONS (
  partition_expiration_days = 90,  -- Auto-delete partitions older than 90 days
  description = "Events table with 90-day retention"
);

--  GOOD: Table-level TTL for entire tables
CREATE TABLE staging.temp_results (
  result_id STRING,
  data JSON,
  created_at TIMESTAMP
)
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)  -- Delete after 7 days
);

--  GOOD: Different retention by data class
-- Hot data: 30 days (frequently queried)
CREATE TABLE serving.events_recent
PARTITION BY DATE(event_ts)
OPTIONS (partition_expiration_days = 30)
AS SELECT * FROM curated.events WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY);

-- Warm data: 90 days (occasionally queried)
CREATE TABLE curated.events
PARTITION BY DATE(event_ts)
OPTIONS (partition_expiration_days = 90)
AS SELECT * FROM landing.raw_events WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY);

-- Cold data: Archive to Cloud Storage (rarely queried)
-- Use BigQuery export jobs to move old partitions to GCS
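
A minimal sketch of that export job, assuming a hypothetical gs://my-archive-bucket; pair it with partition expiration so the archived day eventually drops out of BigQuery:

-- Archive one old day to Cloud Storage as Parquet
EXPORT DATA OPTIONS (
  uri = 'gs://my-archive-bucket/events/2024-10-01/*.parquet',
  format = 'PARQUET',
  overwrite = true
) AS
SELECT *
FROM curated.events
WHERE DATE(event_ts) = '2024-10-01';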

Retention strategy:

  • Hot data (frequently queried): 7-30 days
  • Warm data (occasionally queried): 30-90 days
  • Cold data (rarely queried): Archive to Cloud Storage
  • Compliance data: Keep as required by regulations

Cost savings:

  • 90-day retention vs. keeping history forever: storage stops growing with age; for multi-year datasets this often means a 75%+ reduction
  • Partition expiration: Automatic cleanup, no manual maintenance
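
To see where retention will pay off first, BigQuery’s storage metadata shows which tables carry the most bytes. A minimal sketch, again assuming the US multi-region:

-- Largest tables by logical storage; prime candidates for TTLs or archiving
SELECT
  table_schema,
  table_name,
  ROUND(total_logical_bytes / POW(1024, 3), 1) AS logical_gb
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
ORDER BY total_logical_bytes DESC
LIMIT 20;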

Pitfall 7: No Cost Guardrails

The Problem: Without guardrails, runaway queries can cost thousands of dollars in minutes.

Why it’s expensive:

  • Unbounded scans: Queries can scan 100+ TB
  • No limits: One bad query can blow your budget
  • No alerts: You don’t know until the bill arrives

The Fix: Set custom query quotas (per project and per user), use reservations for steady workloads, and add cost controls (budgets, alerts, per-query byte caps).

Example:

#  GOOD: Dataset-level defaults (table and partition expiration) via the bq CLI
bq update --default_table_expiration 3600 \
          --default_partition_expiration 86400 \
          my_dataset

#  GOOD: Per-query byte cap (the job fails instead of overspending)
bq query --use_legacy_sql=false \
         --maximum_bytes_billed=10000000000 \
         'SELECT user_id, COUNT(*) AS events
          FROM curated.events
          WHERE DATE(event_ts) BETWEEN "2025-01-01" AND "2025-01-31"
          GROUP BY user_id'

#  GOOD: Custom query quotas (project-wide and per user, per day)
#  Set in the Cloud console: IAM & Admin > Quotas > BigQuery API
#  ("Query usage per day" and "Query usage per day per user")

Cost control strategy:

  1. Custom query quotas: Cap total bytes scanned per project per day
  2. Per-user quotas: Cap bytes scanned per user per day
  3. Per-query limits: Set maximum_bytes_billed on every job (bq flag, API, or console query settings)
  4. Budget alerts: Set up Cloud Billing alerts at 50%, 90%, 100%
  5. Reservations: For steady workloads, use capacity-based pricing (BigQuery editions) with slot commitments

Budget alert example:

# Create budget alerts at 90% and 100% of a $1,000 monthly budget
gcloud billing budgets create \
  --billing-account=YOUR_BILLING_ACCOUNT \
  --display-name="BigQuery Budget Alert" \
  --budget-amount=1000USD \
  --threshold-rule=percent=0.9 \
  --threshold-rule=percent=1.0 \
  --filter-projects=projects/PROJECT_ID
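
Alongside billing alerts, the job metadata views let you attribute on-demand spend to users before the invoice arrives. A minimal sketch, assuming the US multi-region and on-demand pricing of roughly $6.25 per TiB:

-- Approximate on-demand query spend per user over the last 30 days
SELECT
  user_email,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS approx_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY approx_usd DESC;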

Pitfall 8: CTE Performance Issues

The Problem: Using a single CTE in multiple downstream SELECTs can cause multiple executions, multiplying costs.

Why it’s expensive:

  • Multiple executions: CTE computed once per reference
  • No materialization: Results not cached between references
  • Higher complexity: Query planner may not optimize well

The Fix: For CTEs used multiple times, materialize to a temp table to avoid repeated execution.

Example:

--  BAD: CTE used multiple times (may execute multiple times)
WITH expensive_cte AS (
  SELECT
    user_id,
    event_type,
    COUNT(*) AS event_count
  FROM curated.events
  WHERE DATE(event_ts) BETWEEN '2025-01-01' AND '2025-01-31'
  GROUP BY user_id, event_type
)
SELECT
  user_id,
  SUM(event_count) AS total_events
FROM expensive_cte
GROUP BY user_id

UNION ALL

SELECT
  event_type,
  SUM(event_count) AS total_events
FROM expensive_cte
GROUP BY event_type;

--  GOOD: Materialize to temp table (executes once)
CREATE TEMP TABLE temp_user_events AS
SELECT
  user_id,
  event_type,
  COUNT(*) AS event_count
FROM curated.events
WHERE DATE(event_ts) BETWEEN '2025-01-01' AND '2025-01-31'
GROUP BY user_id, event_type;

SELECT
  user_id,
  SUM(event_count) AS total_events
FROM temp_user_events
GROUP BY user_id

UNION ALL

SELECT
  event_type,
  SUM(event_count) AS total_events
FROM temp_user_events
GROUP BY event_type;

When to use temp tables:

  • CTE used 2+ times in the same query
  • Complex CTE (joins, aggregations, window functions)
  • Large intermediate results (> 1 GB)

Best practices:

  • Use CTEs for single-use, simple transformations
  • Use temp tables for multi-use, complex transformations
  • Monitor query execution to identify CTE performance issues

Quick win (cap scan size & prove impact)

# maximum_bytes_billed is a job setting, not a SQL clause: set it via the bq CLI,
# the API, or console query settings. Add --dry_run first to preview the scan size;
# the cap below is 10 GB.
bq query --use_legacy_sql=false \
  --maximum_bytes_billed=10000000000 \
  --parameter=start:DATE:2025-01-01 \
  --parameter=end:DATE:2025-01-31 \
  'SELECT * FROM curated.fact_events
   WHERE DATE(event_ts) BETWEEN @start AND @end'

Outcome: predictable cost, faster queries, cleaner analytics. Start with a 2-week readiness check, then a 4-week blueprint.


© Copyright 2017-2025