DR. ATABAK KH

Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient Cloud platforms.

Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator

How to design Spark pipelines for real-time derived data and ML scoring, with the right shuffle, skew, and storage choices. This article uses real-time campaign segments as the example throughout.

Context: Many pipelines need derived views (e.g. aggregates, segments, attributes) in near real time from raw streams and reference data, plus AI-driven prediction (propensity scores, next-best-action, risk) on top. The architect’s job is to make both the derivation pipeline and the prediction pipeline correct, scalable, and operable, which means paying attention to shuffle, skew, storage layout, and how the prediction step fits into the same big data stack. This post explains how we think about that end to end, using real-time campaign segments as the running example.


End-to-end shape: raw -> curated -> campaigns/segments -> prediction -> serving

From a Spark/big data perspective, the pipeline looks like this:

  1. Raw
    Transactions and product/SKU data land from streaming sources (e.g. Kafka). A Spark Streaming job (or connector) writes to a raw layer partitioned by datetime (and optionally hour/source). No heavy shuffle here; just partition-aligned writes.

  2. Curated
    A Spark batch (or micro-batch) job reads raw by partition, deduplicates, joins to taxonomy (SKU -> category, brand, etc.). Taxonomy is typically small enough to broadcast; the main shuffle is on the transaction key (e.g. user_id, event_id) for dedupe and join. This is where skew first shows up (see below).

  3. Campaigns and segments
    Business rules (e.g. “category affinity = top N categories by spend in last 90 days”) are implemented as Spark aggregations and window functions. Shuffle is by the grouping key (e.g. user_id). Output is written to a curated campaign/segment store (e.g. Parquet by dt and optionally user_id or segment id). Again, skew on a few heavy users can dominate runtime.

  4. Prediction (AI/ML)
    A scoring job (Spark batch or Structured Streaming) reads curated data + campaigns, builds features, runs a model (e.g. MLlib, or a loaded PMML/ONNX), and writes scores (propensity, churn risk, next-best-action) to a serving layer. Feature building often involves joins and groupBys -> shuffle; scoring itself can be partition-local if features are pre-joined. Storage layout of the input to scoring (e.g. one partition per user or per segment) drives how much shuffle you need.

  5. Serving
    Downstream systems read from tables or APIs. The architect’s concern is that the write path from Spark (partitioning, file size, format) matches how serving reads (e.g. by user, by segment, by date).

Below we focus on shuffle, skew, and storage in the curated -> campaigns -> prediction path, and where AI prediction fits in a Spark-centric design.
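Before digging in, the broadcast join used in the curated step can be illustrated in plain Python (a sketch, not Spark code; the field names sku, category, and amount are hypothetical): the small taxonomy becomes an in-memory lookup that is replicated to every worker, so the large transaction set is enriched locally with no shuffle.

```python
# Illustration of a map-side (broadcast) join in plain Python.
# The small taxonomy table becomes an in-memory dict, conceptually
# replicated to every executor; each transaction is enriched by a
# local lookup, so the large side never moves across the network.
taxonomy = [("sku1", "electronics"), ("sku2", "grocery")]
transactions = [
    {"user_id": "u1", "sku": "sku1", "amount": 100.0},
    {"user_id": "u2", "sku": "sku2", "amount": 5.0},
]

# "Broadcast": build the lookup once. In Spark, broadcast(taxonomy_df)
# ships this structure to each executor.
sku_to_category = dict(taxonomy)

enriched = [
    {**t, "category": sku_to_category.get(t["sku"], "unknown")}
    for t in transactions
]
print(enriched[0]["category"])  # electronics
```

The same shape in Spark SQL is a join against broadcast(taxonomy_df); the point is that only the lookup structure moves, never the transactions.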


Shuffle in the campaign and segment pipeline

Where shuffle happens

  • Join curated to taxonomy: If taxonomy is small, broadcast it so that neither side is shuffled: the small table is replicated to every executor, and the join runs locally against the big table’s (transactions’) existing partitions.
  • Aggregations for campaigns: e.g. “spend per user per category in last 90 days” -> groupBy(user_id, category_id). Shuffle is by (user_id, category_id). Size spark.sql.shuffle.partitions so that each task processes on the order of 100-200 MB (larger partitions risk spills and stragglers); repartition by the key you’ll use downstream (e.g. user_id) before writing so that the next job (e.g. scoring) can read by user without reshuffling.
  • Window functions: e.g. “rank categories by spend per user” -> window.partitionBy("user_id").orderBy(desc("spend")). This shuffles by user_id. Again, if a few users have huge history, you get skew (see next section).
  • Deduplication: If you dedupe by (user_id, date) or similar, shuffle is by that key; skew is possible if some users have many more events.
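A rough way to size spark.sql.shuffle.partitions from the expected shuffle volume can be sketched as follows (the 200 MB target per task is an assumption; tune it per cluster and stage):

```python
def shuffle_partitions(shuffle_bytes: int, target_bytes: int = 200 * 1024**2) -> int:
    """Heuristic: one shuffle partition per ~target_bytes of shuffled
    data, never less than 1. target_bytes defaults to ~200 MB, which
    is an assumption, not a universal rule."""
    return max(1, -(-shuffle_bytes // target_bytes))  # ceiling division

# ~1 TB of shuffled aggregation input -> a few thousand partitions
print(shuffle_partitions(10**12))  # 4769
```

The result is what you would set as spark.sql.shuffle.partitions (or let AQE coalesce down to) for that stage; the default of 200 is almost never right for large aggregations.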

Reducing shuffle

  • Broadcast taxonomy and any small lookup tables.
  • Partition alignment: If curated data is partitioned by dt, and campaign output is by dt and user_id, design the job so that it reads one partition (e.g. one day) and writes campaign output for that day. Then the next stage (e.g. scoring) can read “campaigns for dt = X” without shuffling on dt. Shuffle only on the keys that are necessary for the aggregation (e.g. user_id).
  • Single pass where possible: Compute multiple campaigns in one pass (e.g. one groupBy(user_id) with multiple aggregations) instead of multiple jobs that each shuffle by user_id.
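The single-pass idea can be shown in plain Python (a sketch; in Spark this is one df.groupBy("user_id").agg(...) with several aggregate expressions): one grouping of the data produces several campaign inputs at once, so the data is shuffled by user_id once instead of once per metric.

```python
from collections import defaultdict

# Single-pass multi-aggregation: one pass over the events computes
# total spend, event count, and max basket per user, instead of
# three jobs that each shuffle by user_id.
events = [
    ("u1", 10.0), ("u1", 30.0), ("u2", 5.0),
]

agg = defaultdict(lambda: {"spend": 0.0, "events": 0, "max": 0.0})
for user_id, amount in events:       # one pass == one shuffle
    a = agg[user_id]
    a["spend"] += amount
    a["events"] += 1
    a["max"] = max(a["max"], amount)

print(agg["u1"])  # {'spend': 40.0, 'events': 2, 'max': 30.0}
```

Each derived metric then feeds a different campaign rule, but the expensive grouping happened only once.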

Skew in campaigns and segments

Why skew appears

  • User skew: A small fraction of users (e.g. bots, power users, test accounts) have orders of magnitude more transactions. When you groupBy(user_id) or partitionBy(user_id) for windows, a few partitions get huge.
  • Category/product skew: A few categories or products dominate events; joins or groupBys on category can skew.
  • Segment skew: If you write segments by segment_id, a few segments (e.g. “all users”, “high value”) can be huge.

What we do in Spark

  • AQE skew join (Spark 3.x): Enable spark.sql.adaptive.enabled and spark.sql.adaptive.skewJoin.enabled. Adaptive Query Execution (AQE) can then split skewed join partitions at runtime and coalesce small shuffle partitions. Note that skew-join splitting only helps when the skew is in a join; a skewed groupBy or window still needs salting (below).
  • Salting for groupBy/window: If a single groupBy(user_id) is skewed, salt the key: e.g. concat(user_id, "_", floor(rand() * N)), aggregate, then aggregate again by user_id to collapse. That spreads the hot keys across N partitions. Cost: two shuffles and more tasks, but balanced load.
  • Split hot and cold: Identify hot keys (e.g. from a pre-pass or from a list of known power users). Process hot keys in a separate job with salting or broadcast; process the rest in the main job; union. This keeps the main job stable and confines complexity to a small subset.
  • Broadcast for “small side” of skew: If the skewed side is actually a small set (e.g. top 10K users), you can filter them out, join/aggregate the rest, then handle the top 10K with a broadcast or a separate small job.

From an architect’s view: measure task duration distribution in Spark UI; if you see a long tail, enable AQE first, then add salting or split hot/cold for the stages that still skew.
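The two-stage salted aggregation can be sketched in plain Python (illustrative only; in Spark, stage 1 is a groupBy on the salted key and stage 2 is a second groupBy(user_id)). The key property to check is that the collapsed result equals a direct, unsalted aggregation:

```python
import random
from collections import defaultdict

# Two-stage "salted" aggregation for a skewed groupBy(user_id).
# Stage 1 groups by (user_id, salt), spreading a hot key over
# N_SALTS partial aggregates; stage 2 collapses the partials back
# to one row per user.
N_SALTS = 4
events = [("hot_user", 1.0)] * 1000 + [("u2", 5.0), ("u3", 7.0)]

# Stage 1: aggregate by salted key (spreads the hot key's load
# across N_SALTS shuffle partitions).
partial = defaultdict(float)
for user_id, amount in events:
    salt = random.randrange(N_SALTS)
    partial[(user_id, salt)] += amount

# Stage 2: drop the salt and collapse to the final per-user totals.
total = defaultdict(float)
for (user_id, _salt), amount in partial.items():
    total[user_id] += amount

print(total["hot_user"])  # 1000.0
```

For sums, counts, min/max this two-stage decomposition is exact; for non-decomposable aggregations (e.g. exact medians) salting needs a different formulation.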


Storage optimization for campaigns, segments, and prediction output

Partitioning

  • Curated transactions: Partition by dt (and optionally hour). Align with how you’ll reprocess (e.g. “recompute last 7 days” = read 7 partition dirs).
  • Campaign/segment tables: Partition by dt so that “campaigns as of date X” is one partition. If serving reads by user, you can still store by dt and have a compact number of files per partition (see file size below). Avoid partitioning by user_id (millions of partitions) unless you use a bucketing scheme (e.g. bucket by user_id, 256 buckets, partition by dt).
  • Prediction scores: Same idea; partition by dt (or run_id) so that “scores for run at time T” is a single partition. Serving can then read the latest partition or merge by key.
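The bucketing idea above can be sketched as a deterministic key-to-bucket mapping (a plain-Python illustration; Spark’s bucketBy uses its own hash, and the 256-bucket count is just the example from the bullet). The important property is determinism across runs, which is why Python’s built-in hash() (salted per process) is not suitable:

```python
import zlib

# Bucket-by-user layout: instead of one directory per user_id
# (millions of partitions), partition by dt and hash user_id into a
# fixed number of buckets. crc32 is used because it is stable
# across processes and runs, unlike Python's built-in hash().
N_BUCKETS = 256

def bucket(user_id: str, n_buckets: int = N_BUCKETS) -> int:
    return zlib.crc32(user_id.encode("utf-8")) % n_buckets

# Same user always lands in the same bucket, so a point lookup only
# has to read one of the 256 bucket files within a dt partition.
assert bucket("u42") == bucket("u42")
print(0 <= bucket("u42") < N_BUCKETS)  # True
```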

File size and count

  • Target file size: Aim for roughly 128 MB-512 MB per file. Too many small files -> slow listing and many small reads; one huge file per partition -> poor parallelism. Use repartition(N) or coalesce(N) before write so that each partition (e.g. each dt) has about N files. N can be chosen so that partition_size / N is in that range.
  • Compaction: If the campaign or score table is appended to every run (e.g. every 15 minutes), you’ll accumulate many small files. Run a compaction job (e.g. daily): read the partition(s), coalesce to a target number of files, overwrite. That keeps read performance stable.
  • Overwrite by partition: For idempotent “refresh campaigns for last N days”, overwrite only those partitions (dynamic partition overwrite where supported). That avoids full-table rewrites and keeps storage predictable.
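The file-count choice above can be reduced to a small helper (a sketch; the 256 MB target is an assumption sitting in the middle of the 128-512 MB band, and the result is what you would pass to repartition(N) or coalesce(N) before the write):

```python
# Choosing the number of output files for a dt partition so that
# each file lands near the 128-512 MB band discussed above.
TARGET_FILE_BYTES = 256 * 1024**2  # midpoint target; an assumption

def num_output_files(partition_bytes: int,
                     target: int = TARGET_FILE_BYTES) -> int:
    """Value to pass to repartition(N)/coalesce(N) before the write."""
    return max(1, round(partition_bytes / target))

# A 10 GB day of campaign output -> 40 files of ~256 MB each
print(num_output_files(10 * 1024**3))  # 40
```

The same helper drives the compaction job: read the partition, measure its size, coalesce to num_output_files(size), overwrite.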

Format

  • Parquet for curated, campaigns, and scores: columnar, predicate pushdown (e.g. filter(dt = '...')), good compression. Use Snappy or Zstd. If the serving layer needs row-level lookups by key (e.g. user_id), you can still store as Parquet and use a key-value cache or a separate index that’s fed from the same Spark output; or use a table format (Delta, Iceberg) for upserts if you need to update scores in place.

Where AI-based prediction fits (Spark view)

Typical flow

  • Input: Curated transactions + product taxonomy + computed campaigns and segments (all in the same data lake or warehouse, often as Parquet).
  • Feature build: Spark job that joins and aggregates to build a feature table (e.g. one row per user with columns = features). This step can have significant shuffle (e.g. by user_id) and skew; same principles apply: broadcast small tables, AQE, salting or split if needed.
  • Model: Trained offline (e.g. with Spark MLlib, or external Python/R). The trained model (e.g. PMML, ONNX, or native Spark model) is stored in object storage or a registry.
  • Scoring: Spark batch job (or micro-batch) loads the model, reads the feature table (partitioned by dt or user_id), and applies the model. If the feature table is already partitioned by user (or by a key that matches the model’s expectation), scoring can be partition-local (no shuffle): each task reads a chunk of the feature table and writes a chunk of scores. That scales well.
  • Output: Scores (e.g. propensity_buy_X, churn_risk, next_best_action) are written to a table (e.g. Parquet by dt) or to a key-value store for low-latency serving. Writing is again optimized by repartitioning by the table partition key and controlling file size.
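The partition-local scoring step can be sketched in plain Python (illustrative; the linear score function and feature names recency and spend are stand-ins for a real model, and in Spark this pattern is mapPartitions or mapInPandas over the feature DataFrame). Each "partition" is scored independently, with no data movement between them:

```python
# Partition-local scoring: if the feature table is already laid out
# by key, each task scores its own chunk with no shuffle. The
# "model" here is a stand-in linear scorer.

def score(features):                      # stand-in for a loaded model
    return 0.7 * features["recency"] + 0.3 * features["spend"]

# Two "partitions" of the feature table, e.g. two files of one dt dir.
partitions = [
    [{"user_id": "u1", "recency": 1.0, "spend": 2.0}],
    [{"user_id": "u2", "recency": 0.0, "spend": 1.0}],
]

# Each partition is scored independently; output partitioning matches
# input partitioning, so the write needs no repartition either.
scored = [
    [{"user_id": row["user_id"], "score": score(row)} for row in part]
    for part in partitions
]
print(scored[1][0]["score"])  # 0.3
```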

Freshness vs. cost

  • Micro-batch scoring: e.g. every 5-15 minutes, build features for “users with new activity or in campaign scope”, score them, update the serving layer. This keeps scores fresh without scoring the full user base every time. Shuffle is limited to the feature-build step; scoring itself can be a narrow read of the feature table and a partition-local transform.
  • Full refresh: Periodically (e.g. nightly), run a full feature build + score for all users so that the serving layer has a consistent snapshot. Same storage and shuffle discipline: partition by dt, control file size, compact if needed.

Governance

  • Track model version and training data window (e.g. which dt range was used for training). Store them in the score output or in a separate metadata table so that downstream and ops know what they’re using. Document which campaigns are rule-based vs. model outputs so that the pipeline remains auditable.

Late events and reprocessing

Campaigns and segments often depend on event time (e.g. “last 30 days”). Events can arrive late; taxonomy can be corrected.

  • Streaming: Use event-time and watermarking so that late events within the watermark are included; beyond that, drop or send to a side output for batch backfill.
  • Batch: Reprocess raw partitions for the affected window (e.g. last 24-48 hours), recompute curated and campaigns, then overwrite the corresponding campaign/segment partitions. Idempotent and consistent if keys and logic are deterministic. Same shuffle/skew/storage considerations as the main pipeline.
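The watermark routing in the streaming bullet can be sketched in plain Python (a simplified model of what withWatermark("event_time", "2 hours") implies, not Structured Streaming’s actual implementation): an event is accepted if it is no older than the maximum event time seen minus the watermark; anything older goes to a side output for batch backfill.

```python
# Simplified watermark routing for late events. Times are epoch
# seconds; the 2-hour watermark mirrors the streaming example.
WATERMARK_S = 2 * 3600

def route(events):
    """events: (event_time_s, payload) pairs in arrival order."""
    max_seen = float("-inf")
    accepted, side_output = [], []
    for event_time, payload in events:
        max_seen = max(max_seen, event_time)
        if event_time >= max_seen - WATERMARK_S:
            accepted.append(payload)
        else:
            side_output.append(payload)   # beyond watermark -> backfill
    return accepted, side_output

ok, late = route([(10_000, "a"), (20_000, "b"), (13_000, "c"), (1_000, "d")])
print(late)  # ['d']
```

Event "c" is late but within the watermark, so it is still included; "d" falls outside and is routed to the backfill path.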

Summary: architect’s view

From a big data solution architect’s perspective, a real-time campaign and segment pipeline with AI prediction on Spark rests on:

  • Clear data layout: Raw and curated partitioned by dt (and optional hour/source); campaign and score tables partitioned by dt (or run_id); taxonomy broadcast so that joins don’t shuffle the small side.
  • Shuffle discipline: Broadcast small tables; size spark.sql.shuffle.partitions; repartition by the table partition key before write; single-pass aggregations where possible; feature-build job designed so that scoring can be partition-local.
  • Skew handling: AQE skew join and adaptive coalesce; salting or split hot/cold for groupBy/window/join stages that show skew in Spark UI.
  • Storage: Parquet, 128-512 MB target file size, compaction for append-heavy tables, overwrite-by-partition for idempotent reprocessing and merges.
  • Prediction: Feature build as a Spark job (with the same shuffle/skew/storage care); scoring as a partition-local read + model apply + partitioned write; micro-batch for freshness, full refresh for consistency; model and training metadata tracked for governance.

Treating the campaign pipeline and the prediction pipeline as one big data system, with consistent partitioning, controlled shuffle, explicit skew handling, and deliberate storage layout, is what makes the system scalable and operable at production scale. The same principles apply whether the downstream consumer is a campaign engine, a recommendation API, or an internal dashboard.

This is a personal blog. The views, thoughts, and opinions expressed here are my own and do not represent, reflect, or constitute the views, policies, or positions of any employer, university, client, or organization I am associated with or have been associated with.

© Copyright 2017-2026