DR. ATABAK KH
Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient cloud platforms.
Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator
How to design Spark pipelines for real-time derived data and ML scoring, with the right shuffle, skew, and storage choices. This article uses real-time campaign segments as the example throughout.
Context: Many pipelines need derived views (e.g. aggregates, segments, attributes) in near real time from raw streams and reference data, plus AI-driven prediction (propensity scores, next-best-action, risk) on top. The architect’s job is to make both the derivation pipeline and the prediction pipeline correct, scalable, and operable, which means paying attention to shuffle, skew, storage layout, and how the prediction step fits into the same big data stack. This post explains how we think about that end to end, using real-time campaign segments as the running example.
From a Spark/big data perspective, the pipeline looks like this:
Raw
Transactions and product/SKU data land from streaming sources (e.g. Kafka). A Spark Streaming job (or connector) writes to a raw layer partitioned by datetime (and optionally hour/source). No heavy shuffle here; just partition-aligned writes.
Curated
A Spark batch (or micro-batch) job reads raw by partition, deduplicates, joins to taxonomy (SKU -> category, brand, etc.). Taxonomy is typically small enough to broadcast; the main shuffle is on the transaction key (e.g. user_id, event_id) for dedupe and join. This is where skew first shows up (see below).
Campaigns and segments
Business rules (e.g. “category affinity = top N categories by spend in last 90 days”) are implemented as Spark aggregations and window functions. Shuffle is by the grouping key (e.g. user_id). Output is written to a curated campaign/segment store (e.g. Parquet by dt and optionally user_id or segment id). Again, skew on a few heavy users can dominate runtime.
Prediction (AI/ML)
A scoring job (Spark batch or Structured Streaming) reads curated data + campaigns, builds features, runs a model (e.g. MLlib, or a loaded PMML/ONNX), and writes scores (propensity, churn risk, next-best-action) to a serving layer. Feature building often involves joins and groupBys -> shuffle; scoring itself can be partition-local if features are pre-joined. Storage layout of the input to scoring (e.g. one partition per user or per segment) drives how much shuffle you need.
Serving
Downstream systems read from tables or APIs. The architect’s concern is that the write path from Spark (partitioning, file size, format) matches how serving reads (e.g. by user, by segment, by date).
Below we focus on shuffle, skew, and storage in the curated -> campaigns -> prediction path, and where AI prediction fits in a Spark-centric design.
Where shuffle happens
- Aggregations: groupBy(user_id, category_id). Shuffle is by (user_id, category_id). Size spark.sql.shuffle.partitions so that each partition gets a few hundred MB to a few GB; repartition by the key you’ll use downstream (e.g. user_id) before writing so that the next job (e.g. scoring) can read by user without reshuffling.
- Windows: window.partitionBy("user_id").orderBy(desc("spend")). This shuffles by user_id. Again, if a few users have huge history, you get skew (see next section).
- Dedupe: keyed by (user_id, date) or similar, shuffle is by that key; skew is possible if some users have many more events.
Reducing shuffle
- Align partitioning across stages. If curated data is partitioned by dt, and campaign output is by dt and user_id, design the job so that it reads one partition (e.g. one day) and writes campaign output for that day. Then the next stage (e.g. scoring) can read “campaigns for dt = X” without shuffling on dt. Shuffle only on the keys that are necessary for the aggregation (e.g. user_id).
- Use single-pass aggregations (one groupBy(user_id) with multiple aggregations) instead of multiple jobs that each shuffle by user_id.
Why skew appears
- A few users generate far more events than the rest; with groupBy(user_id) or partitionBy(user_id) for windows, a few partitions get huge.
- When grouping or partitioning by segment_id, a few segments (e.g. “all users”, “high value”) can be huge.
What we do in Spark
- Enable spark.sql.adaptive.enabled and spark.sql.adaptive.skewJoin.enabled. For joins, Adaptive Query Execution (AQE) can split skewed partitions at runtime; it can also coalesce small shuffle partitions. Skew-join handling helps when the skew is in a join.
- If a groupBy(user_id) is skewed, salt the key: append a random salt in [0, N) (e.g. concat(user_id, "_", salt)), aggregate, then aggregate again by user_id to collapse the salt. That spreads the hot keys across N partitions. Cost: two shuffles and more tasks, but balanced load.
From an architect’s view: measure the task duration distribution in the Spark UI; if you see a long tail, enable AQE first, then add salting or split hot/cold keys for the stages that still skew.
Partitioning
- Raw: partition by dt (and optionally hour). Align with how you’ll reprocess (e.g. “recompute last 7 days” = read 7 partition dirs).
- Campaign/segment tables: partition by dt so that “campaigns as of date X” is one partition. If serving reads by user, you can still store by dt and keep a compact number of files per partition (see file size below). Avoid partitioning by user_id (millions of partitions) unless you use a bucketing scheme (e.g. bucket by user_id, 256 buckets, partition by dt).
- Score tables: partition by dt (or run_id) so that “scores for run at time T” is a single partition. Serving can then read the latest partition or merge by key.
File size and count
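A back-of-envelope helper for choosing the file count per partition; the 512 MB target is a common rule of thumb for Parquet on distributed storage, not a number from this pipeline.

```python
def files_per_partition(partition_bytes: int,
                        target_file_bytes: int = 512 * 1024 * 1024) -> int:
    """Choose N for repartition(N)/coalesce(N) before the write so that
    each output file lands near the target size. 512 MB is a common
    rule-of-thumb target; tune it for your readers and storage."""
    return max(1, round(partition_bytes / target_file_bytes))

# A 10 GB daily partition with a 512 MB target -> about 20 files per dt dir.
n = files_per_partition(10 * 1024**3)
```

In practice the input to this calculation comes from observed partition sizes (e.g. the size of yesterday’s dt directory), recomputed as volumes grow.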
Use repartition(N) or coalesce(N) before the write so that each partition (e.g. each dt) has about N files, with N chosen so that partition_size / N lands in a healthy file-size range.
Format
Use a columnar format such as Parquet: efficient scans, partition pruning on filters (filter(dt = '...')), good compression. Use Snappy or Zstd. If the serving layer needs row-level lookups by key (e.g. user_id), you can still store as Parquet and use a key-value cache or a separate index that’s fed from the same Spark output; or use a table format (Delta, Iceberg) for upserts if you need to update scores in place.
Typical flow
Scores (e.g. propensity_buy_X, churn_risk, next_best_action) are written to a table (e.g. Parquet by dt) or to a key-value store for low-latency serving. The write is again optimized by repartitioning by the table partition key and controlling file size.
Freshness vs. cost
Governance
Campaigns and segments often depend on event time (e.g. “last 30 days”). Events can arrive late; taxonomy can be corrected.
From a big data solution architect’s perspective, a real-time campaign and segment pipeline with AI prediction on Spark rests on:
- Storage layout: raw partitioned by dt (and optional hour/source); campaign and score tables partitioned by dt (or run_id); taxonomy broadcast so that joins don’t shuffle the small side.
- Shuffle discipline: a sized spark.sql.shuffle.partitions; repartition by the table partition key before write; single-pass aggregations where possible; a feature-build job designed so that scoring can be partition-local.
Treating the campaign pipeline and the prediction pipeline as one big data system, with consistent partitioning, controlled shuffle, explicit skew handling, and deliberate storage layout, is what makes the system scalable and operable at production scale. The same principles apply whether the downstream consumer is a campaign engine, a recommendation API, or an internal dashboard.
This is a personal blog. The views, thoughts, and opinions expressed here are my own and do not represent, reflect, or constitute the views, policies, or positions of any employer, university, client, or organization I am associated with or have been associated with.