The client wants everything in real-time. They envision AI predictions appearing instantaneously as data flows through their systems. But when you dig into the actual business requirements, you discover that their churn predictions are used in weekly marketing meetings, their demand forecasts inform monthly planning cycles, and their anomaly detection reviews happen each morning. None of these use cases require real-time processing. They need batch processing that feels current: predictions generated overnight and available when users need them.
The batch vs. real-time architecture decision is one of the most impactful choices in enterprise AI system design. Real-time processing is more complex, more expensive, and harder to maintain. Batch processing is simpler, cheaper, and easier to debug. Choosing the wrong pattern, usually over-engineering toward real-time when batch is sufficient, wastes budget, increases maintenance burden, and delays time to value.
Understanding the Processing Patterns
Batch Processing
Batch processing runs predictions on a schedule: hourly, daily, weekly, or on-demand. A batch job processes a set of inputs, generates predictions for all of them, and stores the results for later retrieval.
How it works: A scheduled job pulls input data from a data warehouse or data lake, runs inference on all records, and writes predictions to a results table or feature store. Users and applications query the results table when they need predictions.
Characteristics:
- Processing happens on a schedule, not on-demand
- All inputs are processed together
- Results are stored and served from a database
- Latency is minutes to hours (time since last batch run)
- Cost scales with data volume per run
Common batch architectures:
- Scheduled Spark or Python jobs on a cluster
- Airflow-orchestrated pipeline
- SageMaker Processing or Batch Transform
- Cloud Functions triggered on a schedule
- dbt models that compute features and predictions
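The batch pattern can be sketched with nothing but the standard library: a scheduled job pulls every record from the warehouse, scores it, and writes a results table for later retrieval. Here `churn_score` is a hypothetical stand-in for real model inference, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3
from datetime import datetime, timezone

def churn_score(tenure_months: int, support_tickets: int) -> float:
    """Hypothetical stand-in for real model inference."""
    return min(1.0, 0.1 + 0.05 * support_tickets - 0.002 * tenure_months)

def run_batch_job(conn: sqlite3.Connection) -> int:
    """Score every customer in one pass and persist the results."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions "
        "(customer_id INTEGER PRIMARY KEY, score REAL, scored_at TEXT)"
    )
    rows = conn.execute(
        "SELECT customer_id, tenure_months, support_tickets FROM customers"
    ).fetchall()
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT OR REPLACE INTO predictions VALUES (?, ?, ?)",
        [(cid, churn_score(t, s), now) for cid, t, s in rows],
    )
    conn.commit()
    return len(rows)

# In-memory "warehouse" standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers "
    "(customer_id INTEGER, tenure_months INTEGER, support_tickets INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 24, 0), (2, 3, 4), (3, 60, 1)],
)
n_scored = run_batch_job(conn)
```

Note the shape: the job's cost is proportional to the one pass over the data, and downstream consumers only ever query the `predictions` table.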
Real-Time (Online) Processing
Real-time processing generates predictions on-demand in response to individual requests. An application sends input data to an API, the model processes the input, and returns the prediction within milliseconds to seconds.
How it works: A model is deployed as an API endpoint. When a request arrives, the service loads features, runs inference, and returns the prediction. Features may be pre-computed and cached or computed in real-time.
Characteristics:
- Processing happens on-demand per request
- Each input is processed individually
- Results are returned immediately
- Latency is milliseconds to seconds
- Cost scales with request volume
Common real-time architectures:
- Model serving frameworks (TensorFlow Serving, TorchServe, Triton)
- Custom API (FastAPI, Flask) with model loaded in memory
- SageMaker Endpoints
- Managed ML serving (Vertex AI Prediction, Azure ML Endpoints)
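The defining trait of online serving is that the model loads once at startup and each request is then scored individually. This sketch uses a plain class rather than a real framework so it stays self-contained; in production the same shape would sit behind FastAPI or a serving framework, and the lambda "model" is a hypothetical stand-in.

```python
import time

class PredictionService:
    """Minimal sketch of an online endpoint: pay the model-loading
    cost once at startup, then score individual requests."""

    def __init__(self):
        # Expensive step, paid once, not per request.
        self.model = self._load_model()

    def _load_model(self):
        # Stand-in for loading weights from disk or a model registry.
        return lambda features: min(1.0, 0.1 + 0.05 * features["support_tickets"])

    def predict(self, features: dict) -> dict:
        start = time.perf_counter()
        score = self.model(features)
        latency_ms = (time.perf_counter() - start) * 1000
        return {"score": score, "latency_ms": latency_ms}

service = PredictionService()                     # startup: model in memory
response = service.predict({"support_tickets": 4})  # per-request inference
```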
Near-Real-Time (Streaming) Processing
Streaming processing is a middle ground: events are processed as they arrive in a continuous stream, typically with latency of seconds to minutes.
How it works: Events flow through a message queue (Kafka, Kinesis). A stream processing application consumes events, enriches them with features, runs inference, and produces predictions as output events.
Characteristics:
- Processing happens continuously as events arrive
- Each event is processed individually or in micro-batches
- Latency is seconds to minutes
- Handles high-throughput event streams
- More complex than batch, simpler than synchronous real-time for high-volume use cases
Common streaming architectures:
- Kafka + Flink/Spark Streaming + model inference
- Kinesis + Lambda + SageMaker endpoint
- Cloud Pub/Sub + Dataflow + model serving
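The consumer loop at the heart of these architectures can be sketched without a broker: events are pulled off a stream, scored in micro-batches, and alerts emitted downstream. Here an in-memory queue of sensor readings stands in for Kafka/Kinesis, and `is_anomaly` is a hypothetical stand-in for model inference.

```python
from collections import deque

def is_anomaly(reading: dict, threshold: float = 100.0) -> bool:
    """Hypothetical stand-in for model inference on one event."""
    return reading["temperature"] > threshold

def consume(stream, micro_batch_size: int = 2) -> list:
    """Pull events, score them in micro-batches, emit alert events."""
    alerts, batch = [], []
    for event in stream:
        batch.append(event)
        if len(batch) >= micro_batch_size:
            # In Flink/Spark Streaming this is one task over a window.
            alerts.extend(e["sensor_id"] for e in batch if is_anomaly(e))
            batch.clear()
    # Flush any partial final batch.
    alerts.extend(e["sensor_id"] for e in batch if is_anomaly(e))
    return alerts

# In production this stream is a Kafka/Kinesis consumer.
events = deque([
    {"sensor_id": "a", "temperature": 72.0},
    {"sensor_id": "b", "temperature": 131.5},
    {"sensor_id": "c", "temperature": 68.2},
])
alerts = consume(events)
```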
Choosing the Right Pattern
Decision Framework
Answer these questions to determine the appropriate processing pattern:
1. What is the acceptable prediction latency?
- Minutes to hours → Batch
- Seconds to minutes → Streaming
- Milliseconds → Real-time
2. How frequently does the user need updated predictions?
- Daily or less → Batch
- Every few minutes → Streaming
- Every request → Real-time
3. What triggers a prediction?
- A scheduled time → Batch
- An incoming event or data update → Streaming
- A user action or application request → Real-time
4. How many predictions are needed per period?
- All records at once (bulk) → Batch
- Continuous stream of events → Streaming
- Individual, on-demand requests → Real-time
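The latency and trigger questions tend to dominate the decision, and they can be encoded as a simple rule set. This is a sketch of the framework above, not a complete decision procedure; a real choice also weighs cost, team capability, and the frequency and volume questions.

```python
def choose_pattern(latency: str, trigger: str) -> str:
    """Encode questions 1 and 3 of the framework: acceptable latency
    and what triggers a prediction. Inputs are coarse labels."""
    if latency == "milliseconds" or trigger == "request":
        return "real-time"
    if latency == "seconds" or trigger == "event":
        return "streaming"
    return "batch"

# Weekly churn review: scheduled trigger, hours of acceptable latency.
choice = choose_pattern(latency="hours", trigger="schedule")
```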
Use Case Analysis
Churn prediction → Batch: Predictions are consumed in weekly or monthly reviews. Processing all customers overnight and making results available in the CRM each morning is perfectly adequate. Real-time churn scoring adds complexity without business value.
Fraud detection → Real-time: Each transaction must be scored before it is approved or declined. A fraud score that arrives 10 minutes after the transaction is useless. Real-time inference is required.
Demand forecasting → Batch: Forecasts inform purchasing, inventory, and staffing decisions made on daily or weekly cycles. Batch processing aligned to the planning cadence is appropriate.
Recommendation engine → Real-time or hybrid: Homepage recommendations can be pre-computed (batch) and served from cache. In-session recommendations that adapt to current browsing behavior require real-time scoring. Most production recommendation systems are hybrid: batch for cold-start and periodic updates, real-time for session adaptation.
Anomaly detection in IoT → Streaming: Sensor data arrives continuously. Anomalies must be detected within minutes to prevent equipment damage. Streaming processing handles the continuous data flow with acceptable latency.
Lead scoring → Batch or near-real-time: Batch scoring overnight is sufficient if leads are reviewed daily. Real-time scoring may be warranted if leads are routed to sales reps immediately upon submission.
Content moderation → Real-time: User-generated content must be screened before it is visible to other users. Even a few minutes of visibility for harmful content is unacceptable. Real-time inference is required.
Architecture Considerations
Feature Engineering
The feature engineering approach differs significantly between batch and real-time:
Batch features: Computed over historical windows using SQL, Spark, or Python processing. Can involve complex aggregations, joins across multiple tables, and historical lookbacks. Batch features are straightforward to compute because all data is available.
Real-time features: Must be computed in milliseconds. This limits the complexity of feature engineering: you cannot run a 30-second SQL query during a real-time prediction request. Real-time features typically come from:
- Pre-computed features stored in a feature store or cache
- Simple computations on the input data
- Event-based aggregates maintained in a streaming system
Feature store: For systems that need both batch and real-time features, a feature store (Feast, Tecton, Databricks Feature Store) provides a unified interface. Features are computed in batch and served with low latency for real-time requests.
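The split can be sketched as two functions: a batch step that aggregates features and materializes them, and an online step that is a constant-time lookup. Here a plain dict stands in for the feature store, and the aggregation stands in for what would normally be a SQL or Spark job.

```python
from collections import defaultdict

def compute_features(transactions: list) -> dict:
    """Batch step: aggregate 30-day spend per customer. In production
    this is a SQL/Spark job materializing to a feature store."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for t in transactions:
        totals[t["customer_id"]] += t["amount"]
        counts[t["customer_id"]] += 1
    return {
        cid: {"spend_30d": totals[cid], "txn_count_30d": counts[cid]}
        for cid in totals
    }

def get_features(store: dict, customer_id: str) -> dict:
    """Online step: an O(1) lookup, fast enough for a real-time request."""
    return store.get(customer_id, {"spend_30d": 0.0, "txn_count_30d": 0})

store = compute_features([
    {"customer_id": "c1", "amount": 40.0},
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "c2", "amount": 99.0},
])
features = get_features(store, "c1")
```

The design point is that the expensive aggregation runs on the batch side; the online side only ever pays for a key lookup, which is what keeps real-time latency in budget.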
Model Serving
Batch model serving: Load the model, process all inputs, and shut down. Model loading time is amortized across all inputs. Can use larger models because latency per prediction is not critical.
Real-time model serving: Model stays loaded in memory, ready to process individual requests. Model loading happens once at startup, and inference latency is critical. Larger models may require GPU serving or model optimization (quantization, distillation) to meet latency requirements.
Scaling:
- Batch: Scale compute up for the processing window, then scale down. Cost is proportional to processing time per run.
- Real-time: Maintain always-on instances to handle incoming requests. Scale horizontally based on request volume. Cost is proportional to uptime plus traffic volume.
Cost Comparison
Real-time serving is typically 3-10x more expensive than batch processing for the same volume of predictions because:
- Always-on instances incur continuous cost, even during low-traffic periods
- GPU instances for real-time serving are expensive
- Redundancy requirements (multiple instances for availability) multiply the base cost
- Feature store and caching infrastructure add additional cost
- Monitoring and alerting for real-time systems are more complex
For a system scoring 100,000 customers:
- Batch: A scheduled job runs for 30 minutes on a moderate compute instance once per day. Cost: approximately $50-$200/month.
- Real-time: An always-on endpoint with auto-scaling handles requests throughout the day. Cost: approximately $500-$3,000/month depending on traffic patterns and GPU requirements.
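The arithmetic behind these figures is simple, and laying it out shows why the gap is so large. The hourly rates below are hypothetical placeholders; substitute your cloud provider's actual pricing.

```python
# Illustrative monthly cost comparison; rates are hypothetical.
BATCH_RATE = 4.00      # $/hour for a moderate batch compute instance
REALTIME_RATE = 1.20   # $/hour per always-on serving instance

batch_hours = 0.5 * 30                   # 30-minute job, once per day
batch_cost = batch_hours * BATCH_RATE    # pay only while the job runs

realtime_hours = 24 * 30                 # endpoint is always on
instances = 2                            # redundancy for availability
realtime_cost = realtime_hours * REALTIME_RATE * instances

cost_ratio = realtime_cost / batch_cost
```

Even with a cheaper per-hour rate, the always-on endpoint pays for 1,440 instance-hours a month against the batch job's 15, which is where the order-of-magnitude difference comes from.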
Reliability and Monitoring
Batch reliability: If a batch job fails, you re-run it. Stale predictions (from the last successful run) are available as a fallback. Batch failures are visible, debuggable, and recoverable.
Real-time reliability: If the serving endpoint goes down, predictions are unavailable. Applications that depend on real-time predictions may fail or degrade. Real-time systems require redundancy, health checks, automatic failover, and alerting, all of which add operational complexity.
Monitoring:
- Batch: Monitor job completion, processing time, output quality, and data freshness.
- Real-time: Monitor endpoint availability, latency percentiles (p50, p95, p99), error rates, throughput, and model quality, continuously.
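Latency percentiles are straightforward to compute from a rolling window of recent request timings. A minimal sketch using the standard library's `statistics.quantiles` (the simulated latency samples are illustrative):

```python
import statistics

def latency_percentiles(samples_ms: list) -> dict:
    """Compute p50/p95/p99 from a window of request latencies."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# 1,000 simulated request latencies: mostly fast, with a slow tail.
samples = [10.0] * 950 + [80.0] * 40 + [400.0] * 10
stats = latency_percentiles(samples)
```

The slow tail is why p95/p99 matter: the median can look healthy while a meaningful fraction of requests blow the latency budget.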
The Hybrid Pattern
Most production AI systems use a combination of batch and real-time processing.
Pre-compute in batch, serve in real-time: Compute predictions for all entities in batch, store them in a low-latency database, and serve them via API. This combines the simplicity of batch computation with the responsiveness of real-time serving.
Batch for baseline, real-time for adjustment: Compute baseline predictions in batch and adjust them in real-time based on new information. A recommendation system might compute base recommendations overnight and adjust them based on the current session's click behavior.
Batch for training, real-time for inference: Train models in batch using historical data. Deploy trained models to real-time endpoints for online inference.
Cold-start batch, warm-path real-time: For new entities (new users, new products), serve pre-computed default predictions from batch. For entities with sufficient interaction history, compute personalized predictions in real-time.
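The cold-start pattern can be sketched as a cache lookup with an on-demand fallback. All names here are hypothetical, and `realtime_score` stands in for actual model inference.

```python
def realtime_score(history: list) -> float:
    """Warm path: on-demand scoring for entities with history;
    a stand-in for real model inference."""
    return sum(history) / len(history)

def get_prediction(entity_id: str, batch_scores: dict,
                   history: dict, min_events: int = 3) -> float:
    """Serve the batch-precomputed default until an entity has enough
    interaction history to justify a personalized real-time score."""
    events = history.get(entity_id, [])
    if len(events) >= min_events:
        return realtime_score(events)           # warm path
    return batch_scores.get(entity_id, 0.5)     # cold start / default

batch_scores = {"new_user": 0.5, "active_user": 0.4}
history = {"active_user": [0.9, 0.8, 0.7]}
cold = get_prediction("new_user", batch_scores, history)
warm = get_prediction("active_user", batch_scores, history)
```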
Client Communication
Setting Expectations
Help clients understand the trade-offs:
"Real-time processing gives you immediate predictions, but it costs 5-10x more to operate and takes longer to build. Batch processing gives you predictions that are hours old but is simpler, cheaper, and more reliable. For your use case โ weekly marketing campaigns based on churn predictions โ batch processing delivers the same business value at a fraction of the cost and complexity."
Avoiding Over-Engineering
Enterprise clients often default to "we want real-time" because it sounds better. Push back when batch is sufficient:
"I recommend we start with batch processing. This gets your team using predictions within 4 weeks instead of 10 weeks. If we find that the business process requires fresher predictions, we can upgrade to real-time later โ and the model development work is reusable. Starting simple means you see value sooner and spend less upfront."
The right architecture is the simplest one that meets the business requirements. Batch when you can, real-time when you must, and hybrid when the use case demands both. The agencies that help clients make this decision wisely deliver systems that are cost-effective to operate, reliable in production, and straightforward to maintain โ which is exactly what enterprise clients need.