Batch vs Streaming

Batch and streaming compared through latency, operations, contracts, cost, ML serving, and product tradeoffs.

Related Wiki Pages

Data Engineering, Data Engineering Platforms, Data Pipelines, Streaming, DataOps, Orchestration, Apache Airflow, Machine Learning System Design, Machine Learning Infrastructure, Data Quality and Observability, Data Engineering Portfolio Projects

Batch processing handles bounded chunks of data. It covers scheduled warehouse jobs, backfills, training set creation, and batch inference. Stream processing handles events as they arrive from queues, brokers, or production services. That makes batch vs streaming a data pipeline design question, not only a tool choice.

The distinction sits near streaming, data engineering platforms, DataOps, and machine learning system design when a pipeline feeds a model-backed product. Events may arrive through Kafka or Kinesis. Teams then choose whether to react immediately or store the data first and transform it later.^[1]

Mode Differences

Batch fits work where the consumer can wait and teams benefit from explicit dependencies. A batch job can declare upstream data and a time window. It can also declare downstream dependencies.^[2] That makes batch natural for Apache Airflow and warehouse transformations in the modern data stack. It also fits backfills, training datasets, and scheduled scoring. Orchestration covers the broader scheduling and dependency model behind those jobs.
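The explicit dependencies described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not Airflow's API; the `BatchJob` class, the job names, and the daily window are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class BatchJob:
    """Hypothetical job spec: a name plus the upstream jobs it reads from."""
    name: str
    upstream: tuple[str, ...] = ()


def run_order(jobs: list[BatchJob]) -> list[str]:
    """Return an order in which every job runs after all of its upstream jobs."""
    graph = {job.name: set(job.upstream) for job in jobs}
    return list(TopologicalSorter(graph).static_order())


def plan(jobs: list[BatchJob], window: date) -> list[tuple[str, date]]:
    """Pair each job with the time window it should process."""
    return [(name, window) for name in run_order(jobs)]


jobs = [
    BatchJob("raw_orders"),
    BatchJob("clean_orders", upstream=("raw_orders",)),
    BatchJob("daily_revenue", upstream=("clean_orders",)),
    BatchJob("training_set", upstream=("clean_orders",)),
]
order = run_order(jobs)
plan_for_day = plan(jobs, date(2024, 1, 1))
```

Because the graph is declared up front, a cycle is rejected before anything runs: `static_order()` raises `graphlib.CycleError` in that case.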

Streaming fits work where delay changes the product outcome. Reporting, middle-latency stream processing, and sub-100-millisecond serving paths are different tiers. The tightest paths belong inside the serving application, not in a separate stream job.^[2]

The practical decision isn’t whether batch is old or streaming is modern. Name the downstream action and the latest useful arrival time. Then name the owner of the table or event interface and the checks for lag, bad data, stale ML features, and downstream breakage.
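One way to make the "latest useful arrival time" concrete is to turn it into a number and map it to a tier. The sketch below is illustrative only: the 100 ms limit echoes the serving tier mentioned above, while the one-minute cutoff and the function name `suggest_mode` are assumptions that a team would replace with its own requirements.

```python
from datetime import timedelta

# Assumed thresholds for illustration; they are not taken from any standard.
IN_APP_LIMIT = timedelta(milliseconds=100)
STREAM_LIMIT = timedelta(minutes=1)


def suggest_mode(latest_useful_arrival: timedelta) -> str:
    """Map the latest useful arrival time of a result to a processing tier."""
    if latest_useful_arrival < IN_APP_LIMIT:
        return "serving-application"
    if latest_useful_arrival < STREAM_LIMIT:
        return "streaming"
    return "batch"


fraud_block = suggest_mode(timedelta(milliseconds=50))
alerting = suggest_mode(timedelta(seconds=5))
daily_report = suggest_mode(timedelta(hours=24))
```

The output is a starting point for discussion, not a verdict: ownership, checks, and cost still have to be named separately.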

Boundary Differences

One view is skeptical of streaming as a default because dependency management is less explicit than in workflow-orchestrated batch. Batch can often reach minute-level windows, and sometimes second-level windows, before full stream processing is needed.^[2]

Another view treats batch or streaming as one processing-mode choice inside a larger production pipeline. Ingestion and queues still have to fit storage and orchestration. Spark or Flink processing then has to fit the same path.^[3]

Kretz’s practical split is between reacting immediately from the queue and storing first to process later. Both are pipelines, and the difference is whether the consumer needs to act as events arrive or can wait for a batch window.

A third view puts more weight on the organizational cost of streaming. Kafka adds onboarding work around schemas, registry practice, and rules for which schema changes are allowed.^[4]

That keeps streaming close to data governance, Data Mesh, and data products, not only brokers and compute engines.

Latency and Product Action

Batch fits reports, warehouse models, and backfills. It also fits campaigns and model jobs where delayed results still support the decision. In batch inference, teams load and preprocess data, build features, and write outputs.^[5] In ML platform design, batch inference often looks closer to training than to online serving. A workflow loads data, preprocesses it, runs training or inference, and writes an output artifact or prediction table.^[6]

Teams operate that structure with experiment tracking, model registries, and data quality and observability. Platform teams then choose a deployment mode. A scheduled batch job can write predictions for later use. Online serving exposes an API for request-time decisions.^[7]

Tool labels can hide the operating mode. A managed “batch” feature may spin up an online endpoint, send a large batch through it, and tear the endpoint down. That may work, but teams still need to check cost, performance, and whether the platform supports the batch mode they need.^[8]

Streaming fits event-arrival actions such as fraud checks, recommendations, and request-time enrichment. A fraud workflow can use daily batch jobs for feature values. The purchase flow still needs a live decision that can block a transaction.^[9]

Feature stores make the latency split explicit. Offline stores support training, while online stores serve low-latency features for fraud checks, recommendations, risk, pricing, and ranking.^[10]

When the useful response time is tighter than the streaming window, the logic belongs in the user-facing application path.^[2]

Operating Batch and Streaming Pipelines

Batch operations rely on dependency graphs and schedules. Reruns are normal operating work.^[11] Teams track dependencies, start times, and step order. Workflow orchestration helps repair late data, transient failures, and bugs.^[2] That operating model is central to DataOps.
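Reruns stay routine when each run overwrites its own window instead of appending to it. The following sketch assumes daily windows and an in-memory dictionary standing in for a partitioned table; `daily_windows`, `rerun_window`, and `compute_rows` are hypothetical names.

```python
from datetime import date, timedelta
from typing import Callable

Row = dict[str, object]


def daily_windows(start: date, end: date) -> list[date]:
    """List every daily window from start to end, both inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def rerun_window(store: dict[date, list[Row]], window: date,
                 compute: Callable[[date], list[Row]]) -> None:
    """Replace the whole partition so repeating a run cannot duplicate rows."""
    store[window] = compute(window)


def compute_rows(window: date) -> list[Row]:
    """Stand-in transformation that returns one row per window."""
    return [{"day": window.isoformat(), "orders": 3}]


store: dict[date, list[Row]] = {}
for window in daily_windows(date(2024, 1, 1), date(2024, 1, 3)):
    rerun_window(store, window, compute_rows)

# Repairing one day after a bug fix is the same call again.
rerun_window(store, date(2024, 1, 2), compute_rows)
```

The same loop serves as a backfill: widen the date range and the affected windows are recomputed.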

Streaming operations center on brokers, consumers, schemas, and synchronized event flows. Continuously running infrastructure adds more operating concerns. A few Kafka topics can become many topics. Missing schemas break downstream jobs, so allowed-change rules and schema registry practice become part of daily operations.^[4]

Schemas, Ownership, and Replay

Bad upstream data, missing inputs, or dependency changes can break batch consumers. Teams need tests, orchestration, and arrival checks. They also need to know whether the data is fit for downstream use.
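An arrival check can be as small as comparing a load timestamp and a row count against agreed limits. The sketch below is a hedged example: the function name, the limits, and the problem labels are assumptions, and a real pipeline would read these values from table metadata.

```python
from datetime import datetime, timedelta, timezone


def check_arrival(last_loaded_at: datetime, row_count: int, now: datetime,
                  max_age: timedelta, min_rows: int) -> list[str]:
    """Return the list of problems found; an empty list means the input looks usable."""
    problems = []
    if now - last_loaded_at > max_age:
        problems.append("stale")
    if row_count < min_rows:
        problems.append("too_few_rows")
    return problems


now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
fresh = check_arrival(datetime(2024, 1, 2, 5, 0, tzinfo=timezone.utc), 1200,
                      now, max_age=timedelta(hours=2), min_rows=1000)
late = check_arrival(datetime(2024, 1, 1, 5, 0, tzinfo=timezone.utc), 10,
                     now, max_age=timedelta(hours=2), min_rows=1000)
```

A downstream job would run only when the returned list is empty, which keeps the check separate from the transformation itself.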

CDC and database versioning are part of the same dependency and change-management problem.^[2]

See data quality and observability and DataOps for the adjacent reliability work.

Streaming moves those interface problems into event semantics. Software engineers may publish service events, and data teams consume the same topics for analytics.^[4]

Without Avro or another schema practice, the stream becomes a shifting interface. Registry lookup, allowed schema changes, and a change path make the interface usable. That’s why batch vs streaming also touches data governance, data products, and event tracking.
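Allowed-change rules can be written down as a small check that runs before a producer publishes a new schema version. The policy below is a simplified, hypothetical team rule set, not the compatibility logic of Avro or of any schema registry: existing fields may not be removed or change type, and new fields must carry a default.

```python
Schema = dict[str, dict[str, object]]


def disallowed_changes(old: Schema, new: Schema) -> list[str]:
    """List the changes in `new` that break the assumed policy relative to `old`."""
    problems = []
    for name, spec in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name]["type"] != spec["type"]:
            problems.append(f"type changed: {name}")
    for name, spec in new.items():
        if name not in old and "default" not in spec:
            problems.append(f"added without default: {name}")
    return problems


v1: Schema = {"order_id": {"type": "string"}, "amount": {"type": "double"}}
v2_ok: Schema = {**v1, "currency": {"type": "string", "default": "EUR"}}
v2_bad: Schema = {"order_id": {"type": "long"}, "coupon": {"type": "string"}}

ok_result = disallowed_changes(v1, v2_ok)
bad_result = disallowed_changes(v1, v2_bad)
```

Running such a check in the producer's build gives consumers a change path instead of a surprise.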

Cost and Platform Complexity

Batch is often cheaper to start and simpler to pause because scheduled compute doesn’t have to run continuously. Cost-aware orchestration comes before the streaming discussion. That includes cheaper serverless options. Much so-called streaming is micro-batching. Strict SLAs are what justify a specialized stack such as Kafka or Flink.^[12]

For ordinary reporting and analytics, five-minute batch or micro-batch runs may be enough. Kafka becomes easier to justify when a live product decision changes the outcome. Examples include fraud detection, dynamic pricing, ranking, and recommendations.^[13] Teams should treat “real time” as a product requirement to prove, not as a default maturity badge for a data stack.
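A five-minute micro-batch amounts to grouping events by the window they fall into and processing each group as a small batch. This sketch assumes timezone-aware UTC timestamps and in-memory events; the names `window_start` and `micro_batches` are illustrative.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its five-minute window."""
    return ts - (ts - EPOCH) % WINDOW


def micro_batches(events: list[tuple[datetime, str]]) -> dict[datetime, list[str]]:
    """Group event payloads by window so each group can run as one small batch."""
    batches: dict[datetime, list[str]] = defaultdict(list)
    for ts, payload in events:
        batches[window_start(ts)].append(payload)
    return dict(batches)


def at(hour: int, minute: int) -> datetime:
    """Helper that builds a UTC timestamp on a fixed example day."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


batches = micro_batches([(at(12, 1), "a"), (at(12, 4), "b"), (at(12, 6), "c")])
```

Here the first two events land in the 12:00 window and the third in the 12:05 window, so a scheduler firing every five minutes sees two small batches.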

Streaming can earn its cost when delayed results lose value. It still increases operating cost and dependency work.^[2]

A team adopting Kafka may need people who understand cluster operation, producer and consumer conventions, schema practice, and onboarding.^[4]

Portfolio projects can start with a clear batch pipeline. Add Kafka only when the use case needs streaming.^[12]

See Data Engineering Portfolio Projects for project examples that show why a latency requirement exists.

ML Features and Serving

ML systems often use both modes. Fraud systems may precompute stable features daily, then combine them with real-time payload information during a purchase.^[14]

That differs from a pure dashboard pipeline because the output can change the transaction while the customer is waiting.
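The fraud pattern above can be sketched as a lookup of precomputed daily features merged with the live request payload. Everything named here is hypothetical: the feature names, the default values for unknown customers, and the toy blocking rule with its threshold are assumptions for illustration, not a real fraud model.

```python
# Hypothetical output of a daily batch job, keyed by customer id.
DAILY_FEATURES = {
    "c-1": {"avg_amount_30d": 40.0, "orders_30d": 12},
}
DEFAULT_FEATURES = {"avg_amount_30d": 0.0, "orders_30d": 0}


def build_features(daily: dict[str, dict[str, float]], payload: dict) -> dict:
    """Merge stable batch features with values from the live purchase request."""
    stable = daily.get(payload["customer_id"], DEFAULT_FEATURES)
    avg = stable["avg_amount_30d"]
    ratio = payload["amount"] / avg if avg > 0 else float("inf")
    return {**stable, "amount": payload["amount"], "amount_vs_avg": ratio}


def should_block(features: dict, max_ratio: float = 10.0) -> bool:
    """Toy rule: block when the purchase is far above the customer's usual amount."""
    return features["amount_vs_avg"] > max_ratio


usual = build_features(DAILY_FEATURES, {"customer_id": "c-1", "amount": 60.0})
unusual = build_features(DAILY_FEATURES, {"customer_id": "c-1", "amount": 900.0})
```

The batch side can be a day old without harm, while the merge and the decision run inside the purchase request.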

Feast and Tecton ingest precomputed features from batch jobs or streams. They materialize offline and online stores, build point-in-time-correct training sets, and serve low-latency online features through a unified API.^[10]
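Point-in-time correctness means each training row only sees feature values that were known at the time of its label. The sketch below shows the idea with the standard library; it is not the Feast or Tecton API, and the integer timestamps and names are assumptions.

```python
from bisect import bisect_right
from typing import Optional


def feature_as_of(history: list[tuple[int, float]], ts: int) -> Optional[float]:
    """Return the latest value with timestamp <= ts from a history sorted by timestamp."""
    times = [t for t, _ in history]
    i = bisect_right(times, ts)
    return history[i - 1][1] if i else None


def training_rows(labels: list[tuple[int, int]],
                  history: list[tuple[int, float]]) -> list[tuple[Optional[float], int]]:
    """Attach to each (timestamp, label) pair the feature value known at that time."""
    return [(feature_as_of(history, ts), label) for ts, label in labels]


# Feature values recorded at times 10 and 20; labels observed at 5, 15, and 20.
history = [(10, 1.0), (20, 2.0)]
rows = training_rows([(5, 0), (15, 1), (20, 0)], history)
```

The label at time 15 gets the value written at time 10, not the later one, which is what prevents future data from leaking into training.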

Pure batch scoring or campaign use cases may not need a feature store. SQL, dbt, and validation may be enough. Those choices belong with MLOps, Machine Learning System Design, and machine learning infrastructure.

Hybrid Designs

The strongest examples are rarely pure batch or pure streaming. Teams split the system by stability and latency. Batch handles historical feature work, backfills, and training sets. Streaming or online paths handle current context, event-triggered actions, and low-latency retrieval. Fraud-prevention systems, feature stores, and ML platforms all show that split.^[15]^[10]^[7]

If a dashboard can wait until tomorrow, batch is probably enough. Streaming or online serving may be justified when a live purchase, ranking, pricing, or risk decision depends on the result. A hybrid feature path may also fit.

DataTalks.Club