
Streaming

Event streaming for real-time pipelines, Kafka architectures, schema management, feature stores, fraud systems, and search.

Related Wiki Pages

Data Pipelines, Data Engineering Platforms, CDC, DataOps, Data Quality and Observability, MLOps, Machine Learning System Design, Search

Streaming data systems handle events close to the moment they’re produced. A producer writes events to a broker such as Kafka or Kinesis. SQS and RabbitMQ appear in the same queueing family. Consumers then transform those events for storage and dashboards. Other consumers use the same events for alerts, online features, fraud decisions, or search ranking.

Andreas Kretz places streaming inside data pipelines rather than treating it as the default architecture for every data problem ^[1]. The topic connects to the batch vs streaming tradeoff, DataOps, and schema ownership. It also connects to MLOps and search wherever a delayed result loses product value.

Kretz gives a pipeline-level explanation in ^[1]. He uses website click events flowing into Kafka or Kinesis as the ingestion example. He then contrasts stream handling with batch work: streaming processes events as they arrive in the queue, while batch stores data first and handles it later ^[2].

Pipeline Anatomy

A streaming system in these discussions has four practical parts:

producers that emit events
a broker or queue that buffers events
jobs that transform or enrich events
outputs such as storage, online stores, alerts, applications, search indexes, or dashboards

Kretz maps those pieces in ^[1]. His pipeline anatomy includes ingestion and queues alongside compute frameworks, storage, and visualization.
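The four parts can be sketched in a few lines. This is a minimal in-memory illustration, not a real Kafka or Kinesis client: `InMemoryBroker`, `enrich`, `run_pipeline`, and the click payloads are all hypothetical names invented for the example.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Event = Dict[str, object]


class InMemoryBroker:
    """Toy stand-in for a broker such as Kafka or Kinesis: a FIFO buffer."""

    def __init__(self) -> None:
        self._queue: Deque[Event] = deque()

    def publish(self, event: Event) -> None:
        self._queue.append(event)

    def poll(self) -> List[Event]:
        """Drain and return everything buffered so far, oldest first."""
        drained = list(self._queue)
        self._queue.clear()
        return drained


def enrich(event: Event) -> Event:
    """Transformation job: add a derived field without mutating the input."""
    return {**event, "is_checkout": event["page"] == "/checkout"}


def run_pipeline(
    broker: InMemoryBroker, job: Callable[[Event], Event], sink: List[Event]
) -> int:
    """Consume buffered events, transform them, and append them to the output."""
    events = broker.poll()
    sink.extend(job(e) for e in events)
    return len(events)


def demo() -> List[Event]:
    broker = InMemoryBroker()
    # Producer: a website emitting click events (hypothetical payloads).
    broker.publish({"user": "u1", "page": "/home"})
    broker.publish({"user": "u1", "page": "/checkout"})
    sink: List[Event] = []  # Output: stands in for storage or a dashboard feed.
    run_pipeline(broker, enrich, sink)
    return sink
```

Calling `demo()` returns the two enriched events in publish order. In a real system each of the four parts would be a separately deployed and monitored component.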

Kretz also warns that too many tools create extra operating work. That warning matters because streaming adds always-on brokers and consumer jobs. It also adds schema checks, lag monitoring, and replay paths. A data engineering platform therefore needs topic naming and ownership. It also needs compatibility rules, observability, and conventions for replaying or backfilling data when consumers break.

Latency Boundaries

Lars Albertsson gives the clearest latency boundary in ^[3]. He separates slow reporting, streaming’s middle latency window, and sub-100-millisecond interactions that need data already inside the serving application. Streaming can react in seconds or minutes, but it still crosses multiple services and often includes internal batching.
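One way to make those boundaries concrete is a small helper that maps a latency requirement to a tier. Only the 100-millisecond boundary comes from the discussion above. The `batch_floor_seconds` parameter is an assumption added for illustration: it stands for the shortest batch window a given team can run reliably.

```python
def latency_tier(required_latency_seconds: float, batch_floor_seconds: float = 60.0) -> str:
    """Map a latency requirement to a rough processing tier.

    batch_floor_seconds is a hypothetical, team-specific value: the shortest
    batch window the team can operate. Requirements at or above it fit batch.
    """
    if required_latency_seconds <= 0:
        raise ValueError("latency requirement must be positive")
    if required_latency_seconds < 0.1:
        # Sub-100 ms: the data must already sit inside the serving application.
        return "in-application"
    if required_latency_seconds < batch_floor_seconds:
        return "streaming"
    return "batch"
```

A team that can push its batch windows down to two seconds would call `latency_tier(5, batch_floor_seconds=2)` and get `"batch"`, which mirrors the argument that shorter batch windows shrink the range where streaming is needed.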

Albertsson is also the strongest skeptic of streaming as a default. In ^[3], he argues that teams can often push batch latency down to minutes or seconds while keeping explicit dependencies and easier reruns. His view favors workflow-oriented batch when the product can tolerate the delay, and it judges streaming by recoverability rather than tool fashion.

Josh Fischer and Ning Wang’s book Grokking Streaming Systems gives a guided tour of the internal mechanics behind these systems. It covers watermarks, windows, and backpressure without tying them to a single framework.

Adrian Brudaru adds the modern data-stack warning in ^[4]. Many systems described as streaming are micro-batches unless strict service-level agreements justify Kafka, Flink, or similar infrastructure. Short batches or micro-batches can reduce latency while keeping bounded windows that engineers can test and rerun.
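The bounded-window idea can be sketched as follows. This is an illustrative grouping function under stated assumptions, not a description of any specific tool: events are `(timestamp_seconds, payload)` tuples and `micro_batches` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def window_start(timestamp: int, window_seconds: int) -> int:
    """Return the start of the fixed window that contains the timestamp."""
    return timestamp - (timestamp % window_seconds)


def micro_batches(
    events: Iterable[Tuple[int, str]], window_seconds: int
) -> Dict[int, List[str]]:
    """Group (timestamp, payload) events into fixed, bounded windows."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        batches[window_start(timestamp, window_seconds)].append(payload)
    # Sort by window start so reruns over the same input give the same output.
    return dict(sorted(batches.items()))
```

Because each window covers a fixed interval, a failed window can be recomputed from the same input and compared with the previous result, which is the testing and rerun property the paragraph describes.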

Kafka and Event Interfaces

Kafka appears as the concrete symbol for event streaming, but the guests don’t treat Kafka as the whole system. Kretz uses Kafka and Kinesis for click-event ingestion in ^[2]. Brudaru names Kafka and SQS as common buffers in ^[4]. He also puts Flink in the stricter streaming path, while warning that many “streaming” systems are micro-batch pipelines unless the SLA requires continuous event processing.

Brudaru’s 2025 view of the tooling keeps the decision grounded. Kafka and SQS can buffer events, while Flink or DuckDB can process downstream data depending on the latency and state requirements. A team should call the system streaming only when the service-level agreement needs continuous event processing rather than short batch windows ^[4].

The broker gives producers and consumers a shared event path. A product service can publish one event, then consumers can use it for analytics and alerts. Other consumers can use the same event for ML features and search freshness. That separation only works when each consumer can understand the event and recover from late, duplicated, malformed, or replayed events.
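A consumer that tolerates duplicated, malformed, and replayed events might look like the sketch below. The `event_id` and `type` fields and the dead-letter list are assumptions made for the example, since the source does not define an event contract.

```python
import json
from typing import Iterable, List, Set, Tuple

REQUIRED_FIELDS = ("event_id", "type")  # hypothetical event contract


def consume(
    raw_messages: Iterable[str], seen_ids: Set[str]
) -> Tuple[List[dict], List[str]]:
    """Return (accepted events, dead-lettered raw messages).

    seen_ids persists across calls so that replays and duplicates are skipped.
    """
    accepted: List[dict] = []
    dead_letter: List[str] = []
    for raw in raw_messages:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            dead_letter.append(raw)  # malformed payload
            continue
        valid = (
            isinstance(event, dict)
            and all(field in event for field in REQUIRED_FIELDS)
            and isinstance(event["event_id"], str)
        )
        if not valid:
            dead_letter.append(raw)  # parses, but breaks the contract
            continue
        if event["event_id"] in seen_ids:
            continue  # duplicate or replayed event: already handled
        seen_ids.add(event["event_id"])
        accepted.append(event)
    return accepted, dead_letter
```

Keeping `seen_ids` outside the function is a deliberate choice in this sketch: a production consumer would need a durable store for it, otherwise a restart would lose the deduplication state.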

Mehdi OUAZZA shows the failure mode in ^[5]. He warns that teams shouldn’t expect engineers with no Kafka experience to design a cluster under scale pressure. He also explains why topics and schemas become platform concerns. Software engineers may publish Kafka events for service-to-service communication. Data teams may consume the same topics into S3, Spark, or a warehouse.

Teams use typed schemas, a schema registry, allowed-change rules, and a documented schema-change path to turn an event stream into a shared data interface. Without those controls, each producer change can become a downstream parsing failure, data-quality incident, or compute-cost problem. The same ownership model links streaming to data products, data mesh, data governance, and data quality and observability.
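An allowed-change rule can be expressed as a check that runs before a producer ships a new schema version. The rule set below is a simplified assumption for illustration (no removed fields, no type changes, no new required fields). It is not the compatibility algorithm of any particular schema registry, and `Field` and `breaking_changes` are hypothetical names.

```python
from typing import Dict, List, NamedTuple


class Field(NamedTuple):
    type: str
    required: bool


Schema = Dict[str, Field]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List changes that would break consumers written against the old schema."""
    problems: List[str] = []
    for name, field in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name].type != field.type:
            problems.append(
                f"type changed: {name} {field.type} -> {new[name].type}"
            )
    for name, field in new.items():
        if name not in old and field.required:
            problems.append(f"new required field: {name}")
    return problems
```

An empty list means the change passes these example rules. A non-empty list is the kind of signal that would route the change through the documented schema-change path instead of straight to production.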

CDC sits next to this interface problem. Database change capture can feed batch jobs, streaming consumers, warehouses, or event-driven services. Consumers still need to know what changed, whether the change is compatible, and how to replay or backfill when the source system changes.
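The replay requirement is easier to see with a toy change log. The sketch assumes a hypothetical change-event shape with `op`, `key`, and `row` fields, which is not the format of any specific CDC tool.

```python
from typing import Dict, Iterable


def apply_changes(table: Dict[int, dict], changes: Iterable[dict]) -> Dict[int, dict]:
    """Apply an ordered change log to a keyed table, in place.

    Inserts and updates are both treated as upserts, and deletes ignore
    missing keys, so replaying the same ordered log leaves the table unchanged.
    """
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            table[key] = change["row"]
        elif op == "delete":
            table.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return table
```

Treating inserts as upserts is a design choice in this sketch. It trades strict validation for safe replays, which matters when a consumer has to backfill after a failure.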

Batch, Micro-Batch, and Event Time

Batch and streaming systems create different latency and recovery choices. Albertsson argues in ^[3] that batch windows make dependencies explicit: a job knows which upstream data and time interval it depends on. Streaming can hide dependencies in event arrival order, joins across streams, and synchronization between consumers.

The streaming decision starts with the action that consumes the result. Fraud blocking and operational alerts can justify low latency. Online features, search freshness, and traffic response can justify it too. Reports, backfills, training-set construction, and many warehouse models often fit batch. The Batch vs Streaming page covers that broader tradeoff.

Live experiments can also justify streaming when the experiment experience has to be assembled at exposure time. In the Bol.com favorite-brand validation, Abbaspour’s team wanted only employees to see the swiping page. The recommendations still depended on user-level calculations. The team used on-the-fly processing instead of precomputing recommendations for millions of users. That made targeting, product instrumentation, and experiment design part of the streaming decision (^[6] ^[7]).

Stream Engines and IoT Research

Kretz lists Spark and Flink as compute options in ^[1]. He also mentions Lambda, Glue, and Docker jobs, but only after saying the team should understand the schema, transformation steps, and desired output before choosing an implementation. Brudaru places Flink beside Kafka and SQS in ^[4], in the same section where he discusses micro-batching.

Eleni Tzirita-Zacharatou shows why hard streaming problems remain active research. In ^[8], she describes Nebula Stream as a general-purpose data management system for IoT. She also frames it as a line of research that follows Apache Flink. IoT streams force systems to handle distributed data, resource limits, and application-specific algorithms at the same time.

Those research concerns surface in production systems too. Teams have to reason about event time, late arrivals, stateful joins, and delivery guarantees. They also have to account for backpressure and the cost of keeping stream infrastructure running continuously.
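Event time and late arrivals can be illustrated with a small tumbling-window counter. This is a simplified sketch, not the API of Flink or any other engine: the watermark here is assumed to be the largest event time seen minus a fixed `allowed_lateness`, and `TumblingCounter` is a hypothetical name.

```python
from typing import Dict, List


class TumblingCounter:
    """Count events per fixed event-time window, closing windows by watermark."""

    def __init__(self, window_size: int, allowed_lateness: int) -> None:
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.max_event_time = float("-inf")
        self.open: Dict[int, int] = {}    # window start -> running count
        self.closed: Dict[int, int] = {}  # window start -> final count
        self.late: List[int] = []         # events that missed their window

    @property
    def watermark(self) -> float:
        return self.max_event_time - self.allowed_lateness

    def add(self, event_time: int) -> None:
        start = event_time - (event_time % self.window_size)
        if start + self.window_size <= self.watermark:
            # The window already closed: route the event to a late path.
            self.late.append(event_time)
            return
        self.open[start] = self.open.get(start, 0) + 1
        self.max_event_time = max(self.max_event_time, event_time)
        # Close every open window whose end is at or behind the watermark.
        for window in sorted(self.open):
            if window + self.window_size <= self.watermark:
                self.closed[window] = self.open.pop(window)
```

The out-of-order event at time 8 in a sequence such as 1, 3, 12, 8 still lands in the first window because the watermark has not passed that window's end. An event that arrives after the window closes ends up in `late`, where a real system would need a correction or replay path.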

Fraud, Feature Stores, and Online ML

The strongest applied examples combine streaming and batch. Angela Ramirez explains this split in ^[9]. Daily batch jobs compute fraud features, while the live purchase flow calls a fraud system to decide whether to block a transaction. She returns to the same split: known calculations can be prepared ahead of time, while transaction-payload information must be handled almost immediately.
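The split can be sketched as a decision function that joins precomputed features with the live transaction payload. The feature names, thresholds, and rules below are invented for illustration and are not taken from the system Ramirez describes.

```python
from typing import Dict, Mapping

# Stand-in for the output of a daily batch job, keyed by user id (hypothetical).
BATCH_FEATURES: Dict[str, Dict[str, float]] = {
    "u1": {"avg_amount_30d": 40.0, "chargebacks_90d": 0},
    "u2": {"avg_amount_30d": 25.0, "chargebacks_90d": 3},
}
DEFAULT_FEATURES: Dict[str, float] = {"avg_amount_30d": 0.0, "chargebacks_90d": 0}


def should_block(
    transaction: Mapping[str, object],
    batch_features: Mapping[str, Mapping[str, float]] = BATCH_FEATURES,
    ratio_threshold: float = 10.0,
    max_chargebacks: int = 2,
) -> bool:
    """Decide at request time using batch features plus the live payload."""
    features = batch_features.get(str(transaction["user_id"]), DEFAULT_FEATURES)
    if features["chargebacks_90d"] >= max_chargebacks:
        return True  # known ahead of time from the batch path
    average = features["avg_amount_30d"]
    amount = float(transaction["amount"])  # only known at request time
    return average > 0 and amount / average > ratio_threshold
```

The lookup stays cheap because the expensive aggregation already ran in batch. Only the comparison against the incoming amount happens in the live purchase flow.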

Willem Pienaar gives the feature-store version in ^[10]. He places feature stores between source systems and the production ML environment. Feature stores can use raw streams, warehouses, and lakes. He separates streaming ingestion, batch transforms, and training-set construction.

Pienaar also separates serving-time behavior. That split helps teams avoid training-serving skew while still serving low-latency online features. For machine learning system design, the useful question isn’t whether all data should stream. It’s which features can be precomputed and which request-time signals must be handled live. Teams then validate both paths through MLOps, model monitoring, and machine learning infrastructure.

Search Freshness and Ranking Signals

Search systems use streaming ideas when relevance depends on fresh events and recent inventory. Current user behavior or changing ranking signals can create the same need.

Daniel Svonava frames search as a production decision problem in ^[11]. He discusses combining vector similarity with filters and recency. He also adds constraints, time encoding, normalization, and query-time weights.
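A minimal sketch of that combination blends cosine similarity with a recency term under query-time weights. The exponential half-life decay and the default weights are assumptions chosen for the example, not the encoding Svonava describes.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recency(age_days: float, half_life_days: float) -> float:
    """Decay from 1.0 toward 0.0, halving every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def score(
    query_vec: Sequence[float],
    doc_vec: Sequence[float],
    age_days: float,
    w_similarity: float = 0.7,
    w_recency: float = 0.3,
    half_life_days: float = 30.0,
) -> float:
    """Weighted blend of similarity and freshness; weights are set per query."""
    return w_similarity * cosine(query_vec, doc_vec) + w_recency * recency(
        age_days, half_life_days
    )
```

Both terms fall in a comparable range for non-negative embeddings, which is a crude form of the normalization step. With equal similarity, a fresher document scores higher, and raising `w_recency` at query time strengthens that effect.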

Atita Arora connects modern search to personalization and learning-to-rank in ^[12]. She also connects search to vector databases and RAG. Those systems may not need a streaming framework for every update, but they often need reliable ingestion, freshness guarantees, and reindexing paths.

A search index, vector store, or recommendation candidate store is a consumer of product events. The design has to define how fresh results need to be and what happens when an event arrives late. It also has to define how embeddings or ranking features are recomputed, then how teams evaluate relevance after changes. The Search page covers the retrieval-specific side of that design.

Reliability and Operations

Streaming systems fail differently from scheduled jobs. A batch job can be late, missing, or wrong for a fixed window. A stream can lag, duplicate messages, handle events out of order, or keep running while silently changing a metric. Albertsson’s discussion in ^[3] is useful because it names the recovery advantage of explicit batch windows.

The streaming version of DataOps needs lag monitoring and replay strategy. It also needs schema compatibility checks, consumer error alerts, and runbooks. OUAZZA supplies the schema side in ^[5].
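Lag monitoring reduces to comparing how far the log has advanced with how far a consumer has committed. The sketch below works on plain dictionaries of per-partition offsets. How those offsets are fetched from a real broker is left out, and treating a partition with no commit as offset 0 is an assumption of this example.

```python
from typing import Dict, List


def consumer_lag(
    end_offsets: Dict[int, int], committed_offsets: Dict[int, int]
) -> Dict[int, int]:
    """Per-partition lag: latest offset minus the consumer's committed offset."""
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def lagging_partitions(lag: Dict[int, int], threshold: int) -> List[int]:
    """Partitions whose lag exceeds the alert threshold, in sorted order."""
    return sorted(p for p, value in lag.items() if value > threshold)
```

A runbook can then key off `lagging_partitions`: a single lagging partition often points at a hot key or a stuck consumer instance, while lag on every partition suggests the consumer group as a whole cannot keep up.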

Ramirez adds the production ML side in ^[9]. She discusses monitoring and runbooks. She also covers schema changes and upstream data problems. Pienaar adds feature validation and monitoring in ^[10].

The more consumers depend on a stream, the more the stream needs production ownership. Freshness checks, schema checks, and volume checks become part of the platform rules. Replay procedures and communication about event-rule changes do too.
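Freshness and volume checks can start as a small function that runs on a schedule. The thresholds and the `stream_health` name are placeholders for illustration. Real values would come from the stream's owner.

```python
from typing import List


def stream_health(
    seconds_since_last_event: float,
    events_last_hour: int,
    expected_per_hour: int,
    max_staleness_seconds: float = 300.0,
    volume_tolerance: float = 0.5,
) -> List[str]:
    """Return the names of failed checks; an empty list means healthy."""
    problems: List[str] = []
    if seconds_since_last_event > max_staleness_seconds:
        problems.append("stale")
    low = expected_per_hour * (1 - volume_tolerance)
    high = expected_per_hour * (1 + volume_tolerance)
    # Both drops and spikes count: a spike can signal duplicates or a replay.
    if not low <= events_last_hour <= high:
        problems.append("volume")
    return problems
```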

Design Signals

A credible streaming design names the latency requirement before naming the tool. Kretz’s pipeline anatomy gives the basic structure. Name the producer and broker first. Then name the transformation job, storage, and output (^[1]). Albertsson’s comparison then asks whether streaming is truly needed or whether a short batch window would be easier to rerun (^[3]).

The strongest designs explain:

the action that needs low latency, such as fraud blocking, alerting, online feature retrieval, search freshness, or real-time traffic response
the event schema, versioning rules, and owner
handling for late events, duplicate events, malformed payloads, and replay
checks for lag, freshness, volume, errors, and downstream quality
why batch, micro-batch, or CDC alone wouldn’t meet the requirement

Those signals keep streaming grounded in speed and correctness. They also keep it grounded in recoverability, ownership, and product value.
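The five signals can also be kept as a lightweight review checklist. The sketch below is one hypothetical way to encode them. The field names are paraphrases of the list above, not an established template.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class StreamingDesign:
    low_latency_action: str = ""  # e.g. fraud blocking or alerting
    schema_and_owner: str = ""    # event schema, versioning rules, owner
    bad_event_handling: str = ""  # late, duplicate, malformed, replayed events
    health_checks: str = ""       # lag, freshness, volume, errors, quality
    why_not_batch: str = ""       # why batch, micro-batch, or CDC falls short


def missing_signals(design: StreamingDesign) -> List[str]:
    """Names of the design signals that are still blank."""
    return [f.name for f in fields(design) if not getattr(design, f.name).strip()]
```

A design review could refuse to move forward while `missing_signals` returns anything, which keeps the latency requirement ahead of the tool choice.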

DataTalks.Club