
Data Pipelines

Guide to data pipelines: ingestion, transformation, publication, orchestration, testing, recovery, CDC, and ML handoffs.

Related Wiki Pages

ETL
ELT
ETL vs ELT
CDC
Orchestration
DataOps
DataOps Checks for Data Pipelines
Data Quality and Observability
Data Engineering Platforms
MLOps
Notebook to Production Workflow

Data pipelines move data from source systems into forms that people, products, and models can use. A pipeline is more than a scheduled job. It extracts or receives data and stores enough raw history to recover. It transforms data into modeled outputs and publishes them. It also gives the team a way to test, observe, and rerun the work.

The modern analytics version separates extraction and loading from warehouse-side transformation, then connects that approach to data marts and data lakes. Orchestration, CDC, and reverse data flows sit around those storage choices (^[1]). When the pipeline stores lakehouse tables, use Apache Iceberg for the open table-format and catalog boundary. Use Delta Lake vs Apache Iceberg when the pipeline design has become a table-format choice rather than a general ingestion or transformation question.

The same map extends further because ingestion and orchestration come before modeling. Transformation, analytics outputs, and production ML handoffs belong in the same conversation (^[2]).

This topic covers pipeline design. Use ETL vs ELT for the choice of where transformation runs, and Orchestration and Apache Airflow for scheduling and dependencies. Use DataOps for reliable delivery practice. Use DataOps Checks for Data Pipelines for the concrete checks that protect a pipeline change. Use Data Engineering Platforms for shared infrastructure around many pipelines.

Use dataops vs data engineering when the question is whether the team needs pipeline design work or stronger release, monitoring, and recovery practice.

Use how to build data pipelines when teams need to connect consumer needs to delivery through ingestion, modeling, orchestration, and checks.

Movement, Transformation, and Publication

A useful data pipeline has three responsibilities.

Movement: the pipeline gets data out of source systems, APIs, files, event streams, databases, or application logs and writes it to durable storage.
Transformation: the pipeline cleans, joins, deduplicates, masks, aggregates, or models the data so downstream consumers can use it.
Publication: the pipeline serves the result as a table, mart, dashboard, feature set, search index, model input, or API-facing output. It may also sync the result back into an operational system.
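The three responsibilities above can be sketched as three small functions. This is a minimal illustration, not a production design: the inline CSV, the column names, and the JSON summary are hypothetical stand-ins for a real source system and a real serving layer.

```python
import csv
import io
import json

# Hypothetical source extract; a real pipeline would read from an API, file, or database.
RAW_CSV = "order_id,amount\n1,10.50\n2,4.25\n2,4.25\n"

def move(source_text: str) -> list[dict]:
    """Movement: read source rows and keep them unchanged as the raw layer."""
    return list(csv.DictReader(io.StringIO(source_text)))

def transform(raw_rows: list[dict]) -> list[dict]:
    """Transformation: cast types and drop repeated order ids."""
    seen, modeled = set(), []
    for row in raw_rows:
        key = row["order_id"]
        if key in seen:
            continue
        seen.add(key)
        modeled.append({"order_id": int(key), "amount": float(row["amount"])})
    return modeled

def publish(modeled_rows: list[dict]) -> str:
    """Publication: serve a consumer-facing output, here a JSON summary."""
    total = sum(r["amount"] for r in modeled_rows)
    return json.dumps({"orders": len(modeled_rows), "revenue": total})

raw = move(RAW_CSV)            # raw layer keeps all three source rows
summary = publish(transform(raw))
```

Keeping the raw rows separate from the modeled rows is what later makes replay and backfills possible.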

Publication is part of the pipeline. A table that loads successfully but breaks a dashboard, model, or business workflow is still a pipeline failure, because a job that succeeds isn’t the same as data that is useful. Freshness, volume, and distribution show whether the output is still usable. Schema and lineage show which downstream consumers a change affects (^[3]).

That definition also explains why pipeline work touches several roles. Analytics engineers may own dbt models and marts. Data engineers may own ingestion, storage, orchestration, and recovery. ML engineers may own feature jobs, training data, and serving handoffs.

A pipeline usually separates raw, staged, modeled, and serving layers. Raw data preserves source behavior for replay and backfills. The staging layer cleans names, types, and obvious source-system issues.

Kretz gives the production ML version as a sequence from ingestion to visualization. Click events may land in Kafka or Kinesis, then move through stream or batch processing before storage and product use.

That keeps queues and processing mode in the same design. Serving output still belongs in that pipeline boundary (^[4], ^[5]).

The modeled layer represents business entities and facts, plus dimensions and metrics. It also represents features. Serving outputs feed marts and dashboards. They can also feed feature tables, indexes, APIs, or reverse ETL syncs.

The beginner version stays grounded in Python and SQL, plus Docker, Airflow, and data warehouses (^[6]). The tools matter because a pipeline has to be readable, testable, and maintainable by another engineer.

Ingestion and Change Capture

Ingestion starts the pipeline, but it doesn’t decide the whole architecture. Extraction and loading can come before warehouse-side transformation. Teams keep raw data close to the destination and put business logic in SQL models when that fits the organization (^[1]).

IoT work gives the full pipeline boundary in a compact form. Sensor data flows from installed devices and loggers into an ETL step, then into a database and reporting layer. In that setting, the same person may configure data collection and load the records. They also make the result usable for structural-health monitoring (^[7]).

For ML-facing pipelines, ingestion can begin before connector work. CRISP-DM treats data collection as part of data understanding rather than as its own named step. Pipeline design then has to ask whether important data is missing. If it’s missing, the team may need new collection work. It may also need infrastructure, labeling, or Data Quality and Observability before modeling ^[8].

That makes data collection an explicit pipeline risk even when the project methodology names only data understanding and preparation.

Teams also can’t treat ingestion as an afterthought: raw storage needs guardrails. Warehouses and lakes have different strengths, and schema evolution changes downstream assumptions.

CDC is one ingestion technique, not a separate pipeline type. It captures changed rows instead of copying the whole source table again. The first load gives the destination a baseline. Later syncs move inserts, updates, and deletes so the destination stays current without rewriting everything (^[1]).
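A minimal sketch of that baseline-then-changes flow, assuming a hypothetical change-event shape with `op`, `id`, and `row` fields. Real CDC tools emit richer events, so treat the field names as illustrative only.

```python
def apply_changes(table: dict, changes: list[dict]) -> dict:
    """Apply insert, update, and delete events to a keyed copy of the destination."""
    current = dict(table)  # leave the baseline untouched
    for change in changes:
        key = change["id"]
        if change["op"] == "delete":
            current.pop(key, None)
        else:
            # Inserts and updates are both treated as upserts.
            current[key] = change["row"]
    return current

# The first load gives the destination a baseline.
baseline = {1: {"status": "new"}, 2: {"status": "new"}}

# A later sync moves only the rows that changed.
changes = [
    {"op": "update", "id": 1, "row": {"status": "paid"}},
    {"op": "insert", "id": 3, "row": {"status": "new"}},
    {"op": "delete", "id": 2},
]
synced = apply_changes(baseline, changes)
```

Treating inserts and updates as the same upsert keeps the apply step safe to repeat if a batch of changes is delivered twice.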

Deduplication, ordering guarantees, and PII masking sit close to ingestion. Those checks protect later models and marts from source-system noise (^[2]). This is where pipeline design crosses into governance. If the source sends duplicate or out-of-order records, the transformation layer may still run, but the output may no longer represent the business event correctly.
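The sketch below shows one way those three checks might look close to ingestion. It assumes each record carries a hypothetical `seq` version number from the source, and it masks emails with an unsalted SHA-256 digest, which is a simplification rather than a complete privacy control.

```python
import hashlib

def mask_email(email: str) -> str:
    """Replace an email with a stable digest so joins on the column still work."""
    return hashlib.sha256(email.lower().encode("utf-8")).hexdigest()

def latest_per_key(events: list[dict]) -> list[dict]:
    """Keep one record per order_id, preferring the highest sequence number."""
    latest = {}
    for event in events:
        kept = latest.get(event["order_id"])
        if kept is None or event["seq"] > kept["seq"]:
            latest[event["order_id"]] = event
    return sorted(latest.values(), key=lambda e: e["order_id"])

events = [
    {"order_id": "a", "seq": 2, "email": "Ann@example.com", "status": "paid"},
    {"order_id": "a", "seq": 1, "email": "ann@example.com", "status": "new"},  # arrived late
    {"order_id": "b", "seq": 1, "email": "bob@example.com", "status": "new"},
    {"order_id": "b", "seq": 1, "email": "bob@example.com", "status": "new"},  # duplicate
]

clean = [{**e, "email": mask_email(e["email"])} for e in latest_per_key(events)]
```

Comparing sequence numbers instead of arrival order is what keeps the late `new` record from overwriting the `paid` one.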

Transformation and Modeling

Transformation turns stored data into outputs downstream consumers can understand. In analytics pipelines, that often means SQL models and joins. It can also mean type conversions, business metrics, and marts.

ELT can give analysts more autonomy once raw data is in the warehouse (^[1]).

Modeling is the point where engineers translate entities, relationships, foreign keys, and business questions into outputs. The work moves from ingestion into modeled marts and dashboards, then into ML-specific feature engineering, training, and serving (^[2]). That progression matters because the same upstream data can feed different publication paths.

Scientific catalogs use the same pipeline step with different keys. In astroinformatics scientific data pipelines, Daniel Egbo matches radio detections against optical and infrared catalogs. The “join key” is a measured sky position with uncertainty rather than a stable business identifier ^[9].
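To make the idea of a positional join key concrete, here is a minimal sketch that matches detections to catalog sources by angular separation. It is not the method from the cited work: the fixed match radius stands in for a proper uncertainty model, and the source names and coordinates are hypothetical.

```python
import math

def separation_arcsec(ra1: float, dec1: float, ra2: float, dec2: float) -> float:
    """Angular separation between two sky positions in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600

def cross_match(detections: list[dict], catalog: list[dict], radius_arcsec: float) -> dict:
    """Pair each detection with its nearest catalog source inside the match radius."""
    matches = {}
    for det in detections:
        def distance(src: dict) -> float:
            return separation_arcsec(det["ra"], det["dec"], src["ra"], src["dec"])
        nearest = min(catalog, key=distance)
        if distance(nearest) <= radius_arcsec:
            matches[det["name"]] = nearest["name"]
    return matches

detections = [
    {"name": "radio-1", "ra": 10.0, "dec": 0.0},
    {"name": "radio-2", "ra": 50.0, "dec": 20.0},
]
catalog = [
    {"name": "opt-A", "ra": 10.0 + 1 / 3600, "dec": 0.0},  # one arcsecond away from radio-1
    {"name": "opt-B", "ra": 120.0, "dec": -5.0},
]
matches = cross_match(detections, catalog, radius_arcsec=2.0)
```

A detection with no source inside the radius stays unmatched, which is the positional equivalent of a join that finds no key.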

A dashboard may need one freshness target. A feature store or model-training job may need a different structure and auditability level.

For ML and AI systems, transformation includes feature engineering and production handoffs. In a fraud-prevention pipeline, daily jobs compute stable fraud features while live transaction signals feed real-time decisions at checkout (^[10]). Use Batch vs Streaming for the latency decision and ML pipelines for the larger model lifecycle.

Orchestration and Publication

Orchestration coordinates pipeline work after the steps are clear. Airflow sits at the scheduling layer beside Airbyte-style ingestion and dbt-style transformation (^[1]). Airflow can run a connector sync and trigger transformations. It can also sequence checks, but it shouldn’t hide the business logic inside a tangle of tasks. The pipeline remains easier to review when ingestion, transformation, checks, and publication each have an explicit role.
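The sketch below illustrates the coordination idea with the standard library's `graphlib` rather than Airflow itself. It assumes four hypothetical task names and runs each one only after its dependencies. An orchestrator such as Airflow adds scheduling, retries, and a UI on top of the same dependency graph.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dependencies = {
    "ingest": set(),
    "transform": {"ingest"},
    "checks": {"transform"},
    "publish": {"checks"},
}

def run_pipeline(tasks: dict, deps: dict) -> list[str]:
    """Run every task after its dependencies and return the execution order."""
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        tasks[name]()
    return order

# Placeholder task bodies that only record that they ran.
log = []
tasks = {name: (lambda n=name: log.append(n)) for name in dependencies}

order = run_pipeline(tasks, dependencies)
```

The orchestration layer only knows names and dependencies. The business logic stays inside the task functions, where it can be reviewed and tested separately.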

For a local Airflow project, DataTalks.Club’s lightweight local Airflow with Docker Compose tutorial keeps the scheduler, UI, and metadata database visible. It also keeps the DAG folder and logs visible.

A production pipeline anatomy starts with ingestion and buffering, then moves to transforms, storage, and visualization. SQL or dataframe transforms fit into that anatomy. Airflow or simpler schedulers and model-serving options do too (^[11]). The practical advice is to start simple and add Airflow, Kubernetes, or heavier infrastructure when the dependencies justify it.

When an ML notebook becomes the starting point, the data-pipeline boundary is still the repeatable path from inputs to published outputs and recovery. Notebook to Production Workflow covers the surrounding handoff sequence. It starts with the decision and reusable code. Then it builds the data and feature path before evaluation, serving, monitoring, and feedback ^[12].

Publication closes the pipeline with a warehouse table, mart, or dashboard. It can also be a model artifact, feature set, prediction API, or reverse data flow back into an operational system. Reverse data flows mean the pipeline may not end inside the warehouse. It may send modeled data back to business tools when sales, marketing, or operations teams need it (^[1]).

Testing, Recovery, and Observability

Reliable pipelines are operated systems, not scripts that happen to run on a schedule. Christopher Bergh anchors that operating model in ^[13] and ^[14]. He connects pipeline quality to version control, tests, CI/CD, and observability. He also adds automated runbooks, realistic test data, and deployment confidence.

Data tests need to cover both code and data behavior. Bergh mentions dbt, Great Expectations, SQL tests, and test strategies in ^[13]. Ramirez gives the applied data-engineering version for PySpark jobs, cloud monitoring, and schema changes. She also covers job failures, runbooks, and error documentation (^[10]).

Observability catches failures that task status alone misses. Barr Moses names freshness, volume, and distribution in ^[3]. She also adds schema and lineage, then separates detection from diagnosis. That distinction matters for pipelines because knowing that a table is late doesn’t tell the team why it is late.

The cause may sit in an upstream source or ingestion connector. It may also be a transformation bug, a schema change, or publication. Use Data Quality and Observability for those reliability signals.
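A few of those signals can be sketched as plain functions. The thresholds, column names, and timestamps below are hypothetical, and dedicated tools cover far more cases. The point is only that these checks inspect the published output rather than the job status.

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded_at: datetime, now: datetime, max_age: timedelta) -> bool:
    """Freshness: the newest load must be no older than max_age."""
    return now - last_loaded_at <= max_age

def check_volume(row_count: int, expected: int, tolerance: float = 0.5) -> bool:
    """Volume: row_count must sit within tolerance (a fraction) of expected."""
    return abs(row_count - expected) <= tolerance * expected

def missing_columns(actual: list[str], required: list[str]) -> list[str]:
    """Schema: list required columns that the published table no longer has."""
    return [c for c in required if c not in actual]

now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
last_load = datetime(2024, 1, 1, 23, 0, tzinfo=timezone.utc)

report = {
    "fresh": check_freshness(last_load, now, timedelta(hours=24)),
    "volume_ok": check_volume(row_count=400, expected=1000),
    "missing_columns": missing_columns(["order_id"], ["order_id", "amount"]),
}
```

In this example the table is fresh but fails the volume and schema checks, which is the kind of failure a green task status would hide.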

Teams should design recovery into the pipeline. Useful pipelines keep enough raw or intermediate state to backfill, replay, or compare outputs after a change. CDC feeds need checkpoints and delete handling. Batch jobs need rerunnable windows, and streaming jobs need lag and replay monitoring. ML feature pipelines need a way to connect training data, online features, and production outcomes.
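A rerunnable window can be sketched as a job that rebuilds one partition from raw data and overwrites the previous result. The event shape and the in-memory `output` dictionary are hypothetical stand-ins for raw storage and a partitioned table.

```python
from datetime import date, timedelta

def daily_windows(start: date, end: date) -> list[date]:
    """List every day from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

def run_window(raw_events: list[dict], output: dict, day: date) -> None:
    """Rebuild one day's partition from raw data, replacing any previous result."""
    output[day] = sum(e["amount"] for e in raw_events if e["day"] == day)

raw_events = [
    {"day": date(2024, 1, 1), "amount": 5},
    {"day": date(2024, 1, 2), "amount": 7},
    {"day": date(2024, 1, 2), "amount": 3},
]

output = {}
for day in daily_windows(date(2024, 1, 1), date(2024, 1, 2)):
    run_window(raw_events, output, day)
first_run = dict(output)

# A backfill reruns the same windows; overwriting keeps the result identical.
for day in daily_windows(date(2024, 1, 1), date(2024, 1, 2)):
    run_window(raw_events, output, day)
```

Because each run replaces its partition instead of appending to it, a retry or backfill cannot double-count rows.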

Batch, Streaming, and CDC

Batch vs Streaming is a latency and operating decision. Kretz introduces events and queues in ^[11], then contrasts streaming and batch. Streaming helps when a system must react to events as they arrive. Batch helps when a bounded run is easier to reason about, cheaper to operate, and fresh enough for the consumer.

In Ramirez’s fraud-detection system, daily batch jobs prepare stable network and member features. The checkout path still needs instant inference for a transaction (^[10]). That’s stronger than “stream everything” because it names which part of the decision needs low latency.
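The split between daily features and a request-time decision can be sketched as a lookup plus a rule. Everything here is a hypothetical illustration rather than the system described in the source: the feature name, the thresholds, and the rule itself are assumptions.

```python
# Hypothetical output of a daily batch job, keyed by member id.
batch_features = {
    "m1": {"chargebacks_90d": 0},
    "m2": {"chargebacks_90d": 3},
}

def score_transaction(member_id: str, amount: float, features: dict) -> str:
    """Combine precomputed batch features with the live transaction amount."""
    # Members missing from the batch output fall back to a neutral default.
    history = features.get(member_id, {"chargebacks_90d": 0})
    risky_history = history["chargebacks_90d"] >= 2  # assumed threshold
    large_amount = amount > 500                      # assumed threshold
    return "review" if risky_history and large_amount else "approve"

decision = score_transaction("m2", 900.0, batch_features)
```

Only the lookup and the rule run at checkout. The expensive history aggregation has already happened in the batch job.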

Mehdi OUAZZA adds the team-scale cost of streaming in ^[15]. He connects Kafka to schemas and schema registries. He also discusses explicit producer-consumer agreements. Those conventions keep consumers from breaking when producers change events. Streaming pipelines therefore need platform standards, not only a broker.
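A producer-consumer agreement can be illustrated with a small compatibility check that flags removed fields and changed types. This is a simplified sketch with hypothetical field names. Real schema registries apply richer compatibility rules than this comparison.

```python
def breaking_changes(old_schema: dict, new_schema: dict) -> list[str]:
    """List changes that would break a consumer reading the old schema."""
    problems = []
    for field, field_type in old_schema.items():
        if field not in new_schema:
            problems.append(f"removed field: {field}")
        elif new_schema[field] != field_type:
            problems.append(f"type changed: {field} {field_type} -> {new_schema[field]}")
    return problems

consumer_contract = {"order_id": "string", "amount": "double"}

# Adding a field is safe for existing consumers under this rule.
additive = {"order_id": "string", "amount": "double", "currency": "string"}

# Removing a field and changing a type are not.
breaking = {"order_id": "long"}

safe_report = breaking_changes(consumer_contract, additive)
unsafe_report = breaking_changes(consumer_contract, breaking)
```

A check like this could run before a producer deploys, so consumers learn about a breaking change before events arrive in the new shape.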

CDC sits between full reloads and event streaming. It can keep a warehouse or lake current with row-level changes without forcing every downstream consumer to operate as a streaming application. It still needs checkpoints, schema handling, deduplication, and recovery. Treat CDC as an ingestion strategy that feeds a pipeline. Then decide separately whether the downstream work is batch, micro-batch, streaming, or request-time serving.

For bounded file-backed SQL work, DuckDB fits the same simple-first pipeline path. It works when Parquet or lake files are enough and always-on orchestration isn’t justified (^[16]).

Platform Conventions

One pipeline can live as a small repo, but many pipelines need a platform. The platform supplies shared storage, orchestration, secrets, and deployment paths. It also supplies lineage, monitoring, access control, and reusable conventions. That’s why this topic sits next to Data Engineering Platforms.

Mehdi OUAZZA gives the scale-up version. In ^[15], the data platform enables self-service, onboarding, and scalability.

Airflow and shared conventions are part of that platform, and playbooks and best practices belong there too. A split between platform work and use-case pipelines helps teams avoid rewriting the same orchestration, access, and recovery rules for every project.

Reusable ingestion, transformation, and datamart templates put platform discipline inside individual pipelines. They work best when they reuse production-proven pieces such as API ingestion into bronze, merges into silver, or shared geography dimensions. They still need room for project-specific logic (^[17]).

Cloud-native storage conventions matter when the pipeline works over dense imagery instead of ordinary tables. Daynan Crull contrasts cloud-native access with local downloads that make analysts manage massive image files.

He names Cloud Optimized GeoTIFFs, or COGs, from Earth observation. He also names STAC-style asset catalogs as a better storage and query approach. The data stays close to cloud compute. Analysts query only the relevant tiles, so the pipeline avoids downloading or cutting whole files before analysis ^[18]. That convention links pipeline design to Data Engineering Platforms, storage layout, analyst-facing query access, and astroinformatics scientific data pipelines.

Paul Iusztin and Mariano Semelman extend the platform conventions into AI systems. Paul frames the AI engineer as a full-stack role that has to ship products, not only prototypes (^[19]). Mariano focuses on end-to-end ownership and business requirements. He also discusses feedback and the declining role of notebooks in production (^[12]). For data pipelines, their shared implication is that product systems need a repeatable path from data and prompts or features into production behavior.

Design Tradeoffs

Pipeline design follows the same broad lifecycle even when each use case applies different design pressure. Kwong’s ^[1] puts the extraction and loading boundary first. That makes ETL vs ELT a pipeline decision rather than only a tooling label. After teams choose that transformation boundary, the wider lifecycle still runs from ingestion through publication plus recovery and reliability.

Tuli’s ^[2] starts with ingestion choices before ordering, deduplication, and PII masking. Modeling and marts come later, followed by dashboards and ML handoffs. Together, these examples connect storage choices and early data handling to the team’s ability to change the pipeline safely.

Reliability changes the tradeoff from job status to output usefulness. Bergh’s DataOps work in ^[13] and ^[14] frames reliable pipeline delivery around version control and tests as team practice. CI/CD, observability, and recovery runbooks make the same practice usable in production.

Moses adds the downstream view in ^[3] because a green run can still publish stale, partial, shifted, or schema-breaking data. Use Data Quality and Observability for freshness, volume, or distribution signals. Schema plus lineage helps show which consumers may break and where the cause sits.

Production pipelines also differ by latency and ownership. Kretz’s ^[11] puts ingestion plus buffering before later work. Transforms and storage come next. Visualization and serving follow (^[4]). His practical line is to keep the first production version simple enough to operate. Ramirez’s ^[10] uses daily feature jobs beside live checkout decisions, so Batch vs Streaming depends on the decision that consumes the data.

Mehdi OUAZZA adds self-service onboarding and Airflow standards in ^[15]. He also covers Kafka schemas and producer-consumer agreements, which link individual pipelines to Data Engineering Platforms.

Katz keeps the foundation concrete in ^[6] by making Python and SQL the base. Docker and Airflow support day-to-day work beside warehouses and tests, while small functions plus classes make pipeline code easier for another engineer to maintain.

Fundamentals of Data Engineering by Joe Reis and Matthew Housley frames this same pipeline lifecycle across ingestion, transformation, and serving layers.

Adjacent Topics

Use ETL vs ELT when the question is where transformations should run. Use ETL or ELT when the question is about one of those lifecycles on its own rather than the comparison. Use CDC when the source data changes incrementally and full reloads are wasteful. Use Orchestration and Apache Airflow when the problem is scheduling, dependencies, retries, or backfills (^[1], ^[2]).

Use DataOps when the concern is version control and tests. It also covers CI/CD, observability, and recovery ^[20].

Use DataOps Checks for Data Pipelines when the concern is pipeline-level check design. It covers freshness, volume, schema, and distribution checks. It also covers uniqueness, lineage, and runbooks. Use Data Quality and Observability when the concern is freshness, volume, or distribution. It also covers schema, lineage, SLAs, and runbooks.

Use Data Engineering Platforms when the same conventions have to support many teams and many pipelines.

Use Batch vs Streaming when latency and replay drive the design. It also covers cost and operations. Use MLOps and ML pipelines when the pipeline publishes features, training data, or model artifacts. They also apply when the pipeline publishes online predictions or feedback data.