Guide

Airflow: When Data Teams Need Workflow Orchestration

A podcast-backed guide to Airflow as workflow orchestration for data pipelines, analytics stacks, platform teams, and batch ML workflows.

Related Wiki Pages

Data Engineering
Data Engineering Platforms
Modern Data Stack
DataOps
Data Quality and Observability
Batch vs Streaming

Airflow is useful when a data team has recurring jobs that depend on each other. In the DataTalks.Club archive, guests discuss it as an orchestrator. It schedules work, runs dependent steps, and gives teams a place to look at pipeline runs. It doesn’t store data or define business logic (Natalie Kwong, Data Engineering Tools and Modern Data Stack, 30:59).

A team may use Airflow to trigger ingestion and run transformations. It may also call Spark jobs and notify an owner. Each underlying tool still owns the work it runs (Modern Data Stack, Data Engineering Tools). Airflow becomes a good fit when those steps are hard to run, retry, backfill, or explain as separate scripts.

For a deeper production guide, see Apache Airflow. For local learning and DAG development, see Airflow Docker Compose.

Airflow Jobs

Airflow gives data workflows a control plane. Teams define the task graph, retries, and alerts. Airflow records which runs succeeded, which tasks failed, and which downstream steps are blocked (Apache Airflow, Data Engineering Platforms).
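The idea of a task graph with blocked downstream steps can be sketched in plain Python. This is only an illustration using the standard library, not Airflow's API; the task names and the DEPENDENCIES mapping are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "transform_sales": {"ingest_orders", "ingest_customers"},
    "run_tests": {"transform_sales"},
    "publish_table": {"run_tests"},
}


def execution_order(dependencies):
    """Return one valid order in which the tasks can run."""
    return list(TopologicalSorter(dependencies).static_order())


def blocked_by(failed_task, dependencies):
    """Return every task that directly or transitively depends on failed_task."""
    blocked = set()
    changed = True
    while changed:
        changed = False
        for task, upstream in dependencies.items():
            if task in blocked:
                continue
            if failed_task in upstream or upstream & blocked:
                blocked.add(task)
                changed = True
    return blocked


if __name__ == "__main__":
    print(execution_order(DEPENDENCIES))
    print(sorted(blocked_by("ingest_orders", DEPENDENCIES)))
```

An orchestrator keeps this kind of state for every run, so a failed ingestion step shows up together with the transformations and publications that could not start.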

In Natalie’s modern stack episode, Airflow appears after she maps the modern stack. Around 30:59, she explains Airflow as orchestration around those tools. Around 31:31, she connects Airbyte with extract-load work and dbt-style transformation (Natalie Kwong, Data Engineering Tools and Modern Data Stack).

So the practical Airflow question isn’t whether the team wants a familiar tool. The question is whether the team needs one place to manage schedule state and dependency state. Retries, backfills, and run visibility belong in that decision too (Orchestration, DataOps).
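To make the retry part of that decision concrete, here is a minimal sketch of the retry behavior a team would otherwise hand-roll around each script. The function name run_with_retries and its parameters are assumptions for illustration; Airflow configures retries on tasks rather than through a helper like this.

```python
import time


def run_with_retries(task, retries=2, delay_seconds=0.0, sleep=time.sleep):
    """Call task() until it succeeds or the retry budget is used up.

    Returns (result, attempts). Re-raises the last error when every
    attempt fails, so the caller can alert an owner.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            # The first call is not a retry, so allow retries + 1 attempts.
            if attempts > retries:
                raise
            sleep(delay_seconds)
```

Once several scripts each carry their own copy of this logic, moving retry policy into one orchestrator tends to be easier to reason about.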

Stack Fit

In an analytics stack, Airflow usually sits between ingestion and consumption. It can start connector jobs and warehouse transformations, then trigger tests and publish modeled tables (Modern Data Stack, Data Activation). Those tables may feed dashboards, activation, or product analysis.

That doesn’t make Airflow the transformation layer. In the same modern stack discussion, Natalie separates Airbyte-style extract-load work from dbt-style transformation. She then places Airflow around the workflow (Data Engineering Tools and Modern Data Stack, 30:59-31:31).

Use Airflow to coordinate the steps. Keep metric definitions, SQL models, Python modules, and Spark jobs in systems where the team can review and test them (dbt, ETL vs ELT).

The same split shows up in pipeline architecture. Santona Tuli discusses workflow authoring through Airflow and Astronomer around 7:08. Around 26:43, she places orchestration next to Spark and Kafka. She also discusses feature stores and vector databases (Modern Data Pipeline Architecture).

Airflow coordinates a pipeline whose responsibilities are already clear, but it doesn’t make unclear ownership disappear.

Good Fit

Airflow earns its cost when a workflow has dependencies, ownership, recovery, and history that people need to share. Natalie covers analytics pipelines that combine Airbyte and dbt (Natalie Kwong, Data Engineering Tools and Modern Data Stack). Mehdi covers platform teams that need reusable Airflow conventions (Mehdi OUAZZA, Scaling Data Engineering Teams, 17:22). Simon covers ML workflows that separate batch inference from online serving (Simon Stiebellehner, Building Production ML Platforms, 31:15-31:51).

A strong Airflow use case usually has a real pipeline operations need (Data Engineering Platforms, Data Quality and Observability):

several jobs that must run in order
scheduled backfills or partition reruns
shared run history for multiple teams
retries and alert owners
data quality checks before publication
batch ML jobs that need reproducible sequencing
conventions for many similar pipelines
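The backfill and partition-rerun item in the list above can be illustrated with a small sketch. It assumes daily partitions and a hypothetical set of dates that already succeeded; it is not how Airflow stores run state, only the bookkeeping that a backfill implies.

```python
from datetime import date, timedelta


def daily_partitions(start, end):
    """Return every date from start to end, inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days
    return [start + timedelta(days=offset) for offset in range(days + 1)]


def partitions_to_rerun(start, end, succeeded):
    """Return partitions in the range that have no recorded success."""
    return [day for day in daily_partitions(start, end) if day not in succeeded]


if __name__ == "__main__":
    done = {date(2024, 1, 2)}  # hypothetical run history
    print(partitions_to_rerun(date(2024, 1, 1), date(2024, 1, 3), done))
```

If a team finds itself maintaining this bookkeeping by hand, that is a sign the workflow needs shared run history.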

Mehdi OUAZZA gives the platform version of this point. Around 17:22, he says an Airflow cluster isn’t the whole data platform. Teams also need conventions, playbooks, and best practices for using Airflow (Scaling Data Engineering Teams). That connects Airflow to self-service data platforms and DataOps.

In those operating models, templates, deployment paths, tests, alerts, and clear ownership are what make the orchestrator useful.

Too Much Airflow

Airflow isn’t the right first move for every scheduled job. Around 35:46, Andreas Kretz compares Airflow with cloud schedulers and simpler services. Around 41:06, he recommends starting with simpler infrastructure. Teams can move toward Airflow or Kubernetes when they need more logging, insight, and control (From Notebooks to Production).

Adrian Brudaru makes a similar data engineering point. Around 35:37, he names Airflow beside Prefect, Dagster, and GitHub Actions. Around 37:08, he says GitHub Actions can be enough for simple workflows because it avoids the cost of always-on orchestrators (Modern Data Engineering Trends).

Delay Airflow when the workflow is one small script and the schedule is simple. It also makes sense to wait when manual reruns are acceptable and no one needs shared run history (Data Engineering Tools, Batch vs Streaming). Choose a workflow service such as Airflow, Prefect, or Dagster when informal scheduling no longer gives the team enough visibility and recovery (Orchestration).

Airflow vs dbt and Airbyte

Airflow answers a different question than dbt or Airbyte. Airbyte moves data into storage, dbt transforms models, and Airflow schedules the surrounding workflow. Natalie Kwong makes this split in Data Engineering Tools and Modern Data Stack at 30:59-31:31.

That separation is useful in project design.

A DAG can start ingestion and wait for raw data. It can then run transformations and publish a checked table. Keep the DAG focused on orchestration instead of hiding all business logic inside Airflow code. Jeff Katz makes the code-structure point at 55:10 in Build a Data Engineering Career. See Airflow Docker Compose for local DAG structure.
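One way to read the code-structure point is sketched below: the business logic lives in an ordinary function, and the orchestration layer only wires inputs and outputs around it. The names clean_orders, orders_task, extract, and load are hypothetical, and no Airflow API is used here.

```python
def clean_orders(rows):
    """Business logic: plain Python, testable without any orchestrator."""
    return [
        {"order_id": row["order_id"], "amount": round(float(row["amount"]), 2)}
        for row in rows
        if row.get("amount") not in (None, "")
    ]


def orders_task(extract, load):
    """Thin orchestration wrapper around the logic.

    In an Airflow project, a task callable could stay about this small;
    extract and load are hypothetical I/O callables.
    """
    cleaned = clean_orders(extract())
    load(cleaned)
    return len(cleaned)
```

Because clean_orders has no orchestrator dependency, it can be unit tested and reused even if the team later changes schedulers.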

For learners and portfolio projects, this distinction matters. Airflow should demonstrate dependencies, retries, logging, and recovery. It shouldn’t decorate a pipeline that could run as one command (Data Engineering Portfolio Projects, Data Engineering Pipeline Project).

Airflow in ML and Feature Pipelines

Airflow also appears in ML infrastructure when the work is batch-oriented. Around 31:15, Simon Stiebellehner separates batch inference from online serving. Around 31:51, he discusses Airflow and production workflows as orchestration choices (Building Production ML Platforms, ML Platforms).

Feature stores create another boundary. Willem Pienaar explains that some feature stores don’t handle upstream transformations. Feast is one example, and teams may keep those transformations in dbt or Spark. They may also orchestrate them through Airflow before features reach the feature store (Feature Stores for MLOps, 42:30, and Machine Learning Infrastructure).

So Airflow can coordinate upstream feature generation and batch ML jobs. It isn’t the feature store or model registry. It also isn’t the online serving layer or monitoring system (MLOps Tools, Production).

Operating Airflow Well

A green Airflow run only proves that scheduled tasks reached a successful state. It doesn’t prove that the records are complete or that the metric is right. It also doesn’t prove that the downstream dashboard, feature, or activation workflow is safe to use (Data Quality and Observability, Data Activation).

Tomasz Hinc gives the sharpest warning in the archive. Around 1:02:28, he describes green Airflow jobs that inserted zero records. He uses the story to argue for pragmatic edge-case checks and confidence beyond task status (DataOps and GitOps Best Practices for Data Teams).
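A check of the kind this story argues for might look like the following sketch. EmptyLoadError and assert_rows_loaded are hypothetical names, and the minimum threshold is an assumption each team would set per table.

```python
class EmptyLoadError(Exception):
    """Raised when a load step finishes but wrote too few rows."""


def assert_rows_loaded(rows_inserted, minimum=1):
    """Fail loudly instead of letting an empty load count as success."""
    if rows_inserted < minimum:
        raise EmptyLoadError(
            f"expected at least {minimum} rows, got {rows_inserted}"
        )
    return rows_inserted
```

Calling a check like this at the end of a load step turns a silent zero-row insert into a failed task, so the run no longer shows green.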

Good Airflow practice therefore belongs with DataOps:

version pipeline code
control dependencies
test transformations
check data
route alerts
keep runbooks and clear owners

These operating practices connect Airflow to DataOps and to Data Quality and Observability.

Airflow supplies the scheduling and run history. The team still has to define what correct data means and what to do when a run fails.

Learning Airflow

Learn Airflow after you can already build a small data pipeline with Python and SQL. Add storage, transformations, and checks before you add Airflow.

In Build a Data Engineering Career, Jeff Katz places Airflow after Python/SQL in the data engineering learning path. Docker, cloud basics, and warehouses also come before Airflow. Around 55:10, he says good Airflow code keeps most logic in normal Python instead of relying on Airflow-specific code. See Data Engineer Role for the broader role context.

A useful Airflow learning project should make orchestration visible by showing the schedule, dependencies, retries, and logs. It should also include a backfill path, data checks, and owner notes. Then link those pieces to a real data engineering pipeline rather than treating Airflow as the whole project (Airflow Docker Compose, Data Engineering Pipeline Project).

If the project is still a single script with no meaningful dependency graph, start with a command or cron job. A cloud scheduler or GitHub Actions workflow can fit the same simple case. Move to Airflow when the pipeline needs shared operational state.

Andreas Kretz supports that staging path in From Notebooks to Production at 35:46-41:06. Adrian Brudaru supports the same idea in Modern Data Engineering Trends at 35:37-37:08.