Airflow Docker Compose: Local Setup for Data Pipeline Projects

A practical setup for running Airflow locally with Docker Compose for data pipeline projects, with DAG structure, mounted code, checks, logs, and limits.

Related Wiki Pages

Apache Airflow
Orchestration
Data Pipelines
Data Engineering Tools
Data Engineering Platforms
DataOps
Data Quality and Observability
End-to-End Data Pipeline Project

Use Airflow with Docker Compose once your pipeline has steps that Airflow should coordinate. Docker Compose can start the Airflow web UI, the scheduler, and the metadata database on a laptop, and it can mount the DAG and log folders into the containers. That makes the setup useful for learning, DAG development, teaching, and portfolio review.

This local setup guide sits under Apache Airflow, which covers the tool concept, and Orchestration, which covers the broader control-plane structure. The broader pipeline build sequence lives in How to Build Data Pipelines and End-to-End Data Pipeline Project.

Daniel Egbo gives concrete Airflow-in-Compose evidence in From Radio Astronomy to Machine Learning and Data Engineering. At 42:48-46:52, he discusses course projects with Airflow. The local services include MinIO, Spark, and MySQL. He also covers a warehouse path, Compose setup, environment variables, and the Airflow web server.

Gloria Quiceno adds the reproducibility reason. At 21:25 in Get a Data Analytics and Data Engineering Job, she explains how Docker made scripts easier to share and run across machines.

Start With A Local Airflow Use Case

Start with one local workflow, not a full platform. A good first project has one source, one raw output, one transformation, one check, and one published result. Airflow should coordinate those steps after the work is already clear.

Natalie Kwong gives the tool boundary in ETL vs ELT and Modern Data Engineering at 30:59. Airflow schedules and orchestrates work. Airbyte handles extract-load work, and dbt handles warehouse transformations. Santona Tuli names Airflow, Prefect, Dagster, and Mage as orchestration choices at 26:43 in Modern Data Pipeline Architecture. The important decision is how the workflow is broken up, not which scheduler name appears in the README.

Use Docker Compose when the project needs several of these local pieces:

A scheduler that reads DAGs and creates task runs.
A web UI where a reviewer can look at runs and logs.
A metadata database that preserves DAG and task state.
Mounted folders for DAGs, logs, plugins, and project code.
Local services that mimic real dependencies, such as a database or object store.

Use Compose For Airflow Runtime Pieces

The official Airflow Docker guide provides the current Compose quick start for learning and exploration. Use the Apache Airflow Docker Compose guide for the exact file and version-specific commands. Don’t treat the official quick-start Compose file as a production deployment because it doesn’t provide production security guarantees.

Daniel’s local Airflow project shows why these runtime pieces matter in a learning setup. His Compose stack included Airflow plus local services such as MinIO, Spark, and MySQL. The setup also exposed environment variables and the Airflow web server (42:48-46:52).

For a portfolio project, keep the setup visible and boring:

project/
  dags/
  logs/
  plugins/
  include/
  src/
  tests/
  docker-compose.yaml
  .env
  README.md

Use these folders deliberately:

Put DAG definitions in dags/.
Put normal Python modules in src/.
Put small sample inputs, SQL files, or config files in include/.
Persist logs/ so failed task runs remain visible after a restart.
Keep tests outside Airflow so they can run without the scheduler.

The local startup sequence should be easy to explain:

Download or copy the official Compose file for the Airflow version you use.
Create dags/, logs/, plugins/, and project-code folders.
Set the local Airflow user ID in .env when the host needs it.
Initialize Airflow metadata and the first user.
Start the stack.
Open the Airflow UI and trigger one DAG.
Look at task logs and the output table, file, or dashboard input.
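
The folder and .env steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not part of the official guide: the scaffold function and the FOLDERS list are hypothetical names that follow the layout on this page, AIRFLOW_UID is the variable the official quick-start Compose file reads, and 50000 is the value that guide suggests setting manually on hosts without a numeric user ID. The commands in STARTUP_COMMANDS mirror the official quick start, where airflow-init is that file's init service; verify them against the guide for your Airflow version.

```python
import os
from pathlib import Path

# Folder names follow the layout on this page; adjust to your project.
FOLDERS = ["dags", "logs", "plugins", "include", "src", "tests"]

# Commands as used in the official quick start; verify for your version.
# They are listed here for the README, not executed by this script.
STARTUP_COMMANDS = [
    ["docker", "compose", "up", "airflow-init"],  # metadata DB + first user
    ["docker", "compose", "up", "-d"],            # start the stack detached
]


def scaffold(project_dir: Path) -> Path:
    """Create the project folders and a .env file holding the host user ID."""
    for name in FOLDERS:
        (project_dir / name).mkdir(parents=True, exist_ok=True)

    # os.getuid does not exist on Windows; fall back to 50000 (assumption
    # taken from the official guide's manual .env suggestion).
    uid = os.getuid() if hasattr(os, "getuid") else 50000

    env_file = project_dir / ".env"
    if not env_file.exists():  # never overwrite an existing .env
        env_file.write_text(f"AIRFLOW_UID={uid}\n")
    return env_file
```

Calling scaffold(Path.cwd()) from the project root prepares the folders. The Compose file itself still comes from the official guide.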

Keep DAGs Thin

Airflow DAG files should describe order, ownership, retries, parameters, and calls into real work. They shouldn’t contain most extraction, transformation, or validation logic.

Lars Albertsson gives the platform version in DataOps 101 for Scaling Data Platforms. Around 30:34-35:57, he places a workflow engine next to storage and compute. The engine tracks dependencies and schedules work. It retries after late data, transient infrastructure failures, or bugs. The compute layer does the real processing.

Jeff Katz makes the same point for learners in Data Engineering Career Path and Skills. Around 55:10, he says good Airflow code keeps most logic in normal Python instead of relying on Airflow for everything. In a Compose project, that means the DAG should call tested modules, SQL, or dbt commands. It can also call containerized jobs. The DAG shouldn’t become the only place where the pipeline logic exists.
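
A thin DAG might look like the sketch below. It is one possible shape, not a method taken from the speakers above: clean_orders, build_dag, and orders_pipeline are hypothetical names, the sample rows are made up, and the decorator import assumes Airflow 2.4 or later (newer releases may prefer a different import path). Airflow is imported inside build_dag only so the plain function can be imported and tested without Airflow installed.

```python
from datetime import datetime


def clean_orders(rows):
    """Plain Python: drop rows without an order_id and normalise amounts."""
    cleaned = []
    for row in rows:
        if not row.get("order_id"):
            continue  # skip rows that cannot be keyed
        cleaned.append(
            {"order_id": row["order_id"], "amount": round(float(row["amount"]), 2)}
        )
    return cleaned


def build_dag():
    """Wire the tested function into a DAG. The DAG only orders and delegates."""
    from airflow.decorators import dag, task  # assumption: Airflow 2.4+

    @dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
    def orders_pipeline():
        @task(retries=2)
        def extract():
            # Hypothetical inline sample; a real task would read include/ or an API.
            return [
                {"order_id": "a1", "amount": "10.50"},
                {"order_id": "", "amount": "3"},
            ]

        @task
        def transform(rows):
            return clean_orders(rows)  # real logic lives outside the DAG

        transform(extract())

    return orders_pipeline()
```

In a real project, clean_orders would live in src/ with its own test in tests/. The file in dags/ would import Airflow at the top and end with a module-level assignment such as orders_dag = build_dag().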

Mount Code, Data, And Logs Deliberately

Mount only the folders that explain the project. A reviewer should see where DAGs live, where supporting code lives, and where logs appear after a task fails. A small local database or object store can make the workflow realistic, but each extra service should support the story.

Daniel’s course-project discussion is useful here because he worked through Airflow with MySQL, MinIO, Spark, and a warehouse in a local learning context. Not every project needs all of those services. Use Compose when the project needs to show how orchestration connects the runtime pieces.

Gloria’s capstone gives a simpler version. Around 50:15 in Get a Data Analytics and Data Engineering Job, she discusses a Twitter data pipeline with Docker containers and Slack delivery. That kind of project can use Airflow later when the run path needs scheduling, dependency state, retries, or backfills.

Pin dependency versions in the image or requirements file. In DataOps and GitOps Best Practices, Tomasz Hinc describes an unpinned Python dependency causing a containerized job failure after an API change around 1:01:27. Local Compose should reduce dependency surprises, not hide them.
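
One way to keep unpinned packages visible is a small check that runs in tests or CI. The sketch below is an assumption about how such a check could look, not something Tomasz describes: unpinned_requirements is a hypothetical helper, and it treats only exact == pins as pinned.

```python
import re

# A line counts as pinned only when it uses an exact '==' version.
PINNED = re.compile(r"^[A-Za-z0-9._\[\],-]+\s*==\s*[\w.!+*-]+")


def unpinned_requirements(text):
    """Return requirement lines that lack an exact version pin."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            problems.append(line)
    return problems
```

A test can read requirements.txt and assert that the returned list is empty, so a loosened pin fails before the image is rebuilt.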

Add Checks, Logs, And Rerun Behavior

A green Airflow run proves that tasks finished, not that the data is correct. Tomasz Hinc makes this boundary explicit around 1:02:50 in DataOps and GitOps Best Practices. He describes Airflow jobs that were green while the output had zero rows. Add a data check that fails when the output is wrong instead of trusting the Airflow UI alone.

Add checks that can fail the run:

Source file or API response exists.
Row count is above a minimum.
Required columns are present.
Primary key or serving-table grain is unique.
Freshness is within the expected window.
Nulls and accepted values match the consumer expectation.
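
Several of the checks above can be sketched as one plain Python function that raises when the output is wrong. This is a minimal sketch under assumptions: DataCheckError and check_output are hypothetical names, rows are dictionaries already loaded into memory, and loaded_at is an assumed timezone-aware timestamp column. When a task calls it, any uncaught exception marks the task as failed.

```python
from datetime import datetime, timedelta, timezone


class DataCheckError(Exception):
    """Raised when an output fails a check."""


def check_output(rows, required, key, ts_column="loaded_at",
                 min_rows=1, max_age=timedelta(hours=24), now=None):
    """Run row-count, column, uniqueness, and freshness checks on dict rows."""
    now = now or datetime.now(timezone.utc)

    # Row count is above a minimum.
    if len(rows) < min_rows:
        raise DataCheckError(f"expected >= {min_rows} rows, got {len(rows)}")

    # Required columns are present and not null (key and timestamp included).
    needed = sorted(set(required) | {key, ts_column})
    for i, row in enumerate(rows):
        missing = [col for col in needed if row.get(col) is None]
        if missing:
            raise DataCheckError(f"row {i} missing or null: {missing}")

    # Primary key is unique.
    keys = [row[key] for row in rows]
    if len(keys) != len(set(keys)):
        raise DataCheckError(f"duplicate values in key column '{key}'")

    # Freshness is within the expected window.
    newest = max(row[ts_column] for row in rows)
    if now - newest > max_age:
        raise DataCheckError(f"newest row is older than {max_age}")
```

Passing now explicitly keeps the freshness check deterministic in tests.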

Use one deliberate failure in the project by breaking the input path, adding a duplicate row, or making a freshness check fail. Show how Airflow records the failure and where the log appears. Then show how the DAG reruns after the input is fixed.
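
The broken-input scenario can be rehearsed outside Airflow first. The sketch below is an illustration with hypothetical names (load_orders, demo, orders.csv): the first call fails because the file is missing, the input is then fixed, and the same code reruns cleanly. Inside a task, the same exception would fail the run and appear in the task log.

```python
import csv
import tempfile
from pathlib import Path


def load_orders(path):
    """Task body: fail loudly when the input is missing or empty."""
    if not path.exists():
        raise FileNotFoundError(f"input not found: {path}")
    with path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    if not rows:
        raise ValueError(f"input has no data rows: {path}")
    return rows


def demo():
    """Break the input path, record the failure, fix the input, and rerun."""
    events = []
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "orders.csv"  # hypothetical input file
        try:
            load_orders(path)  # first run: the file does not exist yet
        except FileNotFoundError as exc:
            events.append(f"failed: {type(exc).__name__}")
        path.write_text("order_id,amount\na1,10.50\n")  # fix the input
        rows = load_orders(path)  # rerun with unchanged code
        events.append(f"succeeded: {len(rows)} row")
    return events
```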

The failed run and the rerun illustrate Lars Albertsson’s workflow-engine discussion around 30:34-35:57 in DataOps 101 for Scaling Data Platforms: the workflow engine owns dependencies, schedules, and retries, while the compute layer does the processing.

Know Where Local Compose Stops

Docker Compose isn’t a production Airflow platform. It’s a local environment for learning, development, and repeatable review.

Mehdi OUAZZA gives the platform boundary in Scaling Data Engineering Teams. Around 17:22-19:25, he says an Airflow cluster is only one part of a platform. Teams also need naming conventions, sequence rules, playbooks, onboarding, and support paths. A local Compose file is even further from a production platform than a bare cluster is.

Avoid Airflow when one simpler scheduler would make the project clearer. Andreas Kretz compares Airflow with CloudWatch and Lambda around 35:46 in From Notebooks to Production. Around 41:06, he recommends starting simple and moving toward Airflow or Kubernetes when the workflow needs more control. Adrian Brudaru makes the same point around 35:37-37:08 in Modern Data Engineering Trends: GitHub Actions can be enough for simple workflows.
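
For comparison, the simpler option can be as small as a script that runs steps in order and returns an exit code. This is a sketch of that idea with hypothetical names (run_steps and the step functions you pass in), not a pattern quoted from Andreas or Adrian; any scheduler that reacts to a non-zero exit code could trigger it.

```python
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")


def run_steps(steps):
    """Run (name, callable) pairs in order; stop at the first failure.

    Returns 0 when every step succeeds and 1 otherwise, so the caller can
    pass the result to sys.exit and let the scheduler mark the job as failed.
    """
    for name, step in steps:
        log.info("starting %s", name)
        try:
            step()
        except Exception:
            log.exception("step %s failed", name)  # full traceback in the log
            return 1
    return 0
```

A script entry point would call sys.exit(run_steps([...])) with its own extract, transform, and check functions. Once the project needs dependency state, retries per task, or backfills, that is the point where Airflow starts to pay for its setup cost.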

Move beyond local Compose when the project needs operational controls:

shared deployment
managed secrets
searchable log retention
multiple users
worker isolation or autoscaling
production alerting
access control

Airflow Docker Compose Setup Checklist

Use this local project checklist:

Name the data product, consumer, and refresh cadence, matching Santona Tuli’s emphasis on marts, dashboards, business questions, and persona-driven pipeline design around 43:05 and 52:54 in Modern Data Pipeline Architecture.
Keep one DAG focused on one pipeline so Airflow tracks one dependency chain at a time. Lars describes dependency tracking, scheduling, and retries as workflow-engine responsibilities around 30:34-35:57 in DataOps 101 for Scaling Data Platforms.
Put business logic in src/, SQL files, dbt models, or separate containers, following Jeff Katz on keeping most Airflow project code in normal Python around 55:10 in Data Engineering Career Path and Skills.
Mount dags/, logs/, plugins/, and supporting project code, as in Daniel Egbo’s local Airflow project around 42:48-46:52 in From Radio Astronomy to Machine Learning and Data Engineering. His setup included MinIO, Spark, MySQL, environment variables, and the Airflow web server.
Persist logs and metadata long enough to look at failures, because Tomasz treats log reading and troubleshooting as basic operating skills around 44:23 in DataOps and GitOps Best Practices.
Pin Python, provider, and system dependencies. Tomasz’s dependency example around 1:01:27 shows how an unpinned Python package can break a containerized job after an API change.
Add at least one data check that can fail the run. Tomasz’s Airflow example around 1:02:50 shows why green tasks aren’t enough.
Show one retry, rerun, or backfill scenario, matching Lars’s workflow-engine discussion around 30:34-35:57 in DataOps 101 for Scaling Data Platforms.
Document the startup steps in the README, so the project keeps the sharing benefit Gloria Quiceno describes for Docker around 21:25 in Get a Data Analytics and Data Engineering Job.
Link back to the production boundary so readers don’t treat the local stack as a platform. Mehdi OUAZZA puts Airflow inside a broader platform with conventions, playbooks, onboarding, and support paths around 17:22-19:25 in Scaling Data Engineering Teams.

When you use this page for interview or portfolio preparation, don’t stop at “I ran Airflow.” Show that one local command starts the pipeline, that the DAG calls real tested work, and that a bad input fails visibly. Then explain when a simpler scheduler would have been enough.