Wiki

Data Engineering Tools

A podcast-backed guide to choosing data engineering tools across ingestion, orchestration, storage, transformation, quality, governance, and activation.

Data engineering tools help teams move data from source systems into trusted analytics, operations, and machine learning work. The useful question isn’t “which modern data stack tools should we buy?” It’s “which data flow do we need to make reliable, and who depends on it?”

The DataTalks.Club archive is consistent on that point. In ETL vs ELT and the Modern Data Stack, Natalie Kwong explains the basic stack through ingestion, warehouse loading, and dbt-style transformation. She also covers orchestration, lake storage, change data capture, and reverse data flows.

In Modern Data Engineering, Adrian Brudaru adds a newer view. He covers open table formats, catalogs, and DuckDB. He also covers AI-assisted pipeline work, streaming, and more careful vendor choices. Read those episodes with the Data Engineering, Modern Data Stack, and Data Engineering Platforms wiki pages open.

Stack Map

Most teams combine tools from these categories:

You rarely need every category in one buildout. In the modern data stack episode, Kwong starts with extraction, loading, and warehouse-side transformations. At roughly 30:59-35:42, she adds Airflow, dbt, and reverse data flows. Brudaru warns against vendor-led tool collection around 44:42 in Modern Data Engineering. In that discussion, he says teams should choose tools after they understand the business requirement, team skills, and operating cost.

That matches the archive’s broader role guidance. In Build a Data Engineering Career, Jeff Katz treats Python, SQL, and cloud basics as core skills. He adds orchestration before advanced tools. Around 38:05 and 56:46, he explains why junior programs shouldn’t over-index on Spark, Kafka, and Kubernetes before students can write and reason about pipelines.

Ingestion And ETL vs ELT

Ingestion tools extract data from source systems and load it into a warehouse, lake, lakehouse, or staging area. They include managed connectors, Python ingestion libraries, event collection tools, and change data capture systems.

In ETL vs ELT and the Modern Data Stack: Kwong uses Airbyte to explain the ingestion layer around 3:19. The tool moves data from sources such as ads APIs into warehouses such as Snowflake. Around 45:59, she explains change data capture as syncing only row-level changes instead of reloading a whole source each time. That makes CDC useful when database changes matter and full reloads are too slow or too expensive.

The ETL vs ELT choice shapes the rest of the stack. ETL transforms before loading, which can fit compliance, source constraints, or large enterprise staging needs. Kwong still gives ETL a place around 41:30 when enterprise systems need complex staging. ELT loads first and transforms later, which gives analysts and analytics engineers more room to model in SQL. Around 7:57 and 18:47, Kwong connects ELT to flexibility, warehouse-side transformations, and faster iteration.

Product event ingestion adds another requirement: a tracking plan. In Data-Led Growth, Event Tracking, and Reverse ETL, Arpit Choudhury spends 13:34-22:47 on event naming and properties. He also covers ownership and collection before connecting storage and activation. For product data, a connector alone doesn’t solve the problem. Teams need to know which events exist, what each property means, and who owns changes to the event schema.

Orchestration And DataOps

Orchestration tools coordinate jobs by scheduling ingestion and triggering transformations. They manage dependencies and retries, and they support backfills and alerts. Kwong places Airflow in that role around 30:59 in the modern data stack episode. Brudaru updates the tool landscape around 35:37 in Modern Data Engineering, where he compares Airflow, Prefect, and Dagster. He also discusses GitHub Actions.

Orchestration becomes more important as team size and failure cost grow. In Scale Data Engineering Teams, Mehdi OUAZZA explains that a scale-up data platform needs self-service onboarding and Airflow. Around 12:30-17:22, he adds conventions, playbooks, and best practices. Around 23:26, he moves from batch coordination to Kafka, schema registry, and data contracts for event streaming. The tool choice changes because the team now has more producers, more consumers, and more ways to break each other.

DataOps is the operating layer around those tools. In Mastering DataOps, Christopher Bergh frames the problem as error reduction and deployment cycle time as well as team productivity around 6:42. Around 33:47-51:21, he names version control, tests, and CI/CD.

He also covers runbooks and automation. Later in the same stretch, he discusses dbt and Great Expectations alongside SQL tests and end-to-end versioning. Tools need a release path and recovery process.

Warehouses, Lakes, And Lakehouses

Warehouses fit teams that need governed SQL analytics, BI, marts, and warehouse-side transformation. Kwong ties warehouses to ELT, data marts, dbt-style modeling, and reverse data flows in ETL vs ELT and the Modern Data Stack around 15:30-35:42. Many analytics-heavy teams follow this route. They load raw data, transform it into documented models, and serve BI or operational syncs from trusted tables.

Data lakes fit raw files, logs, media, and semi-structured data. Around 19:50, Kwong describes lakes as storage for files, logs, and media. Around 21:22, she warns that teams can create a data swamp without governance. Use the Data Lake and Data Warehouse pages for the basic split.

Lakehouses add table behavior and transaction semantics on top of open storage. Brudaru focuses on that newer layer in Modern Data Engineering.

Around 18:17-30:31, he explains Apache Iceberg, Parquet storage, and catalogs. He also covers metadata and lineage. The same segment discusses DuckDB, Delta Lake, and headless table formats. Around 49:42, he compares Delta, Hudi, and Iceberg.

Those tools matter when a team wants open storage, multiple compute engines, better cost control, or less vendor lock-in. They also add platform complexity, so compare them with Data Warehouse vs Data Lakehouse, Apache Iceberg, and Delta Lake.

Transformation And Analytics Engineering

Transformation tools turn raw or staged data into models that analysts, product teams, executives, and ML systems can use. In an ELT stack, that often means SQL transformations in the warehouse or lakehouse.

Kwong links ELT to dbt and the rise of the analytics engineer around 12:39-18:47 in ETL vs ELT and the Modern Data Stack. In Master Analytics Engineering, Victoria Perez Mola describes the role from the other side. Around 4:05-10:04, she covers data modeling and pipelines. She also covers data quality and Looker. In the same discussion, she explains SQL transformations and version control.

Around 36:44-38:53, Perez Mola connects dbt cleaning and macros to bad data. She also discusses tests, upstream checks, and schema changes.

That’s why transformation tools belong with Analytics Engineering and dbt, not only with platform engineering. dbt is valuable when it makes business definitions reviewable, testable, documented, and reusable. It’s less useful if a team treats it as a brand name for scattered SQL. Brudaru makes the same point around 31:29 in Modern Data Engineering, where he discusses dbt’s influence and alternatives such as SQLMesh.

Quality, Observability, And Governance

Data quality tools check whether data is fit for use. Data observability tools help teams detect and diagnose changes in that fitness. This category matters as soon as people make decisions, send customer segments, train models, or run operations from the data.

In Data Observability Explained, Barr Moses explains why data teams often find out about problems from executives, customers, or business users around 4:35-6:56. Around 16:38, she lays out five observability pillars. They cover freshness and volume, distribution and schema, plus lineage. Around 24:31-41:03, she separates monitoring from diagnosis. She also discusses root cause analysis, data SLAs, accountability, and runbooks.

Those ideas connect directly to Data Quality and Observability, Data Observability, and Data Governance. A freshness check, schema test, or lineage graph isn’t a decorative platform feature. It helps the team decide whether a dashboard, reverse ETL sync, or ML feature pipeline can still be trusted.

DataOps episodes add the delivery discipline through Mastering DataOps, where Bergh discusses observability, monitoring, and tests. He also covers CI/CD and end-to-end versioning.

In DataOps and GitOps for Data Teams, Tomasz Hinc shows the platform side around 12:40-26:21. He covers Terraform, Terragrunt, Atlantis, and GitOps. He also covers onboarding, secrets, and IAM. Around 1:01:27-1:02:28, he adds fixed versions, Docker, and pragmatic checks. Quality tools work best when teams pair them with ownership, deployment habits, and incident response.

Activation And Reverse ETL

Reverse ETL tools send modeled data from the warehouse into CRM, sales, and support systems. They also feed marketing, engagement, and product tools. This category turns analysis into action, but it also turns analytics definitions into operational dependencies.

Choudhury gives the clearest archive example in the Data-Led Growth episode. Around 28:52-37:25, he walks through warehouses and dbt. He connects them to BI and product analytics. He also covers warehouse-centric tools and reverse ETL products such as Census, Hightouch, and Grouparoo.

Around 37:25-43:50, he discusses customer data platforms and warehouse-first stacks. He also discusses buy-vs-build tradeoffs and the team roles around data-led growth.

Kwong also covers operational reverse data flows around 35:42 in ETL vs ELT and the Modern Data Stack. Teams add reverse ETL when sales, support, marketing, or product teams need trusted segments inside their tools. They may also need lifecycle signals, product-qualified accounts, or customer context. Use Reverse ETL, Data Activation, and Customer Data Platforms for the broader topic.

The risk is concrete too. If identity resolution breaks or a sync becomes stale, customers and internal teams may see the wrong action. A model definition can cause the same problem when it changes without review. Reverse ETL should inherit upstream ownership, tests, and permissions. It also needs lineage, runbooks, and clear definitions.

Tool Choice Checklist

Start with the business use case, then choose the tools.

  1. Name the consumer: analyst, executive, data scientist, ML system, sales team, support team, product team, or customer-facing feature.
  2. Name the action: reporting, experimentation, personalization, forecasting, training, operational alerting, compliance, or activation.
  3. Set freshness needs: daily batch, hourly updates, micro-batch, streaming, or real-time product response.
  4. Pick the storage design: warehouse for governed SQL analytics, lake for raw files and flexible storage, or lakehouse for open table formats and multiple compute engines.
  5. Choose transformation ownership: central data engineering, analytics engineering, domain teams, or a self-service platform with guardrails.
  6. Add quality gates where failure is costly: schema checks, freshness checks, row counts, uniqueness tests, lineage, and alert routing.
  7. Add orchestration when dependencies, retries, backfills, and ownership need a control plane.
  8. Add DataOps practices before the stack becomes business critical. Start with version control plus tests, then add CI/CD and recovery routines with runbooks and deployment paths.
  9. Check maintenance cost, security, governance, lock-in, and team skills before adding specialized tools.

The episodes converge on the same rule. Katz starts with SQL, Python, cloud basics, and orchestration before tool sprawl in Build a Data Engineering Career. Kwong starts with the movement of data and the ETL/ELT tradeoff in the modern data stack episode. Brudaru pushes teams to choose tools only after understanding requirements and operating cost in Modern Data Engineering. Bergh and Moses add the reliability layer through DataOps and data observability.

Use these pages to go deeper into the main tool categories: