2024–2025

Data Engineering

Cloud platforms, tools, and challenges in data engineering workflows.

Which data storage solutions do you use? (Select all that apply)

Relational databases lead at 71%, with warehouses and lakehouses both at 54%. Data lakes (50%), NoSQL (28%), and vector DBs (19%) follow. Most teams run a mix of traditional and modern storage.

Which data warehouse solutions do you use? (Select all that apply)

BigQuery (39%) and Snowflake (32%) lead. Redshift (25%) and Synapse (20%) are next; ClickHouse (8%) has a smaller share. Cloud warehouses are consolidated around the major vendors.

Which data lake solutions do you use? (Select all that apply)

S3 leads at 53%, with GCS (34%) and Azure Data Lake (31%) next. HDFS (19%) is still present but cloud object storage dominates.

Do you use any lakehouse architecture solutions? (Select all that apply)

58% don't use a lakehouse. Among the solutions in use, Databricks (31%) leads; Iceberg (13%) and Delta Lake (12%) are the main open table formats. Lakehouse was still emerging for many teams.

Which workflow orchestration tools do you use to manage data pipelines? (Select all that apply)

Airflow dominates at 48%; 36% don't use orchestration. Step Functions (12%), Mage (7%), Prefect (7%), and Dagster (5%) follow. Airflow was the standard.

Which data integration or ETL/ELT tools do you use? (Select all that apply)

dbt leads at 34%; 46% don't use ETL/ELT tools. Airbyte (8%), Fivetran (8%), dlt (7%), and Talend (5%) follow. Many still relied on custom or manual solutions.

Which frameworks do you use for data processing? (Select all that apply)

Pandas (70%) and Spark (47%) lead; 15% don't use dedicated frameworks. Flink (8%), Beam (5%), and Dask (4%) have smaller footprints.

Do you use any data observability or monitoring tools for your pipelines? (Select all that apply)

77% don't use data observability tools. Great Expectations (10%) and Monte Carlo (6%) are the most cited; Soda and Databand see little use. Observability was under-adopted.

How do you ensure data quality in your workflows? (Select all that apply)

49% do manual checks and 39% have automated tests; 27% have no dedicated practices. Validation tools like Great Expectations (23%) are used by a smaller share.

Which data governance tools or practices do you use? (Select all that apply)

65% don't use data governance tools. Manual cataloging (20%) is most common; Apache Atlas (6%), Collibra (5%), and Alation (4%) have limited use. Governance was a clear gap.

Do you work with real-time data processing?

46% don't do real-time; 28% have minimal requirements and 26% use dedicated frameworks (Kafka, Flink). Batch was still the norm for most.

Cloud Platforms for Data Engineering

AWS (47%), Azure (35%), and GCP (34%) lead; 22% use on-premises infrastructure. Multi-cloud and hybrid setups were common.
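As a rough check on the multi-cloud reading, the reported shares can be summed. The sketch below assumes all four percentages are measured against the same respondent base; the function name `selection_summary` is illustrative and not part of the survey.

```python
def selection_summary(shares_pct):
    """Return (sum of shares, average selections per respondent).

    Assumes every share is a percentage of the same respondent base.
    """
    total = sum(shares_pct.values())
    return total, total / 100


# Shares as reported for this question (percent of respondents).
platform_share = {"AWS": 47, "Azure": 35, "GCP": 34, "On-premises": 22}

total, avg = selection_summary(platform_share)
print(total, avg)  # 138 1.38
```

An average above 1 selection per respondent is only possible if some respondents picked more than one option, which is consistent with multi-cloud and hybrid use. The sum alone does not say how many respondents did so.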

Challenges in Data Engineering

Data Quality (69%); Integrating Heterogeneous Data Sources (60%); Scaling Data Pipelines (54%); Security, Privacy, and Compliance (40%)

Data quality (69%) and integrating heterogeneous data sources (60%) are the top two challenges. Scaling pipelines (54%) and security, privacy, and compliance (40%) follow. The same themes persist today.

Data Engineering Team Size

50% have 1–5 people; 24% have no dedicated team (0). Teams of 6–10 and 11–20 account for about 9% and 8%, respectively; larger teams (21–50, 51+) are a minority. Small teams were the norm.
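The share left for larger teams can be estimated by subtraction. This is a minimal sketch, assuming the size brackets are mutually exclusive and together cover all respondents; because the 9% and 8% figures are approximate, the result is approximate too.

```python
# Shares as reported above (percent of respondents); 9 and 8 are approximate.
team_size_share = {"0": 24, "1-5": 50, "6-10": 9, "11-20": 8}

# Whatever is not covered by the listed brackets falls to 21-50 and 51+ combined.
larger_teams_share = 100 - sum(team_size_share.values())
print(larger_teams_share)  # 9
```

Under these assumptions, roughly 9% of respondents work in teams of 21 or more.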