Data Engineering
Cloud platforms, tools, and challenges in data engineering workflows.
Working with Data Engineering Tools
Everyone's doing data engineering: it's the foundation for analytics, ML, and BI. No real split here; data work is table stakes for most teams.
Which data storage solutions do you use? (Select all that apply)
Relational databases lead at 76%; Postgres, MySQL, and the like are everywhere. Data lakes (65%) and warehouses (57%) are next; 30% use lakehouses. NoSQL (19%) and vector DBs (9%) serve specific use cases. Most teams run a mix.
Which data warehouse solutions do you use? (Select all that apply)
BigQuery (34%) and Snowflake (32%) are neck and neck; the cloud warehouse space is split. Redshift (17%), Synapse (15%), and ClickHouse (15%) follow. No single winner; it depends on cloud and scale.
Which data lake solutions do you use? (Select all that apply)
S3 leads at 48%; it's the default for object storage. Azure Data Lake (33%) and GCS (28%) are next; 17% still use HDFS. Cloud object storage has mostly replaced on-prem lakes.
Do you use any lakehouse architecture solutions? (Select all that apply)
34% use Databricks and 34% don't use a lakehouse at all: split down the middle. Delta Lake (20%) and Iceberg (12%) are the main open table formats; Hudi and Fabric show up at single digits. Lakehouse is still emerging for many teams.
Which workflow orchestration tools do you use to manage data pipelines? (Select all that apply)
Airflow dominates at 46%; it's the default for pipeline orchestration. 19% don't use orchestration tools at all. Step Functions (15%), Prefect (10%), and Kestra (6%) are next; Dagster and others are niche. Airflow is still the standard.
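At its core, what Airflow and its peers do is run tasks in dependency order. A minimal, library-free sketch of that idea using the standard library's topological sorter (the task names and dependencies here are invented for illustration, not from any real pipeline):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_users": set(),
    "transform_join": {"extract_orders", "extract_users"},
    "load_warehouse": {"transform_join"},
}

def run(task: str) -> None:
    # Stand-in for real work (API calls, SQL, file moves).
    print(f"running {task}")

# static_order() yields tasks so every dependency runs before its dependents.
order = list(TopologicalSorter(pipeline).static_order())
for task in order:
    run(task)
```

Real orchestrators add scheduling, retries, and backfills on top, but the dependency graph is the heart of it.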
Which data integration or ETL/ELT tools do you use? (Select all that apply)
dbt leads at 46%; the modern data stack is dbt-first. 22% don't use ETL/ELT tools; dlt (20%), Airbyte (9%), and Fivetran (7%) follow. Transform-in-warehouse with dbt has won for a lot of teams.
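The transform-in-warehouse pattern dbt popularized amounts to layering SQL models over raw tables inside the warehouse itself. A rough stand-in using stdlib sqlite3 (the table and column names are invented; a real dbt project would template this SQL and manage the dependency graph between models):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "Raw" table, as an EL tool would land it.
    CREATE TABLE raw_orders (id INTEGER, amount REAL, status TEXT);
    INSERT INTO raw_orders VALUES
        (1, 120.0, 'paid'), (2, 35.5, 'refunded'), (3, 80.0, 'paid');

    -- A dbt-style staging "model": a view that cleans and filters raw data.
    CREATE VIEW stg_paid_orders AS
        SELECT id, amount FROM raw_orders WHERE status = 'paid';
""")

total = conn.execute("SELECT SUM(amount) FROM stg_paid_orders").fetchone()[0]
print(total)  # 200.0
```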
Which frameworks do you use for data processing? (Select all that apply)
Pandas is on top at 70%, still the go-to for Python data work. Spark (58%) handles scale; 14% don't use dedicated frameworks. Polars (8%) and Flink (6%) are growing. Pandas + Spark covers most use cases.
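Most of that framework work is tabular transforms: filter, group, aggregate. A dependency-free sketch of the shape of it, with invented data; pandas or Polars would express the same group-and-sum in a line or two, and Spark distributes it across a cluster:

```python
from collections import defaultdict

# Hypothetical rows, as they might come off an API or CSV.
rows = [
    {"region": "eu", "revenue": 100},
    {"region": "us", "revenue": 250},
    {"region": "eu", "revenue": 50},
]

# Group-by-and-sum: the bread-and-butter operation of any dataframe library.
totals: dict[str, int] = defaultdict(int)
for row in rows:
    totals[row["region"]] += row["revenue"]

print(dict(totals))  # {'eu': 150, 'us': 250}
```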
Do you use any data observability or monitoring tools for your pipelines? (Select all that apply)
77% don't use data observability tools, a big gap compared to application monitoring. Great Expectations (5%) and a long tail of custom or niche tools (Datadog, Datafold, Monte Carlo, etc.) show up. Data observability is still under-adopted.
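Data observability in its simplest form is just recording pipeline-level signals, row counts and freshness being the usual starting points, and alerting when they drift. A bare-bones stdlib sketch; the thresholds and values here are hypothetical, and real tools track these over time rather than per run:

```python
import datetime as dt

def check_freshness(last_loaded: dt.datetime, max_age: dt.timedelta) -> bool:
    """Return True if the table was loaded recently enough."""
    return dt.datetime.now(dt.timezone.utc) - last_loaded <= max_age

def check_row_count(count: int, expected_min: int) -> bool:
    """Catch silently-empty loads, a common pipeline failure mode."""
    return count >= expected_min

# Hypothetical values a scheduler would pass in after each run.
fresh = check_freshness(
    dt.datetime.now(dt.timezone.utc) - dt.timedelta(hours=2),
    max_age=dt.timedelta(hours=6),
)
enough_rows = check_row_count(count=10_432, expected_min=1_000)
print(fresh, enough_rows)  # True True
```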
How do you ensure data quality in your workflows? (Select all that apply)
53% do manual checks and 51% have automated tests in pipelines; most teams use both. 19% have no dedicated data quality practices. Automated validation tools like Great Expectations (16%) are used by a smaller share. Quality is often manual or pipeline-embedded.
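"Automated tests in pipelines" often just means assertion-style checks on each batch before it loads, the same idea Great Expectations packages up with more polish. A plain-Python sketch; the schema and rules are hypothetical:

```python
def validate(batch: list[dict]) -> list[str]:
    """Return human-readable failures; an empty list means the batch passes."""
    errors = []
    for i, row in enumerate(batch):
        if row.get("user_id") is None:
            errors.append(f"row {i}: user_id is null")
        if not isinstance(row.get("amount"), (int, float)):
            errors.append(f"row {i}: amount is not numeric")
        elif row["amount"] < 0:
            errors.append(f"row {i}: amount is negative")
    return errors

good = [{"user_id": 1, "amount": 9.99}]
bad = [{"user_id": None, "amount": -5}]
print(validate(good))  # []
print(validate(bad))
```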
Which data governance tools or practices do you use? (Select all that apply)
51% don't use data governance tools, another clear gap. Data cataloging (37%) is the most common practice; Apache Atlas and commercial tools (Collibra, Alation, Purview) show up at low rates. Governance is still manual or skipped for many.
Do you work with real-time data processing?
43% don't do real-time; 31% have minimal requirements and 27% use dedicated frameworks (Kafka, Flink). Batch is still the norm for most; real-time is for a subset of use cases.
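What dedicated streaming frameworks like Flink add over batch is mostly windowing and state over an unbounded stream. A toy tumbling-window count in plain Python over a simulated stream (the events and window size are fabricated); real engines handle out-of-order events, watermarks, and fault-tolerant state on top of this core idea:

```python
from collections import Counter

# Simulated stream of (event_time_seconds, event_type) pairs.
events = [(1, "click"), (4, "click"), (6, "view"), (11, "click"), (13, "view")]

WINDOW = 5  # tumbling window size in seconds

# Assign each event to a window by flooring its timestamp to the window start.
counts: Counter = Counter()
for ts, kind in events:
    window_start = (ts // WINDOW) * WINDOW
    counts[(window_start, kind)] += 1

print(sorted(counts.items()))
```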
Cloud Platforms for Data Engineering
AWS leads at 40%, with Azure close behind at 35%. 27% run on-premises and 25% use GCP; there's a lot of multi-cloud and hybrid. Platform-agnostic tools matter because people run workloads everywhere.
How would you describe your data engineering maturity?
46% are emerging (scheduled pipelines, partial standardization). 33% are established, with orchestration, monitoring, and versioning; 15% are still ad hoc. Only 6% say advanced (governed platform, SLAs, self-serve). Most teams are in the middle of the maturity curve.
Role of Data Engineering Tools
62% say their data tools are mission-critical: when pipelines break, everything breaks. 25% use them regularly but not for core workloads, and only 12% are still in experimental mode. Most teams are past the pilot stage.
Challenges in Data Engineering
Data quality is the top concern at 57%: garbage in, garbage out. Scaling pipelines (51%) and integrating heterogeneous sources (49%) are next; cost (45%) and lack of standards or governance (39%) round it out. You need both good tech and clear ownership.
Data Engineering Team Size
49% have 1–5 people; small teams are the norm. 15% have no dedicated data engineering team at all, and 13% each have 6–10 or 11–20. Larger teams (21–50, 51+) are a minority. Modern tools let small teams own a lot.
Which data engineering tools or technologies do you plan to adopt or expand in the next 12 months?
Databricks (23%) and Apache Spark (20%) lead the adoption list, with cloud platforms (20%) and workflow orchestration like Airflow/Prefect/Kestra (17%) next. Snowflake (13%), observability (13%), and governance/quality tools (10%) are also on the roadmap. Plans are spread; teams are investing across the stack.