Modern Data Engineering: Iceberg, Delta Lake & AI-Powered Pipelines | Adrian Brudaru
Show Notes
How can engineering teams build reliable, scalable lakehouse pipelines that combine transactional table formats with AI-driven automation? In this episode, Adrian Brudaru, an economics-trained analyst turned freelance data practitioner and co-founder of DLT Hub, a company focused on open source data tooling, joins us to explore the realities of modern data engineering.
Adrian draws on years of startup and freelance experience, and on his current mission to democratize data engineering through open source, to discuss the practical trade-offs between Iceberg and Delta Lake, how table formats fit into a data lakehouse architecture, and where AI can augment pipeline development and observability. Key topics include selecting the right table format for versioning and governance, integrating AI-powered features into ETL/ELT workflows, and the role of open source tools in scaling data platforms.
Listen to gain grounded perspectives on Iceberg, Delta Lake, AI-powered pipelines, and data pipeline best practices—especially useful for data engineers, architects, and engineering managers evaluating lakehouse strategies or looking to adopt open source solutions.
Transcript
The transcripts are edited for clarity, sometimes with AI. If you notice any incorrect information, let us know.
Episode opening & guest introduction
Alexey: This week, we’ll talk about trends in data engineering. Our special guest today is Adrian, a returning guest. This is his third time on the podcast, but he’s also been part of workshops and open-source demos. Many of you probably know him already. (1:00)
Alexey: Adrian is the co-founder of DLT Hub, the company behind DLT, which we’ll discuss today. We’ll also talk about trends in data engineering. When we launched our Data Engineering Zoomcamp recently, one of the questions from participants was, “How do you see data engineering evolving, and what are the trends?” I realized I couldn’t answer that, so I thought of Adrian. (1:00)
Alexey: Welcome to the podcast, Adrian. It’s a pleasure to have you here again. (1:00)
Perspective on evolving data engineering challenges
Adrian: The pleasure is mine. I don’t have all the answers, but I can share my perspective and observations. (2:23)
Alexey: That’s what makes it interesting. Everyone has opinions, and you’re much closer to data engineering than I am. (2:35)
Alexey: As always, the questions for today’s interview were prepared by Johanna Bayer. Thank you, Johanna, for your help. (2:35)
Alexey: Before we dive into trends, let’s start with your background. Can you briefly share your career journey? (2:35)
Career journey: startups, freelancing, founding DLT
Adrian: Sure. I started in data in 2012. I spent five years working at startups in Berlin, building data warehouses. Then I joined an enterprise but didn’t enjoy it, so I became a freelancer for five years. (3:10)
Adrian: After freelancing, I wanted to do more than just consulting. I didn’t enjoy managing an agency, so I decided to build DLT, a tool I wish I had as a data engineer. That’s the short version. (3:10)
Alexey: You mentioned DLT is a tool you wish you had as a data engineer. Do you use it often? (3:56)
DLT as a Python-based ingestion standard and market impact
Adrian: I use it all the time. As a data engineer, much of your work involves data ingestion—taking data from semi-structured formats like JSON and transforming it into a structured format. DLT automates this process, so there’s no reason to use anything else. (4:03)
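As a concrete illustration of the ingestion Adrian describes, here is a minimal sketch using DLT's pipeline API. The pipeline, dataset, and table names are made up for the example, and DuckDB is used as the destination only to keep it self-contained.

```python
import dlt

# Nested, semi-structured records, e.g. pulled from an API as JSON
rows = [
    {"id": 1, "user": {"name": "Ana", "country": "DE"}, "tags": ["new", "eu"]},
    {"id": 2, "user": {"name": "Bo", "country": "US"}, "tags": []},
]

# dlt infers the schema, flattens nested objects, breaks lists out into
# child tables, and loads the result into the destination
pipeline = dlt.pipeline(
    pipeline_name="demo_ingestion",  # illustrative names
    destination="duckdb",
    dataset_name="raw",
)
load_info = pipeline.run(rows, table_name="events")
print(load_info)
```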
Alexey: Do you remember when we last spoke on the podcast? (4:26)
Adrian: Just over a year ago, I think. (4:31)
Alexey: Yes, over a year ago. You were already at DLT, and we talked about your journey from freelancing to founding a company. (4:33)
Alexey: Since then, a lot has changed. I’ve enjoyed visiting your office and seeing your team grow. You’ve moved to a bigger space, and it’s been amazing to witness your progress. (4:33)
Alexey: From an outsider’s perspective, it’s been exciting. Can you share what’s been happening internally over the past year? (4:33)
Adrian: We started by squatting in other people’s offices, but now we’ve grown significantly. Last year, I talked about creating a standard for data ingestion with DLT, and I think we’ve succeeded. (5:53)
Adrian: DLT has become a standard for Python-based data ingestion. We’ve commoditized this market, making it accessible to everyone. This has challenged the industry to offer more value beyond just connectors. (5:53)
Alexey: Are you still actively working as a data engineer, or what does your role look like now? (7:33)
DLT Plus vision and partnership outreach for freelancers
Adrian: Right now, I’m focused on DLT Plus, figuring out how to make it a meaningful product. We’re building a data platform around DLT, incorporating best practices and innovation. (7:45)
Alexey: So your work involves figuring out how DLT Plus will function? (8:47)
Adrian: Yes, as a founder, my role involves exploring initiatives, listening to feedback, and testing ideas. I’m also working on building a partnership network to maximize value for customers. (8:55)
Alexey: For freelancers listening, how can they reach out to learn more about this partnership network? (9:50)
Adrian: They can reach out on LinkedIn or find partnership links on our website. This is mainly for people already using DLT. (10:07)
Alexey: You’ve been in the data engineering industry for over a decade. How has the industry changed over the past five years? (10:21)
Industry shift toward specialization: governance, data quality, streaming
Adrian: Five years ago, there was a shortage of data engineers, and anyone who could do basic integration was called a data engineer. Now, the field is becoming more specialized. (11:03)
Adrian: We’re seeing data engineers focus on areas like data governance, data quality, and streaming. These specializations often have little overlap, driven by industry requirements like handling sensitive data or new use cases like energy management. (11:03)
Alexey: For junior data engineers, is it harder to enter the field now? (12:17)
Early-career opportunities: AI projects and startup hiring
Adrian: It depends. There’s still opportunity, especially in AI. Many companies are exploring AI and need data engineers to support those efforts. Startups also need help building modern data stacks, which has become easier with tools like DLT. (12:37)
Alexey: You mentioned the modern data stack. How modern is it, and what does it entail? (14:08)
Modern data stack critique and open-source "postmodern" alternatives
Adrian: The modern data stack is largely marketing. It’s a package of software from various vendors that you can combine to build a data platform. (14:32)
Adrian: For example, Fivetran with Snowflake and Looker. Vendors needed a way to sell together, and the modern data stack concept was born. It’s effective marketing, but not necessarily modern. (14:32)
Adrian: Now, people are talking about the postmodern data stack, which uses open-source technologies to achieve better efficiency and lower costs. (14:32)
Alexey: Fivetran, Snowflake, and Looker aren’t open source, right? (15:56)
Adrian: Correct, but you can build similar stacks with open-source tools. (16:04)
Alexey: What trends do you think we’ll see more of in 2025 and beyond? (16:15)
2025 trends: AI integration in data engineering and Apache Iceberg adoption
Adrian: AI will continue to grow, with more complex use cases and fewer hallucinations. AI is entering the data engineering space, with data engineers building AI agents. (16:40)
Adrian: Another trend is Apache Iceberg. It’s moving from hype to production deployments, especially with Pythonic implementations. (16:40)
Alexey: Can you explain what Apache Iceberg is? (18:08)
Apache Iceberg explained: table format, Parquet storage, vendor lock-in reduction
Adrian: Iceberg is a table format that simulates the storage layer of a SQL database. It’s a way of storing data independently of databases, allowing updates without rewriting the entire dataset. (18:17)
Alexey: So it’s a bunch of Parquet files on S3 with metadata? (19:03)
Adrian: Yes, it’s similar to Delta and Hudi. The industry is excited because it breaks vendor lock-in, but vendors are still trying to capture value through catalogs. (19:11)
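To make the “Parquet plus metadata” point tangible, here is a hedged sketch of reading an Iceberg table with PyIceberg. The catalog name, table identifier, and filter are assumptions for illustration, and the catalog itself (AWS Glue, a REST catalog, etc.) must already be configured.

```python
from pyiceberg.catalog import load_catalog

# Load an already-configured catalog; "default" and "analytics.events"
# are illustrative names
catalog = load_catalog("default")
table = catalog.load_table("analytics.events")

# Iceberg's metadata tells the engine which Parquet files on object storage
# to read, so a filtered scan can skip files instead of re-listing or
# rewriting the whole dataset
arrow_table = table.scan(row_filter="event_date >= '2024-01-01'").to_arrow()
print(arrow_table.num_rows)
```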
Alexey: You mentioned databases have four layers: storage, compute, access, and metadata. Can you clarify what a catalog is? (21:10)
Database layers and catalog role: storage, compute, access, metadata & lineage
Adrian: A catalog maps data to compute and manages access control. Some catalogs also handle metadata, like lineage, which helps track data usage. (21:27)
Alexey: What tools are used for catalogs? (23:36)
Metadata and catalog tooling overview (AWS Glue and peers)
Adrian: I don’t recall all the names, but tools like AWS Glue Catalog are examples. (23:41)
Alexey: What about DuckDB? Will we see more “on-laptop” data warehouses? (23:47)
DuckDB impact: embeddable local OLAP and portable query engine
Adrian: DuckDB is amazing and a key technology for us. It’s embeddable, meaning you can use it as a building block in your own product. (25:58)
Adrian: We use DuckDB in DLT to query data through a universal interface, whether it’s a file system, data lake, or SQL database. (25:58)
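A small example of the embedded use Adrian mentions: DuckDB running in-process and querying Parquet files directly. The bucket path and column names are placeholders.

```python
import duckdb

# DuckDB runs in-process, so it can be embedded in a pipeline or a product
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # enables s3:// paths

# The same SQL interface works over local files, object storage, or an
# attached database; the bucket and columns below are placeholders
result = con.sql("""
    SELECT country, count(*) AS events
    FROM read_parquet('s3://my-bucket/raw/events/*.parquet')
    GROUP BY country
    ORDER BY events DESC
""").fetchall()
print(result)
```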
Alexey: Is DuckDB changing how we do data engineering? (27:32)
Cost-efficient pipelines: DuckDB with GitHub Actions and headless table formats
Adrian: Absolutely. People are challenging the high costs of vendor solutions. I’ve seen setups using DuckDB and GitHub Actions to run entire data stacks for cents per month. (27:40)
Alexey: So DuckDB allows us to process data locally and save results back to storage? (28:23)
Adrian: Yes, and because it’s portable, you can take advantage of cheaper compute options like GitHub Actions. (28:44)
Alexey: Is DuckDB related to headless table formats? (29:26)
Adrian: Yes, DuckDB provides a local access layer, which is useful for data pipelines. (29:33)
Alexey: You mentioned headless table formats with DLT. Is that where you’re heading? (30:18)
Headless table formats and DLT support for Delta Lake and Iceberg
Adrian: Yes, we’re already serving headless Delta Lake and working on similar solutions for Iceberg. (30:31)
Alexey: Instead of processing everything in Snowflake and paying a fortune. (31:23)
Adrian: Exactly. (31:28)
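For readers who want to see what “headless” looks like in practice, here is a rough sketch. It assumes a recent DLT release with Delta table format support in the filesystem destination (plus the deltalake package installed); the bucket configuration and names are illustrative, not a definitive recipe.

```python
import dlt

rows = [{"id": 1, "amount": 42.0}, {"id": 2, "amount": 17.5}]

# Write straight to object storage as a Delta table -- no warehouse compute
# in the path. The bucket_url for the filesystem destination is assumed to
# be set via dlt config or environment variables.
pipeline = dlt.pipeline(
    pipeline_name="headless_demo",
    destination="filesystem",
    dataset_name="lake",
)
pipeline.run(rows, table_name="payments", table_format="delta")
```

Any engine that can read Delta (DuckDB, Spark, Polars, and so on) can then query the table without going through a warehouse.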
dbt's influence on engineering workflows and alternatives like SQLMesh
Alexey: How did dbt change data engineering? (31:29)
Adrian: dbt changed how people think about data engineering. It eliminated boilerplate code and improved project quality. (32:52)
Alexey: What about alternatives like SQLMesh? (34:08)
Adrian: Competition is healthy. dbt was first, but tools like SQLMesh are doubling down on what dbt core offered. (34:14)
Alexey: What about workflow orchestration? What should we use in 2025? (35:21)
Workflow orchestration options in 2025: Airflow, Prefect, Dagster, GitHub Actions
Adrian: It depends on your team. Airflow is a common choice, but tools like Prefect and Dagster are also popular. We often use GitHub Actions for its cost efficiency. (35:37)
Alexey: For simple workflows, GitHub Actions is sufficient? (37:00)
Adrian: Yes, it’s serverless and much cheaper than always-on orchestrators. (37:08)
Alexey: Let’s move to audience questions. (37:21)
Adrian: For someone pursuing multiple disciplines like data engineering, data science, and AI engineering, I recommend focusing on one area first. (37:38)
AI engineering convergence: data engineers building AI agents
Alexey: AI engineering often overlaps with data engineering, especially when building systems like chatbots. (38:02)
Adrian: AI requires data, algorithms, and semantics. Data engineers will play a key role in building AI agents. (38:38)
Alexey: With so many tools in data engineering, how should beginners choose what to learn? (40:13)
Beginner roadmap: SQL, Python, capturing business requirements, building a portfolio
Adrian: Focus on understanding the concepts and solving problems. Tools are secondary. Learn SQL, Python, and how to capture business requirements. (41:06)
Alexey: For building a portfolio, is it okay to pick any tool for each component? (43:03)
Adrian: Yes, but consider the end user. Tools like notebooks or dashboards should align with how your audience interacts with data. (43:28)
Alexey: So, try things and see what works. (44:38)
Tool selection guidance and vendor caution for modern data stacks
Adrian: Exactly. The modern data stack is interchangeable, but be cautious of vendors with questionable practices. (44:42)
Alexey: How challenging is it for a senior backend engineer to transition into data engineering? (45:11)
Transition paths: senior backend engineers moving into data engineering
Adrian: It’s relatively easy. The main gap is understanding the business case and requirements. (45:56)
Alexey: So, a senior backend engineer can transition into a senior data engineering role. (47:04)
Adrian: Yes, unless they’re aiming for something highly specialized. (47:17)
Alexey: If you want to position yourself as a Spark expert, it might take time, but solving data problems is more about engineering skills. (47:32)
Adrian: If you’re a junior claiming to be a Spark expert, I’d be skeptical. (47:45)
Job market outlook: senior vs junior data engineering opportunities
Alexey: What’s the job market like for data engineers? (48:04)
Adrian: For seniors, there’s no shortage of jobs. Juniors might need to explore related roles like BI or data science to get started. (48:04)
Alexey: What are Delta and Hudi? (49:00)
Table format comparisons: Delta, Hudi, and Iceberg differences
Adrian: They’re similar to Iceberg but differ in design and optimization. Delta is the most mature, while Hudi is more specialized. (49:42)
Alexey: Are these formats suitable for streaming? (50:40)
Streaming architectures and tools: micro-batching, Kafka, SQS, Flink
Adrian: Yes, but streaming is often just micro-batching unless you have strict SLAs. (51:19)
Alexey: What tools are used for streaming? (52:31)
Adrian: Kafka and SQS are common buffers. Downstream, tools like Flink or DuckDB can process the data. (52:31)
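To illustrate the micro-batching point, here is a hedged sketch that drains a small batch from Kafka and writes it out as Parquet with DuckDB. The broker address, topic, and batch size are placeholders, and a real pipeline would add error handling and offset management.

```python
import json

import duckdb
import pyarrow as pa
from confluent_kafka import Consumer

# Kafka is the buffer; we drain it in small batches, which is what most
# "streaming" pipelines without strict SLAs boil down to.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "microbatch-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])              # placeholder topic

batch = []
while len(batch) < 1000:                    # placeholder batch size
    msg = consumer.poll(1.0)
    if msg is None:
        break                               # buffer drained for now
    if msg.error():
        continue
    batch.append(json.loads(msg.value()))
consumer.close()

# Write the micro-batch to Parquet; downstream engines (DuckDB, Flink jobs,
# a warehouse) can pick it up from there.
if batch:
    con = duckdb.connect()
    con.register("batch", pa.Table.from_pylist(batch))
    con.execute("COPY (SELECT * FROM batch) TO 'events_batch.parquet' (FORMAT PARQUET)")
```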
Alexey: Will data engineering be automated by AI? (54:03)
AI-driven commoditization and code generation in data engineering
Adrian: AI is speeding up commoditization, making it easier to generate code. Data engineers will specialize further and build AI agents. (56:15)
Alexey: What’s the future of DLT? (58:20)
DLT roadmap: DLT Plus and a marketplace for reusable data products
Adrian: In one year, we’re focusing on DLT Plus, a portable data platform. In five years, we aim to create a marketplace for data products, enabling reuse across organizations. (59:42)
Episode wrap-up and key takeaways
Alexey: It’s been great talking to you, Adrian. Thanks for sharing your insights. (1:01:19)
Adrian: Thanks, Alexey. See you later. (1:02:15)
Alexey: Goodbye, everyone. (1:02:17)