
Trends in Data Engineering

Season 20, episode 3 of the DataTalks.Club podcast with Adrian Brudaru


Transcript

The transcripts are edited for clarity, sometimes with AI. If you notice any incorrect information, let us know.

Introduction and Adrian’s background

Alexey: This week, we’ll talk about trends in data engineering. Our special guest today is Adrian, a returning guest. This is his third time on the podcast, but he’s also been part of workshops and open-source demos. Many of you probably know him already. (1.0)

Alexey: Adrian is the co-founder of DLT Hub, the company behind DLT, which we’ll discuss today. We’ll also talk about trends in data engineering. When we launched our Data Engineering Zoom Camp recently, one of the questions from participants was, “How do you see data engineering evolving, and what are the trends?” I realized I couldn’t answer that, so I thought of Adrian. (1.0)

Alexey: Welcome to the podcast, Adrian. It’s a pleasure to have you here again. (1.0)

Adrian: The pleasure is mine. I don’t have all the answers, but I can share my perspective and observations. (2:23)

Alexey: That’s what makes it interesting. Everyone has opinions, and you’re much closer to data engineering than I am. (2:35)

Alexey: As always, the questions for today’s interview were prepared by Johanna Bayer. Thank you, Johanna, for your help. (2:35)

Alexey: Before we dive into trends, let’s start with your background. Can you briefly share your career journey? (2:35)

Adrian’s career journey and founding DLT

Adrian: Sure. I started in data in 2012. I spent five years working at startups in Berlin, building data warehouses. Then I joined an enterprise but didn’t enjoy it, so I became a freelancer for five years. (3:10)

Adrian: After freelancing, I wanted to do more than just consulting. I didn’t enjoy managing an agency, so I decided to build DLT, a tool I wish I had as a data engineer. That’s the short version. (3:10)

Alexey: You mentioned DLT is a tool you wish you had as a data engineer. Do you use it often? (3:56)

Adrian: I use it all the time. As a data engineer, much of your work involves data ingestion—taking data from semi-structured formats like JSON and transforming it into a structured format. DLT automates this process, so there’s no reason to use anything else. (4:03)
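For illustration, here is a minimal sketch of the pattern Adrian describes, using dlt to load nested JSON records into DuckDB. The pipeline, dataset, and table names are arbitrary placeholders; dlt infers the schema and flattens nested fields automatically.

```python
# Minimal dlt ingestion sketch: semi-structured JSON in, typed tables out.
# Pipeline, dataset, and table names are arbitrary examples.
import dlt

# Nested, semi-structured input; dlt infers the schema and unpacks the
# nested "address" dict into columns (lists would become child tables).
users = [
    {"id": 1, "name": "Ada", "address": {"city": "Berlin", "zip": "10115"}},
    {"id": 2, "name": "Grace", "address": {"city": "Paris", "zip": "75001"}},
]

pipeline = dlt.pipeline(
    pipeline_name="users_pipeline",
    destination="duckdb",  # a local DuckDB file as the destination
    dataset_name="raw_users",
)

# Run the load; dlt creates the tables and tracks state between runs.
info = pipeline.run(users, table_name="users")
print(info)
```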

Alexey: Do you remember when we last spoke on the podcast? (4:26)

Adrian: Just over a year ago, I think. (4:31)

Alexey: Yes, over a year ago. You were already at DLT, and we talked about your journey from freelancing to founding a company. (4:33)

Alexey: Since then, a lot has changed. I’ve enjoyed visiting your office and seeing your team grow. You’ve moved to a bigger space, and it’s been amazing to witness your progress. (4:33)

Alexey: From an outsider’s perspective, it’s been exciting. Can you share what’s been happening internally over the past year? (4:33)

Adrian: We started by squatting in other people’s offices, but now we’ve grown significantly. Last year, I talked about creating a standard for data ingestion with DLT, and I think we’ve succeeded. (5:53)

Adrian: DLT has become a standard for Python-based data ingestion. We’ve commoditized this market, making it accessible to everyone. This has challenged the industry to offer more value beyond just connectors. (5:53)

Adrian’s current role and focus on DLT Plus

Alexey: Are you still actively working as a data engineer, or what does your role look like now? (7:33)

Adrian: Right now, I’m focused on DLT Plus, figuring out how to make it a meaningful product. We’re building a data platform around DLT, incorporating best practices and innovation. (7:45)

Alexey: So your work involves figuring out how DLT Plus will function? (8:47)

Adrian: Yes, as a founder, my role involves exploring initiatives, listening to feedback, and testing ideas. I’m also working on building a partnership network to maximize value for customers. (8:55)

Alexey: For freelancers listening, how can they reach out to learn more about this partnership network? (9:50)

Adrian: They can reach out on LinkedIn or find partnership links on our website. This is mainly for people already using DLT. (10:07)

Changes in the data engineering industry over the past five years

Alexey: You’ve been in the data engineering industry for over a decade. How has the industry changed over the past five years? (10:21)

Adrian: Five years ago, there was a shortage of data engineers, and anyone who could do basic integration was called a data engineer. Now, the field is becoming more specialized. (11:03)

Adrian: We’re seeing data engineers focus on areas like data governance, data quality, and streaming. These specializations often have little overlap, driven by industry requirements like handling sensitive data or new use cases like energy management. (11:03)

Alexey: For junior data engineers, is it harder to enter the field now? (12:17)

Adrian: It depends. There’s still opportunity, especially in AI. Many companies are exploring AI and need data engineers to support those efforts. Startups also need help building modern data stacks, which has become easier with tools like DLT. (12:37)

The modern data stack and its evolution

Alexey: You mentioned the modern data stack. How modern is it, and what does it entail? (14:08)

Adrian: The modern data stack is largely marketing. It’s a package of software from various vendors that you can combine to build a data platform. (14:32)

Adrian: For example, Fivetran with Snowflake and Looker. Vendors needed a way to sell together, and the modern data stack concept was born. It’s effective marketing, but not necessarily modern. (14:32)

Adrian: Now, people are talking about the postmodern data stack, which uses open-source technologies to achieve better efficiency and lower costs. (14:32)

Alexey: Fivetran, Snowflake, and Looker aren’t open source, right? (15:56)

Adrian: Correct, but you can build similar stacks with open-source tools. (16:04)

Alexey: What trends do you think we’ll see more of in 2025 and beyond? (16:15)

Adrian: AI will continue to grow, with more complex use cases and fewer hallucinations. AI is entering the data engineering space, with data engineers building AI agents. (16:40)

Adrian: Another trend is Apache Iceberg. It’s moving from hype to production deployments, especially with Pythonic implementations. (16:40)

What is Apache Iceberg and its role in data engineering

Alexey: Can you explain what Apache Iceberg is? (18:08)

Adrian: Iceberg is a table format that simulates the storage layer of a SQL database. It’s a way of storing data independently of databases, allowing updates without rewriting the entire dataset. (18:17)

Alexey: So it’s a bunch of Parquet files on S3 with metadata? (19:03)

Adrian: Yes, it’s similar to Delta and Hudi. The industry is excited because it breaks vendor lock-in, but vendors are still trying to capture value through catalogs. (19:11)
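To make that concrete, here is a hedged sketch of reading an Iceberg table from Python with pyiceberg. The catalog configuration, table identifier, and filter column are placeholder assumptions.

```python
# Sketch of reading an Iceberg table with pyiceberg. The catalog name,
# URI, table identifier, and filter column are hypothetical; in practice
# they come from your catalog configuration.
from pyiceberg.catalog import load_catalog

# A REST catalog is one common option; Glue and Hive are others.
catalog = load_catalog("demo", **{"type": "rest", "uri": "http://localhost:8181"})

table = catalog.load_table("analytics.events")

# A scan plans against the table's metadata and reads only the Parquet
# files it needs; to_arrow() materializes the result locally.
rows = table.scan(row_filter="event_date >= '2025-01-01'").to_arrow()
print(rows.num_rows)
```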

Understanding database layers: storage, compute, access, and metadata

Alexey: You mentioned databases have four layers: storage, compute, access, and metadata. Can you clarify what a catalog is? (21:10)

Adrian: A catalog maps data to compute and manages access control. Some catalogs also handle metadata, like lineage, which helps track data usage. (21:27)

Tools for data catalogs and metadata management

Alexey: What tools are used for catalogs? (23:36)

Adrian: I don’t recall all the names, but tools like AWS Glue Catalog are examples. (23:41)
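As a small, hedged example, pyiceberg can point at the AWS Glue Catalog; this assumes AWS credentials and region are already configured in the environment.

```python
# Hedged sketch: AWS Glue as the Iceberg catalog via pyiceberg.
# Assumes AWS credentials and region are configured in the environment.
from pyiceberg.catalog import load_catalog

glue_catalog = load_catalog("glue_example", **{"type": "glue"})
print(glue_catalog.list_namespaces())  # databases registered in Glue
```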

DuckDB and its impact on data engineering

Alexey: What about DuckDB? Will we see more “on-laptop” data warehouses? (23:47)

Adrian: DuckDB is amazing and a key technology for us. It’s embeddable, meaning you can use it as a building block in your own product. (25:58)

Adrian: We use DuckDB in DLT to query data through a universal interface, whether it’s a file system, data lake, or SQL database. (25:58)
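Embedding DuckDB really is just a library call; the file path below is a placeholder.

```python
# DuckDB as an embedded engine: no server process, just an import.
# The Parquet path is a placeholder; DuckDB reads the files directly.
import duckdb

result = duckdb.sql("""
    SELECT city, count(*) AS n
    FROM read_parquet('data/users/*.parquet')
    GROUP BY city
    ORDER BY n DESC
""")
print(result.df())  # materialize to a pandas DataFrame
```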

Alexey: Is DuckDB changing how we do data engineering? (27:32)

Adrian: Absolutely. People are challenging the high costs of vendor solutions. I’ve seen setups using DuckDB and GitHub Actions to run entire data stacks for cents per month. (27:40)

Alexey: So DuckDB allows us to process data locally and save results back to storage? (28:23)

Adrian: Yes, and because it’s portable, you can take advantage of cheaper compute options like GitHub Actions. (28:44)
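A hedged sketch of that pattern with DuckDB’s httpfs extension: read from object storage, aggregate locally, and copy the result back. The bucket paths are placeholders, and S3 credentials are assumed to be configured separately.

```python
# Hedged sketch of the "portable compute" pattern: pull raw data from
# object storage, aggregate locally, write the result back. Bucket paths
# are placeholders; S3 credentials are assumed to be set up separately.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # adds S3 support
con.execute("LOAD httpfs")

con.execute("""
    COPY (
        SELECT order_date, sum(amount) AS revenue
        FROM read_parquet('s3://my-bucket/raw/orders/*.parquet')
        GROUP BY order_date
    )
    TO 's3://my-bucket/marts/daily_revenue.parquet' (FORMAT PARQUET)
""")
```

The same script runs unchanged on a laptop or a scheduled GitHub Actions runner, which is where the cost savings come from.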

Headless table formats and their significance

Alexey: Is DuckDB related to headless table formats? (29:26)

Adrian: Yes, DuckDB provides a local access layer, which is useful for data pipelines. (29:33)

Alexey: You mentioned headless table formats with DLT. Is that where you’re heading? (30:18)

Adrian: Yes, we’re already serving headless Delta Lake and working on similar solutions for Iceberg. (30:31)

Alexey: Instead of processing everything in Snowflake and paying a fortune. (31:23)

Adrian: Exactly. (31:28)
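A heavily hedged sketch of what “headless” looks like with dlt today: recent versions expose a table_format option so the filesystem destination writes Delta tables straight to a directory or bucket, with no warehouse involved. The option name and the deltalake dependency should be verified against the dlt docs for your version.

```python
# Heavily hedged sketch: writing a headless Delta table with dlt's
# filesystem destination. The table_format option reflects recent dlt
# versions and requires the deltalake package; verify against the docs.
import dlt

@dlt.resource(table_format="delta")
def events():
    yield [{"id": 1, "kind": "click"}, {"id": 2, "kind": "view"}]

pipeline = dlt.pipeline(
    pipeline_name="headless_demo",
    destination=dlt.destinations.filesystem("_lake"),  # local dir; s3://... also works
    dataset_name="lake",
)
pipeline.run(events())
```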

How dbt changed data engineering

Alexey: How did dbt change data engineering? (31:29)

Adrian: dbt changed how people think about data engineering. It eliminated boilerplate code and improved project quality. (32:52)

Alexey: What about alternatives like SQLMesh? (34:08)

Adrian: Competition is healthy. dbt was first, but tools like SQLMesh are doubling down on what dbt Core offered. (34:14)

Workflow orchestration tools in 2025

Alexey: What about workflow orchestration? What should we use in 2025? (35:21)

Adrian: It depends on your team. Airflow is a common choice, but tools like Prefect and Dagster are also popular. We often use GitHub Actions for its cost efficiency. (35:37)

Alexey: For simple workflows, GitHub Actions is sufficient? (37:00)

Adrian: Yes, it’s serverless and much cheaper than always-on orchestrators. (37:08)

Alexey: Let’s move to audience questions. (37:21)

Adrian: For someone pursuing multiple disciplines like data engineering, data science, and AI engineering, I recommend focusing on one area first. (37:38)

Alexey: AI engineering often overlaps with data engineering, especially when building systems like chatbots. (38:02)

Adrian: AI requires data, algorithms, and semantics. Data engineers will play a key role in building AI agents. (38:38)

Building a portfolio and choosing tools as a beginner

Alexey: With so many tools in data engineering, how should beginners choose what to learn? (40:13)

Adrian: Focus on understanding the concepts and solving problems. Tools are secondary. Learn SQL, Python, and how to capture business requirements. (41:06)

Alexey: For building a portfolio, is it okay to pick any tool for each component? (43:03)

Adrian: Yes, but consider the end user. Tools like notebooks or dashboards should align with how your audience interacts with data. (43:28)

Alexey: So, try things and see what works. (44:38)

Adrian: Exactly. The modern data stack is interchangeable, but be cautious of vendors with questionable practices. (44:42)

Alexey: How challenging is it for a senior backend engineer to transition into data engineering? (45:11)

Adrian: It’s relatively easy. The main gap is understanding the business case and requirements. (45:56)

Alexey: So, a senior backend engineer can transition into a senior data engineering role. (47:04)

Adrian: Yes, unless they’re aiming for something highly specialized. (47:17)

Alexey: If you want to position yourself as a Spark expert, it might take time, but solving data problems is more about engineering skills. (47:32)

Adrian: If you’re a junior claiming to be a Spark expert, I’d be skeptical. (47:45)

Alexey: What’s the job market like for data engineers? (48:04)

Adrian: For seniors, there’s no shortage of jobs. Juniors might need to explore related roles like BI or data science to get started. (48:04)

Delta and Hudi: alternatives to Apache Iceberg

Alexey: What are Delta and Hudi? (49:00)

Adrian: They’re similar to Iceberg but differ in design and optimization. Delta is the most mature, while Hudi is more specialized. (49:42)

Alexey: Are these formats suitable for streaming? (50:40)

Adrian: Yes, but streaming is often just micro-batching unless you have strict SLAs. (51:19)

Streaming data tools and their use cases

Alexey: What tools are used for streaming? (52:31)

Adrian: Kafka and SQS are common buffers. Downstream, tools like Flink or DuckDB can process the data. (52:31)
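To make “streaming is often just micro-batching” concrete, here is a hedged sketch with the kafka-python client: messages are buffered and flushed in fixed-size batches. The topic name, broker address, batch size, and the downstream step are all assumptions.

```python
# Hedged micro-batching sketch using the kafka-python package. Topic,
# broker, and batch size are assumptions; the downstream step is a stub.
import json
from kafka import KafkaConsumer

def process_batch(rows):
    # Placeholder downstream step; in practice this might write a
    # Parquet file or load the rows via dlt or DuckDB.
    print(f"flushing {len(rows)} rows")

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

BATCH_SIZE = 500
batch = []

for message in consumer:
    batch.append(message.value)
    if len(batch) >= BATCH_SIZE:
        process_batch(batch)
        batch.clear()
```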

The impact of AI on data engineering roles

Alexey: Will data engineering be automated by AI? (54:03)

Adrian: AI is speeding up commoditization, making it easier to generate code. Data engineers will specialize further and build AI agents. (56:15)

The future of DLT and its role in the data ecosystem

Alexey: What’s the future of DLT? (58:20)

Adrian: In one year, we’re focusing on DLT Plus, a portable data platform. In five years, we aim to create a marketplace for data products, enabling reuse across organizations. (59:42)

Alexey: It’s been great talking to you, Adrian. Thanks for sharing your insights. (1:01:19)

Adrian: Thanks, Alexey. See you later. (1:02:15)

Alexey: Goodbye, everyone. (1:02:17)
