
DataTalks.Club

Season 23, Episode 2

Foundations of Analytics Engineer Role: Skills, Scope, and Modern Practices | Juan Manuel Perafan

Transcript

The transcripts are edited for clarity, sometimes with AI. If you notice any incorrect information, let us know.

Juan’s psychological research and transition to data

Alexey: Welcome to this podcast interview. I want to ask you to tell us a few words about yourself and your career journey. How did you start and what did you do? Maybe you can tell us what you studied and how you ended up doing what you do now? (0.0)

Juan: I am Juan. I was born and raised in Colombia, but I have been living in the Netherlands for almost fifteen years now. It has been quite a journey. Professionally speaking, I studied psychology ten plus years ago. Back then I was super interested in psychological research, in experimentation and stuff like that. (19.0)

Juan: The truth is I realized before graduating that maybe the field of data and the field of programming encapsulated more of what I wanted from a career than doing psychological research. Somewhere during my master's I decided to start learning programming, and that led me down this rabbit hole. (31.0)

Alexey: Just a quick thing, because there are different kinds of psychology. In your case, were you doing studies like conducting controlled experiments? (1:09)

Juan: I was not a "how are you feeling" and psychological problems type of person. I really enjoyed conducting experiments, as in connecting people to an electroencephalogram or fMRI or TMS or all sorts of technologies to study the brain. (1:20)

Alexey: You had to know statistics quite well to conduct these studies? (1:39)

Juan: Actually yes. It is very different. I think it is funny: when you are doing statistics on a very small sample size of eighty people, things like inference become a lot more important. When you have streams, event data or millions of clients, it is not about inference. You already have a lot of the population in there. (1:51)

Juan: Technically that was a soft introduction to data but the truth is eventually I worked my way into building dashboards. That was probably my first career. I did that for a couple of years. I used to be a Tableau consultant until I realized that what I really wanted to do was everything database related. (2:16)

Juan: I really like everything from data modeling and writing SQL; even the administration part I thought was quite interesting. For the last seven or eight years I have been doing mostly SQL work. What is really interesting about analytics engineering in particular is how recent it is. We are going to go through the history of analytics engineering, but for context, the first time somebody coined that term was mid-2019. (2:38)

Juan: In early 2020, a recruiter reached out to me from a company where I had actually already applied as a data engineer. Back then the conclusion was that I was too much on the analytics side to go down the rabbit hole of data engineering. He reached out to me and said he had the perfect role for me, called analytics engineer. Back then I had no idea what it meant. (3:14)

Juan: It was a really cool opportunity. I was the first analytics engineer at a company back in 2020. Back then on Dutch LinkedIn, there was absolutely nobody else with that job title. I always say I am definitely not the best analytics engineer out there, but I was certainly one of the first who took it on themselves to co-build this whole thing. (3:39)

Juan: It has been really nice, because in the very first years, like 2021 and even 2022, a big part of the role was to educate the market on what exactly this is, how we do it, the technologies, and why this is valuable. I had to explain it to my own colleagues. Back then I founded the first analytics engineering meetup. I became the co-founder of the Dutch DBT meetup as well. (4:08)

Riding the wave: The early days of analytics engineering

Juan: Even Packt reached out to me to write a book. I wrote it with some of my colleagues from back then. It is called Fundamentals of Analytics Engineering. It was a little bit of being at the right place at the right time. Just being able to ride that wave of analytics engineering so early brought up quite a lot of very interesting opportunities. (4:33)

Juan: It is a very interesting role. Even now preparing for this webinar this whole week I have been watching all sorts of YouTube videos of people discussing it. You see everything from people with takes that I do not fully agree with to people with takes that surprise me. There is lots of skepticism. (5:01)

Juan: I think this is an amazing opportunity to discuss a little bit what it is and what it isn't. I want to explain to everybody why it is an interesting career choice. Hopefully that gets them excited for module number four of the upcoming Data Engineering Zoomcamp. (5:28)

Alexey: The story of analytics engineering started for us with Victoria, who is the instructor we mentioned. When we had just started the DataTalks.Club Slack, I had this habit of talking to everyone who joined. She was among the first few hundred people. Since the volume wasn't huge, I was able to talk to everyone, and she was one of these few people. (5:41)

Alexey: I was asking her what she did and she was basically describing analytics engineering. I wondered what the hell that was. I know who analysts are and I know who data engineers are, but this was something new. This encounter through Slack turned out to be very useful when we started the data engineering course. (6:13)

Alexey: Before we started the course, we ran a poll on what people wanted to see in it, and the top-wanted thing was analytics engineering. People wanted to see analytics engineering as a module in the data engineering course. We were wondering what that was. Victoria back then volunteered to take the lead on that part, which was pretty cool. (6:40)

Alexey: From what I understood back then, analytics engineering is the role that sits between analysts and engineers. This is an analyst who can do a bit of engineering, who doesn't need to wait for engineers to implement some things, but who also has the capacity of an analyst to explain different things to stakeholders. This is a person in between, and the main tool for analytics engineers is DBT. How correct is this assessment, in your opinion? (7:10)

Juan: It is a short summary but I do think there are some nuances that are very often misinterpreted which we can build on. Before I start explaining what it is, I think it is really interesting to maybe share the story. (7:43)

Breaking down the gap between analysts and engineers

Juan: The reason I am here is that back when I was on paternity leave a couple of years back, I actually followed the Zoomcamp myself. I probably learned most of the Terraform that I know from you. Back then I was already really good at DBT, but I did see Victoria's course. I thought it was so interesting that I reached out to her and asked if she wanted a copy of the book. (7:56)

Juan: We were really curious what she would think. That is a little bit of how it started. She is a really interesting profile. Summarizing analytics engineering, the first thing that I like to mention is as a field we make the mistake of defining it by what it isn't. (8:20)

Juan: Quite often when you are asked what an analytics engineer is, you hear something about bridging the gap between analysts and data engineers. In fact, my first real definition used to be: anything that a data engineer and an analyst cannot or don't want to do. That is not a great thing to let yourself be defined by. (8:42)

Alexey: I thought people just wanted to have analysts who also are good at engineering. Data engineers don't do this? The reason I have this idea is because I was a data scientist. As a data scientist, we don't like engineering or deployment. (9:06)

Alexey: On average we don't like deploying models or going deep and thinking why builds take forever or optimizing things. Engineers were always busy. I wanted to deploy something but then I needed somebody who could help me with Kubernetes and they were always busy. So I had to learn Kubernetes myself to deploy it. (9:24)

Alexey: I thought that maybe analytics engineering appeared in the same way. Analysts were forced to wait for data engineers to finish something, but they were busy with something else. They thought about just going ahead and learning whatever needed to happen to get data from the raw source to the data warehouse so that they could continue working. (9:54)

Juan: That is literally how I started. In my first job, when I was building dashboards, the DBA was not very pleasant and he was always super busy. Very quickly I learned that the better I became with SQL, Python and doing my own shadow IT on my own computer, the less I needed from the external IT team. In my case it was definitely a reaction to those needs. (10:25)

The art of turning business reality into clean data

Juan: Going back to what exactly it is, I think it is nice to break it into two pillars. The first one is what we do. Summarized in one phrase, that is taking the business reality and making sure that the data resembles it. (11:03)

Juan: Some people call it warehousing, data modeling, cleaning data sets or preparing data sets for insights. Being honest, this activity of turning data into something that can be used for insights has existed since the seventies, maybe earlier back when they had punch cards. One of the big things that the skeptics of this role say is that they have been doing this since the eighties, so what is the difference? (11:15)

Juan: To me, the difference is not the what, but the how. That is very similar to what happened with software and data engineering: the emphasis started shifting more to the engineering part. Back in the nineties and two thousands, the data landscape was the Wild West. (11:45)

Juan: There was a lot of just making sure that things work. I think as a field, data has learned a lot from software engineering and DevOps. Now there is a very big emphasis on being rigorous, building things that are robust, being efficient, and making things that scale. I like to say that as a field we were too busy building bigger and faster cars, and it is only now that the emphasis has shifted to making things safe. (12:09)

Juan: The short version is that an analytics engineer does a lot of analytics tasks, which generally speaking means data modeling, though it also implies a bit more than that. They always do it by applying software engineering best practices, similar to how data engineers do it. You could say it is glorified data cleaning, but in reality there is rigor and a scientific way of making things replicable. (12:47)

Juan: The how we exactly get things done is almost as important as what we actually do. (13:18)

Alexey: So analytics engineering has two components: analytics and engineering. Analytics is data modeling and other things, and engineering is applying best engineering practices to that. What exactly is the analytics part? My experience with analysts might be a bit limited, but from what I saw, analysts have very good domain knowledge and business expertise. (13:25)

Alexey: They talk to product managers and they help to understand where the company should move using data-driven approaches. If we are implementing a new feature, they define the success criteria for this feature. It should grow the number of users or the engagement of users. Typically analysts help to define what exactly these success criteria are. (14:03)

Alexey: Sometimes they also do ad hoc analytics like checking how many users we have from a specific country. You say you as an analytics engineer do a lot of analytics which is data modeling. This is not what I saw these people were doing. Maybe I am not correct. What exactly do you mean by data modeling and why do you call this analytics? (14:38)

Juan: The thing is that what exactly is implied under the umbrella of analytics tends to get a bit fuzzy. Some people who wear this analytics engineer hat are just what you would have called data warehouse developers some years ago. Some actually resemble more what you would call a BI engineer, who prepares data sets for consumption. I still think there is a difference, which is why this is not just a regular analyst job. (15:03)

Juan: The philosophy and the way of doing it is a bit different. In the specific case of an analyst, I have seen people saying they have been BI engineers for years, building PowerBI or Tableau data sets since the two thousands, so why should they call it analytics engineering? I think although the tasks feel very similar, philosophically there is a big difference. (15:48)

Why data engineering is about safety, not just speed

Juan: If you check the Tableau or PowerBI sites, you are going to see a big emphasis on speed of delivery: analysis at the speed of thought, building dashboards super quickly. You are also going to see some emphasis on ease of use, analytics for everybody, and making analytics accessible. When you go to anything engineering, analytics engineering or data engineering, the emphasis changes to testability and robustness. (16:25)

Juan: You want things that do not break. This is super apparent when you compare the marketing language between DBT and Tableau. There is of course a market for doing things quickly. There is a market for ease of use, self service, giving data to everybody and empowering them. That is still a very valuable activity. (16:52)

Juan: But for some use cases that robustness is definitely worth having as the main thing. Some analysts simply do not have the technical skills, because there are analyst jobs that are ninety five percent business expertise and five percent Excel. Even the ones that are very technically inclined and very good at programming are still a bit too caught up in this philosophy of speed of delivery. They haven't really internalized things like the importance of testing your data. (17:12)

Juan: You can understand where this is coming from. Say something happened in our production system and we want to understand the effect it has. Typically an analyst would set up a dashboard showing the impact. In this case you want to do this quickly. You don't want to spend two days building this thing. You want to have it now. (18:00)

Alexey: This is important for some things, but for some other things, you'd rather take things slower? (18:24)

Juan: Correct. There is one thing that a lot of people don't acknowledge: there are a lot of analyses that are one-use-only, where having them done quickly is maybe more important than always being right. But sometimes having numbers that are wrong can make a company lose millions, and people can get harmed by it. (18:35)

Alexey: I imagine if a CEO comes and says he has a board meeting tomorrow and needs these numbers quickly because of an important strategic meeting. If an analyst makes a mistake which is a few million off, then that is a problem. Maybe for this sort of ad hoc request analysts need to have more time? (19:00)

Juan: For sure. That is a big one. I think the difference shows even when the analyst is really good. I have met some analysts that are really good at working with Python and SQL, where you might think they have the right qualifications, but it still feels a bit different. (19:13)

Juan: No worries. I thought the place where you might want to go next is what is the difference between this and a data engineer? (19:46)

Alexey: No, I was actually curious about data modeling. You said analytics engineers do a lot of analytics, which is data modeling. I was curious what exactly data modeling is. I come from a software engineering background, which is what I was doing before data science. There, a big topic was domain-driven design. (19:58)

Reimagining data modeling in the modern era

Alexey: You have some domain entities and then you express them with classes and object oriented programming. The challenge is you have these things and you want to map them to a database which is tables. How do you express all this complexity of your domain in table terms? This is difficult because you want to have the flexibility of object oriented programming but you want to save the data to a relational database. (20:21)

Alexey: I wonder what data modeling is in your case? (21:00)

Juan: I make the mistake of assuming that everybody is familiar with the term, but the truth is there are at least five or six different definitions of model. Even what a model is means a different thing to a data scientist. Setting the word itself aside, I am referring to exactly what you just mentioned. How exactly do you design a database? What constitutes a table? (21:08)

Juan: What is each row supposed to represent? How do you turn information into a tabular format? That all constitutes data modeling. It is kind of funny, because some people like to say that data modeling was dormant for a lot of years. From the seventies until the nineties, space in a database was such a big constraint that a lot of data modeling was about optimizing space. (21:42)

Juan: Normalizing a table to the highest level tends to be very impractical for an analyst but it does save space. (22:09)

Alexey: I remember normal forms. (22:22)

Juan: Correct. Definitely if anybody listening to this is interested in how to organize databases, googling the concept of database normalization can be really enlightening. We studied this at university. That is really what got me into this whole SQL thing. Back before the nineties people were busy with how to reduce space and make sure databases are as compact and efficient as possible. (22:28)
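To make the normalization idea concrete, here is a minimal sketch in Python with DuckDB; the tables and data are invented for illustration. The flat table repeats customer attributes on every order row, while the normalized form stores each customer exactly once, which is the space saving Juan describes:

```python
import duckdb

con = duckdb.connect()  # in-memory database

# A flat table: customer attributes are repeated on every order row.
con.execute("""
    CREATE TABLE raw_orders AS
    SELECT * FROM (VALUES
        (1, 'Ada',   'ada@example.com',   'keyboard', 40.0),
        (2, 'Ada',   'ada@example.com',   'mouse',    15.0),
        (3, 'Grace', 'grace@example.com', 'monitor', 180.0)
    ) AS t(order_id, customer_name, customer_email, product, amount)
""")

# Normalized: each customer is stored exactly once...
con.execute("""
    CREATE TABLE customers AS
    SELECT DISTINCT customer_email AS email, customer_name AS name
    FROM raw_orders
""")

# ...and orders only keep a reference to the customer.
con.execute("""
    CREATE TABLE orders AS
    SELECT order_id, customer_email, product, amount
    FROM raw_orders
""")

print(con.execute("SELECT * FROM customers ORDER BY email").fetchall())
```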

Juan: The two thousands came with Hadoop and early Spark, and people thought they did not need to care because they could brute-force this problem with compute. Lately there has been this movement of realizing that even if compute or space is not a huge constraint, we should still be modeling data. We are reorganizing data inside of a database into something that is easier for humans to maintain and meaningful for the business. (22:58)

Juan: There is nothing nicer for a business than to query a table called clients where all of the information about a client is in there, regardless of where it came from. It might just be seven source systems that you tried to Frankenstein together. If a database has exactly the same names that a business stakeholder expects, and if it is very simple and straightforward what each column is supposed to represent, then you are doing things really well. (23:41)
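As a hedged sketch of that Frankenstein work, here are two invented source systems conformed into the single clients table the business expects. Real projects also need deduplication and identity matching across systems, which is skipped here:

```python
import duckdb

con = duckdb.connect()

# Two hypothetical source systems, each with its own naming conventions.
con.execute("CREATE TABLE crm_contacts (contact_id INTEGER, full_name TEXT, mail TEXT)")
con.execute("CREATE TABLE shop_users (user_id INTEGER, name TEXT, email_address TEXT)")

# One clients table with the names a business stakeholder expects.
con.execute("""
    CREATE TABLE clients AS
    SELECT 'crm' AS source, contact_id AS client_id, full_name AS name, mail AS email
    FROM crm_contacts
    UNION ALL
    SELECT 'shop' AS source, user_id AS client_id, name, email_address AS email
    FROM shop_users
""")
```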

Juan: Bringing that business reality into a database is probably still a big pain point in many organizations? (24:15)

Alexey: For sure. The last company I worked at was an online marketplace. The way it was growing was by acquiring different marketplaces around the world. Each of these marketplaces was in a different country, like Ukraine, Poland, Argentina, South Africa, or India. They were all put together, but these were completely separate systems. (24:20)

Alexey: Now they needed to somehow unite. Imagine how difficult it would be to have one unified data table for all the items. (25:02)

Juan: What a lot of people don't realize is that turning both of those into one single client table is quite a task. It requires lots of librarian work, as in understanding all of the assets and classifying them to know what you really have. It is a bit of a business task, because you need to know which information is going to be useful. You go around asking people, can you explain to me how this works? (25:14)

Juan: This requires lots of mediation and stakeholder management. If you do data modeling, you often have to be in a room with very important people in an organization, sometimes even C-level. You are asking them how we should reconcile this data problem. That requires a very specific profile of person just to be able to do it. (25:52)

Juan: Most of you will have inferred that it also requires some SQL knowledge and some knowledge about data and databases. (26:23)

Alexey: Previously I think this job was done by data engineers. Is it not? (26:30)

To split or not to split: Finding the right team roles

Juan: That is another one of the big points of skepticism. A couple of months back I saw a post saying analytics engineering is dead because it is a data engineer's job. The truth is I don't care, because whatever works for your organization works. The reality is that before 2019, when this term was coined, more often than not this task was assigned to the data engineering team. (26:53)

Juan: Maybe even today you might be a data engineer still doing this part. Objectively in some organizations it makes a lot more sense to split this into two different roles. It is possible that in your organization it is fine that the data engineering team does both of these tasks. One is data infrastructure management and the other is data modeling. (27:06)

Juan: It has worked for a lot of years, but the truth is the expertise that you need for both tends to get large. Even if you find a person that is capable of doing both of those tasks, sometimes it is a matter of preference. I have met some data engineers that literally tell me they just want to hide behind their computer. If it is a matter of asking a C-level or a stakeholder how something works, they want someone else to do it. (27:38)

Juan: They want the scrum master or whoever it is to make sure they do it and not bother them with that. Even though a lot of data engineers are capable and have been doing this for ages, it is kind of nice to let some people handle data infrastructure and make sure all processes run. Some people do the modeling because it requires different expertise and different profiles. It is hard to find somebody that is equally qualified and interested in both. (28:12)

Alexey: Now I understand your definition or your comment at the beginning that analytics engineers do what analysts and data engineers don't want to do. I guess this is the part that data engineers don't want to do. They would rather move data in Kafka or focus on engineering work and performance metrics. Somebody else would take care of understanding how we merge these two tables. (28:42)

Juan: For sure. I know from experience that sometimes it is simply a matter of whether you like doing this or not. To me it is super important to have a very nice mix of the technical stuff but also the human side. (29:22)

Juan: Having a nice excuse to understand the business is amazing to me. I used to be a consultant for a long time and I had some gigs that were one hundred percent technical and some that were one hundred percent talking to people. I don't think I enjoyed them as much as the ones where I had a mix of both. I think the role of analytics engineering really encapsulates that. (29:45)

Juan: It gives you a chance to talk to stakeholders and understand their business but also to nerd out about very technical stuff. You work with containers and think about how to optimize table partitions. It has a little bit of both. (30:09)

Python, SQL, and the technical toolkit for success

Alexey: Let's talk about the technical stuff. I see in the live chat that we have Mark Lamberti from Airflow who says he doesn't know enough about analytics engineering. (30:35)

Alexey: I see another comment from Salo Mayor wondering what they need to know from the technical point of view. Can you tell us what they need to know? You mentioned containers, Python, and databases. I guess modeling is one thing conceptually, but how do you actually optimize the databases and the tables? (30:45)

Juan: I try my best to keep this role fifty fifty. If you ask me what you need to learn, I am going to try my best to give you fifty percent things that are very technical and fifty percent things that are more behavioral or business. If we focus just on the technical, the lazy answer is going to be DBT. We are probably going to discuss later why exactly that is the lazy answer. (31:22)

Juan: If we pretend we live in a world without DBT, the answer would definitely be SQL. Modern data platforms like Databricks, Snowflake, BigQuery, or Trino are very interesting things to learn. Some other concepts like zero copy cloning and retention periods tend to come up more often. (31:54)

Alexey: What is zero copy cloning? (32:31)

Juan: I think most tools have it but I believe Snowflake was the one that started this. It is possible to have copies of your data that are purely semantic. Instead of actually having two exact copies of your data which essentially duplicates storage, it is like symlinks on your computer. (32:37)

Alexey: Like shortcuts, where you have a big file in one folder and you want the same file in another folder. Instead of creating an entire copy of a one-gig file, you create a symlink which points to the original file, so the file also appears in the other folder? (32:57)

Juan: Yes. It is kind of like a pointer to the actual data plus a list of the differences between this data and that one. Now if the differences become too big, I believe some things happen in the background. But the point is there are quite a lot of features that come with modern warehouses that are very interesting. Definitely learning at least one of the modern ones is nice. (33:18)
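For reference, this is roughly what zero-copy cloning looks like in Snowflake, sketched from Python with placeholder credentials; the database and table names are invented. The CLONE statement is documented Snowflake syntax, and the clone shares the source table's storage until the two diverge:

```python
import snowflake.connector  # requires the snowflake-connector-python package

# Placeholder connection parameters.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="***",
)

# Metadata-only copy: no data is physically duplicated at creation time.
conn.cursor().execute(
    "CREATE TABLE analytics.public.dev_orders CLONE analytics.public.prod_orders"
)
```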

Juan: I would dare to say Python is very important, specifically how people apply DevOps practices to Python code. Learning frameworks like pytest is interesting, as is how people containerize Python work. (33:55)

Alexey: I want to interrupt you again. How exactly do you use Python? To me the work of an analytics engineer seems to be having DBT, so you have your source data and your data models. You want to transform data from a source to a destination and you use SQL for that. DBT takes care of orchestrating this whole thing and this is all done in SQL. Where does Python come into play? (34:07)

Juan: I started my data career in very early data science, when there was still Python 2.7 and most people were hesitant to move to Python 3. Back then the understanding was that SQL was nice for quick and dirty work, but serious data work had to be done in Python or Spark. That has shifted, because modern tools brought a lot of the things that SQL was missing. (34:45)

Juan: This includes the capability of testing your SQL code and running it in different environments. When DBT started, it was SQL only, but now it supports both SQL and Python models. Even if you don't do any of the data modeling work in Python, most of the tools that we work with have been built in Python or have a Python wrapper. (35:14)

Juan: You can install DuckDB, Airflow, or DBT simply with a pip install. Python is the glue that keeps multiple tools together. For example, other processes that are adjacent to data modeling like orchestration or ingestion are still done in Python. Even though Python is optional at the beginning, it is super useful. (36:00)

Juan: It puts data modeling work in the context of your whole organization. With a little bit of Python, you can ingest data somewhere or get a message when the data is done ingesting. You can create an API where people can get the data from it. (37:03)
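A minimal sketch of that glue role, with a hypothetical API endpoint and webhook URL: pull data, land it in DuckDB, and send a message when ingestion is done:

```python
import duckdb
import requests

# Hypothetical endpoint returning a list of {"email": ..., "source": ...} records.
rows = requests.get("https://api.example.com/signups").json()

con = duckdb.connect("warehouse.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS raw_signups (email TEXT, source TEXT)")
con.executemany(
    "INSERT INTO raw_signups VALUES (?, ?)",
    [(r["email"], r["source"]) for r in rows],
)

# Hypothetical webhook: notify whoever is waiting that the data has landed.
requests.post(
    "https://hooks.example.com/notify",
    json={"text": f"Ingested {len(rows)} signup rows"},
)
```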

Alexey: While talking about Python, I remember my time as a data scientist, when I needed to build a lot of transformation pipelines to feed into the models. I did not like using SQL because for me it was hard to read, modularize, and test. Analysts really liked it because it is simpler. When you already have a data platform, you can just take your SQL stuff and run it. (37:23)

Alexey: You can schedule an Airflow job to run it on the warehouse. There are a lot of benefits to SQL. From what I understood, analytics engineers can use both. But there is also testing in SQL? Maybe you can tell us more about this. (38:10)

Juan: To be honest I still believe most of the work is still done in SQL, but it is worth mentioning DBT supports models in Python as well. (38:35)

How to stop manually testing your data dashboards

Juan: One of the reasons why when I first started people saw SQL as quick and dirty is because there was no good way of testing it. Whenever you test anything data specific, you always have to break it down into at least two components. One is testing your data and the other one is testing your code. (38:41)

Juan: Let's talk about testing the code because that is where DBT is excelling and really exceeded expectations compared to what we had before. I want to make it clear because somebody in the comments will say DBT does not do this. Quite often when I talk about it, it is because I lived through what we had before. (39:11)

Juan: I am not saying DBT is perfect. In twenty years our kids are going to laugh about how primitive our tools were. Back in 2018 I had a client and we were building a very important dashboard for them. It was exposed to many people through a government web portal. Their way of testing was a checklist of things that had to be checked manually before we published. (39:54)

Juan: At first it was fifteen items and testing took ten minutes. Then it got to seventy items and testing was very tedious and annoying. You had to make sure the colors were exactly these ones and no filters were selected by default. You had to check that if you filtered for this building at this time, it was the right value. (40:32)

Juan: Sadly because of the tools we were working with, all of that had to be done manually. Now I realize ninety nine percent of this could have been automated with something like pytest if we had chosen a BI tool that had a way to interact with dashboards using code. DBT allows you to test the SQL code in many ways. (41:04)

Juan: The first one is through what they call generic tests. A couple of years back they changed the terminology but let's keep it simple. What was called generic tests are little flags that you can add to the documentation of the field. You can say if it is unique, what the accepted values are, or if it should have a one to many relationship. (41:36)

Alexey: There is a library called Great Expectations. I think this is what it was doing. (42:11)

Juan: Exactly. With Great Expectations, you can expand this to all sorts of things: the values of this column should match this regex, or should be between certain ranges. Definitely look for DBT Expectations to see all the ways that people test generically. You do that mostly at the column level. (42:17)
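In DBT these generic tests are flags in a YAML file rather than code you write. To show what they actually check, here is the same idea expressed as plain SQL assertions run from Python, against a hypothetical clients table:

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")

# unique: duplicated emails come back as rows, which fails the assertion.
dupes = con.execute("""
    SELECT email, COUNT(*) AS n
    FROM clients
    GROUP BY email
    HAVING COUNT(*) > 1
""").fetchall()
assert not dupes, f"email is not unique: {dupes}"

# accepted_values: anything outside the allowed list is an error.
bad_status = con.execute("""
    SELECT DISTINCT status
    FROM clients
    WHERE status NOT IN ('active', 'churned', 'trial')
""").fetchall()
assert not bad_status, f"unexpected status values: {bad_status}"
```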

Juan: The second way that you test is at the table level. They call those singular tests which is writing a snippet of SQL. If it returns more than zero rows, it is an error. For example, I was working on a project with vehicles that left a warehouse. Every vehicle had to be accounted for twenty four hours of the day. (42:46)

Juan: Sometimes when we would do joins, we were missing a day and accidentally filtered something out. All of a sudden we didn't have data for this truck on this day or at these hours. That was a lifesaver, because at least a couple of times we accidentally exploded the join because of a many-to-many relationship. Singular tests told us there were more than twenty four hours or a vehicle was not accounted for. (43:22)
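A hedged recreation of that singular test, with invented table and column names. A singular test is just SQL that returns rows only when something is wrong; here, anything other than exactly 24 hourly rows per vehicle per day is a violation:

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")

# Fewer than 24 rows: data was filtered out somewhere.
# More than 24 rows: a join exploded on a many-to-many relationship.
violations = con.execute("""
    SELECT vehicle_id, activity_date, COUNT(*) AS hour_rows
    FROM vehicle_hours
    GROUP BY vehicle_id, activity_date
    HAVING COUNT(*) <> 24
""").fetchall()
assert not violations, f"vehicles with missing or duplicated hours: {violations}"
```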

Juan: They also have unit tests that allow you to provide data. The best example is an email regex. You write a formula with a regex to determine if an email is valid. You pass it a list of emails and say if I give you this it should be true, and if I give you that it should be false. (44:01)
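Juan's email example, sketched as a pytest unit test: feed the same expression the model would use known inputs and assert the expected outcomes. The regex is a deliberately simple illustration, not a complete email validator:

```python
import duckdb
import pytest

EMAIL_REGEX = r'^[^@\s]+@[^@\s]+\.[^@\s]+$'  # simplistic on purpose

@pytest.mark.parametrize("email, expected", [
    ("ada@example.com", True),   # valid address
    ("not-an-email", False),     # no @ at all
    ("missing@tld", False),      # no dot after the @
])
def test_email_regex(email, expected):
    result = duckdb.execute(
        "SELECT regexp_matches(?, ?)", [email, EMAIL_REGEX]
    ).fetchone()[0]
    assert result == expected
```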

Juan: Those are common ways but the interesting part is you can also put DBT inside of your CI run. It is possible to have it so that whenever you are trying to push your code into GitHub, GitHub automatically runs the code. At the bare minimum it tells you this code is not running and I am not going to let you merge this to our codebase. (44:24)

Juan: You can expand it to say you won't merge if the code is very expensive or it doesn't match what you were explicitly told to match. It can be up to the data level where you say the KPI for this month in this country should be equal to this. If it is not matching, something went wrong. Looking back at that dashboarding project, I would have loved to automate those manual tests. (45:09)
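The CI gate Juan describes can be as small as a script the pipeline runs on every push, where a non-zero exit blocks the merge. This sketch assumes a DBT project with a target named ci in its profiles:

```python
import subprocess
import sys

# Compile, run and test the models; 'ci' is an assumed target name.
result = subprocess.run(["dbt", "build", "--target", "ci"])

# A non-zero exit makes the CI job fail, which blocks the merge.
if result.returncode != 0:
    print("dbt build failed; refusing to merge this change.")
    sys.exit(1)
```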

Alexey: We don't cover testing or CI/CD in the course even though they are quite important. We cannot cover everything. These things are usually what software engineers already know. For somebody coming from an analytical background, these are new terms. If you are watching this and doing the course, these are some things you should check after the course. (45:57)

Bringing software engineering rigor to data workflows

Alexey: You create a project and then once it is complete, you think about how you do testing for this and how you run these tests automatically when you push code to GitHub. There are so many coding assistants around us like Codex, Copilot, or Claude Code. You can just ask them to assist you and you will learn along the way because Juan introduced these concepts. (46:34)

Alexey: At least for me, this is how I learn many things these days. I ask an agent to implement this and then I try to figure out what is happening there. I totally recommend trying this after the course. (47:25)

Juan: I recently built my own website. It is fascinating because I asked the AI to coach me on how to do it. Some parts I knew how to do, some parts I asked Claude to coach me and explain what happens in this situation, and some parts I just had it do for me. That synergy went so well and I am super happy with the end process. (47:43)

Juan: I think I could have mindlessly vibe-coded the whole thing, but I also took the opportunity to get some coaching. I learned a thing or two about websites. Learning everything about GitHub and Git is very useful. Most people in the data world know the basics, but I wish I was better at GitHub in general. (48:15)

Alexey: There is a question about what technologies the course covers. I included a link in the description. Just Google Data Engineer Zoomcamp and you will find an article and a GitHub repo with the answers. I don't want to spend a lot of time on that because we have questions from the audience. (48:50)

Alexey: The most important question is from someone exploring a move towards analytics engineering. What books, docs or resources helped you most when you were learning data modeling and analytics engineering? (49:29)

Must-read books and resources for mastering the craft

Juan: I am surprised that we have spent this long without talking about DBT. Let me get the book out of the way first. Fundamentals of Analytics Engineering. (49:50)

Alexey: People who are listening to this and not seeing the video, can you describe what you showed? (50:00)

Juan: The book I showed is Fundamentals of Analytics Engineering. I am one of the co authors. The selling point is that we try to make it evergreen. Each chapter is linked to one responsibility of an analytics engineer like data ingestion, data modeling or collaborating with code. We try our best to not give code examples that might be irrelevant in two years but to talk about it functionally. (50:05)

Juan: There are six or seven books that talk directly about analytics engineering. Ours does really well at being timeless. I am confident that it will still be relevant in five years. It is not a book about AI where things change in two months. Check everything DBT. (50:43)

Juan: DBT has amazing free courses and great documentation. I sometimes think about it like how you have to learn the piano if you want to learn music theory. Knowing DBT doesn't make you an analytics engineer, but it is the way you put in practice knowledge about data modeling, business skills, stakeholder management, and engineering best practices. It is a really good platform to get started and acquire adjacent skills. (51:20)

Alexey: To me it seems like analytics engineering and DBT are synonymous because we always talk about them together. Does it have to be this way? I know there are alternatives like SQLmesh. Are there other things? (52:11)

Juan: DBT also rode the wave early. In 2019 it was actually a blog called Locally Optimistic that coined the term analytics engineering. DBT started riding that wave very soon after and they really managed to become synonymous with the field. From an academic perspective I do believe this discipline can live outside of DBT. (52:29)

Juan: Technically you could apply engineering rigor to working directly with SQL. But DBT rode that wave early and tried to embody doing analytics with software engineering best practices. A lot of that has been built into their product. They are the ones advocating for this term and evangelizing it. They have spent lots on marketing to make sure people know it. (53:32)

Juan: Sometimes I see on LinkedIn that people say you don't need DBT to be an analytics engineer. There is some merit to that, but DBT has spent a lot of time and effort educating the community. They teach that if you are going to do analytics, you should be following DataOps or software engineering best practices. Their platform comes integrated with this, so that it is very easy for you to do. (54:16)

Juan: Things like working with multiple environments, testing your data and documenting your data come very naturally to DBT. Try to find a nice balance between the tool and the concept. (55:01)

Alexey: If DBT is the piano, what is SQLmesh? (55:25)

The future of dbt and the shifting tool landscape

Juan: The guitar? It is a tricky one. DBT was there before SQLmesh. Let's go back to last summer. Lots of people were super interested in SQLmesh because they brought some ideas that DBT was not implementing. (55:42)

Juan: In Terraform, if you change code to remove an asset and do a Terraform apply, it notices that cloud resource is not in your code and deletes it. SQLmesh had something very similar in the sense that it had knowledge of the state of your cloud resources. With DBT, if you code something with a name, deploy it and then rename it, the old one is still going to be deployed. The cleanup is not that simple. (56:00)

Juan: SQLmesh had great ideas, like being more declarative and using less Jinja. As of October, Fivetran acquired them. My initial reaction was that there would finally be a real competitor to DBT, because Fivetran would invest. Then maybe two weeks after that, they announced DBT and Fivetran merging. (56:46)

Juan: The merger is not finalized yet. It is possible that both DBT and their second-best option are going to be under the umbrella of the same company. Both have open source versions, so nothing is stopping you from forking them and continuing forward with your own version. I am not a business or VC expert, but it doesn't look good for SQLmesh. (57:30)

Alexey: I think they are going to merge them into one. That would be good actually if they do this. For the founders, maybe that was a good exit? (58:21)

Juan: I have no idea. But I think they did it wisely, because if people did not want their company merged with Fivetran, migrating to SQLmesh was immediate. DBT projects were fully compatible with SQLmesh. Now they have both. The two tools that really embody analytics engineering best now have some sort of connection with Fivetran. (58:27)

Alexey: I wanted to ask one question from Brian. We are taking more time than I planned. Brian is asking if it is relevant to build a data warehouse using data modeling now when we have all this lakehouse architecture, Iceberg, Dremio and things like that. (59:38)

Deciphering the lakehouse: Warehousing in the cloud

Juan: I don't see why they are opposites. Generally speaking, let's talk more about the concept of a lakehouse. (1:00:29)

Alexey: A data warehouse is where we keep our data. It is a database optimized for analytical queries. We cover it in our data engineering course. What is a lakehouse because we don't cover that? (1:00:41)

Juan: Data warehouses are databases for analytical purposes where data is properly modeled and integrated. Data lakes involve dumping data directly from a system somewhere. Nobody has taken the trouble to integrate that data. In a data lake, there often isn't a clients table. It is more like data coming from Google Analytics or the CRM. (1:01:10)

Juan: Lakehouses have essentially managed to decouple both. You can still dump the data as it lands in a data lake but you still have a business or logical representation on top that people can query. You can definitely do both and they don't have to contradict each other. Some people think just dumping all data from a source system into a folder is enough. (1:01:56)

Juan: That is the easy part. The difficult part is to put that data together into something that resembles the business. Reconciling data from all sorts of systems and turning it into something obvious for people is the most difficult part. Data lake projects are a lot faster than data warehousing projects because that integration is so hard. (1:02:45)

Alexey: I want to summarize to understand. We have a meetup system where people can sign up for events. If I want to build analytics on top of people who come to our events, I can get a dump of all emails to a folder in S3. That would be our data lake. (1:03:12)

Alexey: I would have three folders for Meetup, Luma and LinkedIn with raw data dumped to S3. To query it and get insights, data engineers set up a process that pulls fresh data every day and puts it into a specific folder in the data lake. This data is accessible and I can use something like Presto to query it. (1:03:44)

Alexey: It is not a warehouse yet because it is just a bunch of JSON or Parquet files. In order for this to turn into a warehouse, we need to do data modeling. We need to model the table which would include the source and common column names like email. Then I can build a dashboard to understand which platform people are coming from. (1:04:30)

Alexey: The data warehouse could be Snowflake, BigQuery or ClickHouse. Analysts work on those, not data lakes. A lakehouse is the best of both worlds? (1:05:10)

Juan: It is the underlying data lake with, on top, a logical representation that you can query with SQL as if it were a data warehouse. Even though you can build stuff directly by querying a data lake, turning the data into something intuitive that speaks to people is a very valuable effort. It is a lot of work to turn all those source systems into something called a customer table. (1:05:37)

Juan: But that effort is worth it. These days we are in this paradigm of modeling the data into something that is easy to maintain and intuitive for whoever is querying it. For a new analyst or data scientist, it is going to be a lot more intuitive to know where to find things. (1:06:25)

Alexey: A lakehouse would be something between a data lake and a warehouse. Instead of putting things in BigQuery, we would create another folder in our data lake. We would already do all the data modeling and define the final representation of our data. Instead of saving it into BigQuery, we save it into a folder in our data lake. (1:07:04)

Alexey: All the analysts and data scientists could query this data lake folder. It will be slower because it has to fetch the data from S3, but we will have the benefits of having done the data modeling part already. The fields are clear and understandable and we can do analytics. It is just slower? (1:07:44)

Juan: Not necessarily slower. The whole lakehouse concept is something that Databricks invented for marketing. I think Lakehouse explains best the difference between analytics engineering and data engineering. Data engineers are the ones building the data lake by taking data from here and dumping it there. (1:08:24)

Juan: A lot of times data engineers don't want to do the business integration or bother with data quality. Analytics engineers take whatever gets dumped into the data lake and build the business logical representation. Some tools like Fabric or Databricks work that way. Snowflake also has a data lake integrated where they save data as files and you build the logical representation. (1:08:59)
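Putting Alexey's example together as a hedged sketch: the raw dumps stay as Parquet files in S3 folders (the lake), and a modeled layer gives them one shape. The bucket, paths, and the assumption that every dump exposes an email column are invented; DuckDB reads S3 through its httpfs extension:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # S3 support; credentials not shown

# One modeled table on top of three raw folders in the data lake.
con.execute("""
    CREATE TABLE signups AS
    SELECT 'meetup' AS source, email
    FROM read_parquet('s3://events-lake/meetup/*.parquet')
    UNION ALL
    SELECT 'luma' AS source, email
    FROM read_parquet('s3://events-lake/luma/*.parquet')
    UNION ALL
    SELECT 'linkedin' AS source, email
    FROM read_parquet('s3://events-lake/linkedin/*.parquet')
""")

print(con.execute("SELECT source, COUNT(*) FROM signups GROUP BY source").fetchall())
```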

Alexey: I see questions about the data engineering course. If someone has no experience, this course will help. Next week on Monday the course starts and I will share a link to the Luma event. Spend some time reading the course curriculum and come to our launch stream to ask questions. (1:10:00)

Alexey: What are good resources you recommend to someone who wants to learn data engineering involving data lakes? (1:10:53)

Pro-tips for starting your data engineering journey

Alexey: I cannot give any other recommendation but our Data Engineering Zoomcamp. Juan, do you know any other resources? (1:11:16)

Juan: Last year my role was to teach junior data engineers. Databricks has some really good free courses. If you don't know anything about data engineering, start at the data warehouse or lakehouse part. Learn Databricks or Snowflake. Astronomer has amazing courses if you want to learn about orchestration. (1:11:31)

Juan: The Linux Foundation has courses in Bash, Linux, Kubernetes and all cloud providers like Azure, GCP or AWS. Pick the ones specifically branded towards data such as the GCP Data Engineer Professional. Pick one. If you don't know where to start, go for Databricks. Plenty of companies are doing amazing free content. (1:12:24)

Alexey: BigQuery, Databricks or Snowflake? (1:13:25)

Juan: I used to be the biggest Snowflake fan until around 2022. In the last two years I have been more team Databricks. I have always really liked Databricks; I just thought that before the pandemic, Snowflake's offering was a bit more solid. Databricks used to be a nightmare to administer, with manual clusters and no oversight. (1:13:31)

Alexey: They were focusing on Spark primarily? (1:14:14)

Juan: Yes but now it is becoming a lot easier to work with. They are doing amazing stuff. It is a bit like comparing fridges. You are not going to buy a new fridge if a new model comes out that is ten percent better. (1:14:20)

The big debate: Databricks vs. Snowflake

Juan: This is relevant for people who haven't learned either or are looking for the next organization. Databricks has a more solid offer and I am enjoying working with them a bit more. But both are very solid. Snowflake has a project on an open semantic layer that is supposed to be universal. If that pans out, that could be massive. (1:14:40)

Juan: I am a huge fan of Unity and Delta. Snowflake might come up with something amazing like selecting star. The data landscape is so fast that in two months I could have a completely different opinion. (1:15:45)

Alexey: Did you try ClickHouse? (1:16:26)

Juan: No but I heard amazing things. Some people I really admire have recently been posting on LinkedIn that they are working for that company. I should check what those people are up to. I love DuckDB. (1:16:31)

Alexey: DuckDB also has MotherDuck? (1:16:52)

Juan: Yes, that is their paid offering. Most of my personal projects now use DuckDB. (1:16:58)

Alexey: For the course there are two setups: the cloud setup and the local setup. I changed the local setup from PostgreSQL to DuckDB because it is easier to set up. It is very lightweight and super easy to install. The UI looks very nice and sleek. They have full support for analytical SQL shortcuts like qualify. (1:17:18)
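For the curious, here is the QUALIFY shortcut Alexey mentions, in DuckDB with invented data: it filters on a window function without wrapping the query in a subquery, here keeping each person's most recent signup:

```python
import duckdb

duckdb.execute("""
    CREATE TABLE signup_events AS
    SELECT * FROM (VALUES
        ('ada@example.com',   DATE '2024-01-01'),
        ('ada@example.com',   DATE '2024-03-01'),
        ('grace@example.com', DATE '2024-02-01')
    ) AS t(email, signed_up_at)
""")

# Keep only the latest row per email, no subquery needed.
latest = duckdb.execute("""
    SELECT email, signed_up_at
    FROM signup_events
    QUALIFY row_number() OVER (PARTITION BY email ORDER BY signed_up_at DESC) = 1
""").fetchall()
print(latest)
```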

Alexey: Because it is so lightweight, you can spin it up and test it inside your CI pipeline. I find myself using it for more and more things. You mentioned that you hosted a DBT meetup? (1:18:03)

Why every data professional needs a local community

Juan: Yes. I have been hosting meetups since 2018. Currently I am the organizer of the Dutch DBT meetup. In the past I founded the analytics engineering meetup and the Dutch data visualization meetup. I love meetups. When I was younger I really used to like organizing parties at my house and getting people together. (1:18:28)

Juan: This is the corporate equivalent. For the company hosting, it is amazing marketing to get the undivided attention of knowledgeable data people for two hours. For the people presenting, it is an amazing career opportunity. For some people it is therapy to complain about things. Sometimes you are the only person in a company that understands your job. (1:19:05)

Juan: Being in a room with people that have similar issues is relieving. What I really get out of it is meeting interesting people. A meetup is a really good excuse to contact people and ask them to give a talk. Organizing a meetup costs me five to ten hours but it feels like it does good stuff for the local data community. (1:19:52)

Alexey: Where do you do them? (1:20:44)

Juan: Amsterdam. I try my best when I am abroad to visit them. I have a goal of talking in ten different DBT meetups throughout my career and I am currently at eight. It is super fun to see how the communities are slightly different. (1:20:44)

Alexey: Is it meetup.com? (1:21:13)

Juan: Yes. Please check it out. (1:21:20)

Alexey: It was a big pleasure for me to talk to you, Juan. Thanks a lot. We didn't touch on many things. I went into the technical side and we missed the nontechnical side, even though you said it is fifty-fifty. But we all understand it is important. Many people stayed until now, which is cool. Thanks everyone for joining us and asking questions. (1:21:45)

Juan: Feel free to connect with me on LinkedIn. I hope to see many of you in the course. I am currently recording a lot of the videos you are going to see. Thank you Alexey. I am a huge fan of all the cool stuff you do for the community. I enjoy how you are bringing accessibility to this for anyone who wants to follow for free. (1:22:30)

Alexey: Please come next week on Monday for the course launch. We will also have a workshop with Juan about DBT which will become a part of the course. It will be pre-recorded so you can see it earlier. It is going to be a lot of fun. Looking forward to seeing more content from you. (1:23:11)

Juan: Perfect. Look forward to working with you in the future. Bye everyone. (1:23:45)

