Season 3, episode 11 of the DataTalks.Club podcast with Victoria Perez Mola
Links:
The transcripts are edited for clarity, sometimes with AI. If you notice any incorrect information, let us know.
Alexey: This week, we'll talk about a new role in the data team. This role is the analytics engineer. We have a special guest today, Victoria. Victoria works as an analytics engineer. She has a background in system engineering and she works as an analytics engineer at Tier in Berlin. Overall, she has over five years of experience working with ERP systems – reporting and databases. She's very passionate about using technology and she wants to help people to make their daily tasks easier. But in her free time, she likes to encourage people to enter the IT world by volunteering, teaching and mentoring. Also, Victoria is one of the first people who joined DataTalks.club – one of the first 10 or 20. Welcome, Victoria! (1:48)
Victoria: Thank you for having me. (2:42)
Alexey: Before we go into our main topic of what analytics engineering is, let’s start with your background. Can you tell us about your career journey so far? (2:45)
Victoria: As you said, I started as a systems engineer back in Argentina – that's computer science there. I knew a lot about accounting because my parents are accountants and I worked with them for several years. Since I like accounting as well, it felt super natural to me to start working with ERP systems. I was helping accountants use ERP – to use technology to calculate payroll and taxes. Then from there, I came to Berlin in 2018. I continued working in a similar role – helping the finance team, defining and making reports, connecting the ERP systems with other systems, and automating processes. Eventually, I wanted to move away from the ERP world, and that's how I ended up in the analytics engineer role. I've been here now for seven months as an analytics engineer and I’m enjoying it a lot. (2:55)
Alexey: What do you do as an analytics engineer? What kind of responsibilities do you have? What does your typical day look like? (4:05)
Victoria: I do a lot of data modeling. I also work a lot on data quality and data availability. On a day-to-day basis, I could be building new pipelines, making data available, and building models. I also clean that data so it's available to the data analysts and data scientists, by exposing it in Looker. And I work on it if something goes wrong – if something fails, I have to chime in and check why the data is not available or why the data is not clean. (4:14)
Alexey: You said modeling – you probably mean data modeling, right? (5:00)
Victoria: Yeah, data modeling. (5:03)
Alexey: That would mean creating diagrams of what the data looks like and what the entities in the data are, right? (5:05)
Victoria: Yeah, but it’s more about writing that. More writing the SQL I mean. Building the models around the data, creating the tables, or views, and writing the queries behind it to model the data so that the data can be used for analysis. (5:15)
Alexey: You mentioned a tool called Looker, which is a tool for building dashboards, right? Or what is it? (5:40)
Victoria: Yeah. Looker is a BI tool, similar to Tableau. There, you can write queries as well. Then you can create dashboards and build reports as well. That would be the tool that gets exposed to the end user. The business users are going to be consuming the data from Looker. Then the data team uses DBT for doing the whole modeling. DBT is a transformation tool. We take the data that comes from the data pipelines, from either our backend events, or maybe our external data. Then with DBT, we transform that data and we make the models. We do everything we need to either clean the data, change it, or maybe calculate something – things like that. We try to do everything in DBT. (5:47)
Alexey: This “DBT” is a transformation tool you said. What does it do exactly? You just write a bunch of SQL queries and put them somewhere? (6:49)
Victoria: Yeah, you write a lot of SQL queries with it. But it has very good things. If you think about the software development process that you normally study in university, and then you think about data – they normally don't get along very well. DBT brings that to the data team and the data work. You have all these SQL files, but you also have YML files, where you have documentation about the models that you're writing. Everything is in a GitHub repository, so you have version control as well. In my experience, it is sometimes very hard to introduce something like version control in data, but it's also extremely useful to have. You can also write tests in DBT about your data. You can write your own, or use the standard ones, like unique and not-null tests. The nice thing about it is that you don't really have to write the DDL. You wouldn't have to write CREATE TABLE, CREATE VIEW, DROP TABLE – you just write a SELECT and it does the whole thing for you. It compiles the code afterwards. (6:59)
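To make the “you just write a SELECT” idea concrete: the sketch below is not dbt's actual implementation, just a toy illustration in Python, using the built-in sqlite3 module in place of a real warehouse, with made-up table names.

```python
import sqlite3

def materialize(conn, model_name, select_sql):
    """Wrap a plain SELECT in the DDL the user never has to write,
    rebuilding the table from scratch on every run."""
    conn.execute(f"DROP TABLE IF EXISTS {model_name}")
    conn.execute(f"CREATE TABLE {model_name} AS {select_sql}")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", [(1, 10.0), (2, 15.5)])

# The model itself is only this SELECT -- no CREATE TABLE, no DROP TABLE:
materialize(conn, "orders_clean",
            "SELECT id, amount FROM raw_orders WHERE amount > 0")
print(conn.execute("SELECT COUNT(*) FROM orders_clean").fetchone()[0])  # → 2
```

In dbt the wrapping, dependency ordering, and compilation are far more sophisticated; this only shows the division of labor – the engineer writes SELECTs, the tool owns the DDL.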
Alexey: DBT is a tool, as you said, for transforming data – you write a bunch of SQL queries and then it takes care of actually creating this table. Is that right? You said you don't need to worry about the DDL. Then, you can also do tests with this tool, right? Check that the data quality is good. And I guess you can also schedule the queries? To run them on a particular day, for example? (8:25)
Victoria: Yes. It manages the dependencies. It even builds the DAG, and you can see how the models connect to each other. Let's say we have a source model and, on top of that, you build a dim table. Then you could run the dim table together with everything that comes before it, from the source, and everything that comes after it. This is very nice – you can easily check dependencies downstream and upstream as well. DBT is open source, but they also have a paid part – enterprise, I think it’s called – which is DBT Cloud. There, you can schedule everything. You can set the schedule and run your models depending on tags that you put on tables, views, and things like that. We have some of those. For example, some of the data refreshes every hour, and some of the data only refreshes during the night. (8:59)
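The “run everything upstream or downstream of a model” idea can be pictured as a walk over the dependency graph. A minimal sketch – the model names are invented, and real dbt selection syntax does far more:

```python
# Parent lists for each model -- a made-up four-model DAG:
# src_rides -> stg_rides -> dim_vehicles -> rides_report
DEPS = {
    "src_rides": [],
    "stg_rides": ["src_rides"],
    "dim_vehicles": ["stg_rides"],
    "rides_report": ["dim_vehicles"],
}

def children(model):
    """Models that list `model` as a direct parent."""
    return [m for m, parents in DEPS.items() if model in parents]

def downstream(model):
    """Everything that transitively depends on `model`."""
    seen, stack = set(), [model]
    while stack:
        for child in children(stack.pop()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

print(sorted(downstream("stg_rides")))  # → ['dim_vehicles', 'rides_report']
```

An upstream walk is the mirror image over the parent lists; together they give the “run this model plus everything before or after it” behavior Victoria describes.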
Alexey: You mentioned Looker, which is a tool for end users. You mentioned DBT. Are there other tools that you also typically use in your work? (10:04)
Victoria: Yes. I also use a tool called Adlib a lot – for doing the ETL. We mostly do ELT, because the transformation is done in DBT. Adlib is the tool we use to load the data from S3. We have the data coming into S3 buckets and then we use this tool to load it into Snowflake. So – also Snowflake. We used to have Redshift before. Normally, you'd use one of these cloud data warehouses. I've also seen other companies where the folks in the analytics engineer role use Python – or at least it's requested in the job postings. (10:15)
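The load step of that ELT flow can be sketched as: pull raw records from object storage and copy them into the warehouse as-is, leaving all transformation to DBT. Below is a toy version – a JSON-lines string stands in for an S3 object, sqlite stands in for Snowflake, and the field names are made up.

```python
import json
import sqlite3

# Stand-in for a file fetched from an S3 bucket:
raw = '{"ride_id": 1, "city": "Berlin"}\n{"ride_id": 2, "city": "Paris"}'

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_rides (ride_id INTEGER, city TEXT)")

# "EL" only: parse and insert, no cleaning or reshaping yet.
rows = [json.loads(line) for line in raw.splitlines()]
conn.executemany("INSERT INTO raw_rides VALUES (:ride_id, :city)", rows)

print(conn.execute("SELECT COUNT(*) FROM raw_rides").fetchone()[0])  # → 2
```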
Alexey: Do you also use them? Or is it just something that others require? (11:05)
Victoria: I saw that others do it. I don't. I use a lot of SQL. That's my main language, and then Adlib, Snowflake and Looker. (11:09)
Alexey: Okay, so Snowflake is there. You have this DBT tool, where you write SQL queries. But these queries need to be executed somewhere. And this somewhere is Snowflake, right? This is where the queries are running. (11:23)
Victoria: Yeah. The queries are running in Snowflake. (11:37)
Alexey: How did you become an analytics engineer? I think you mentioned that you were interested in accounting and you were doing all this ERP analysis and working with financial data. But then at some point, you decided to become an analytics engineer. How did this happen? And what did you need to actually do to transition into this role? (11:48)
Victoria: Yeah, it was by chance. I wasn’t actively looking for an analytics engineer role, and I didn't really know what it was before. I applied for a normal BI position – I don't even remember which position I applied for. I did all the interviews, and I think the last interview was on a Friday. Then on the coming Monday, I was told that they were going through a hiring freeze because of the Coronavirus. After a few months, the hiring manager reached out to me again and said, “Hey, we're opening. We're hiring again. Are you willing to have a chat?” Then he told me about this position and it sounded really cool – like something that I wanted to be doing. Some parts are very similar to what I was doing before, when I was also working with data warehousing and all of that, just more with the ERP data. That's how I ended up here. I even remember that at the end of that call, he said, “Okay, so then I'll send you the email with the details of the offer. And by the way, the role is called ‘analytics engineer’.” (12:14)
Alexey: Okay. So you had no idea that this is the role you interviewed for? (13:44)
Victoria: Yeah, basically. (11:46)
Alexey: Okay. That's interesting. Do you know how or why the company actually decided to have this kind of role? (11:47)
Victoria: How did they come up with the idea of hiring analytics engineers? I know that during Corona, they were thinking about how to reshape the team – how they were going to grow, make the team grow, and all of that. They were already using DBT, and the DBT folks are the ones pushing for analytics engineers, I would say. So it would make sense that it came from there, but I owe you the answer to that one. (14:01)
Alexey: When I read the job descriptions, the position really looks like a data engineer. But on the other hand, there’s this analytics component. So what is the main difference between an analytics engineer, a data analyst, and data engineer? (14:34)
Victoria: The analytics engineer is supposed to sit in the middle, between the data engineer and the data analyst. The lines are a little bit blurry, even within my team and not just across different companies. Some of us are regarded to be more on the data engineer side and others more on the data analyst side. I think that will vary. But overall, data analysts sometimes have to do a lot of data cleaning and data availability work. In reality, they have to have a lot of business knowledge as well. They should take care of analyzing the data, not cleaning the data, right? Maybe their SQL works and it has a lot of business logic, but the queries are not the most efficient, because they normally come from another type of background, not a computer science background, and they may not know good software development practices. (14:51)
Victoria: In the case of the data engineers, they are way more technical, but maybe they lack the business vision. They do have the software engineering practices, but they maybe don't have the domain knowledge. The analytics engineer is in between. They are supposed to help the analysts with their business models, but also work with the data engineers to bring in this business knowledge as well.
Alexey: They know both, right? The analytics engineer knows how to be a good analyst, and he or she also knows how to be a good data engineer. Right? (16:24)
Victoria: They’re in between, yeah. I mean, I wouldn't say that I would be a great data analyst, because everyone's different here. I have to know that part, but I wouldn't replace any of my data analyst coworkers. (16:38)
Alexey: Six months ago, we had a chat and you told me that you found this new job called analytics engineer. I asked you “What is that?” Before that, I had no idea that this role existed. But after talking to you, all of a sudden, I started noticing it everywhere. I would go to LinkedIn and I would see open positions. I would go to some Slack communities and I would see job postings there. Also on the internet, I started to see this thing. It looked like it became popular recently – at least that’s my impression. Do you know why it happened? Why did it become popular? And what is the gap that this role is trying to close? (16:54)
Victoria: Yeah, there’s definitely an analytics engineer movement starting. I left a few links about when people started to talk about this role, what they need, and things like that. There's one about Spotify. They talk about this very clearly. They were having issues where the analysts were spending too much time cleaning the data. Whenever they needed new data, they needed to set up this stream of new data, check its quality, and model it in a way that they could use. Only after that were they able to sit down and do the actual analysis. Then on the other side, the data engineers didn't really want to get more into that part – taking that work off the data analysts. That’s when Spotify decided to open this position. They said, “Okay, let's just hire a new person. We need someone else – someone in between.” (17:43)
Alexey: Are you saying that data analysts spent too much time cleaning data and solving quality issues and that data engineers didn't want to take care of it for some reason? (18:52)
Victoria: Yeah, they didn't want to get more involved with modeling this type of data, because then they have to understand what the data is for. They just wanted to build the infrastructure. Also what Spotify was seeing was that data analysts weren’t very good at writing and doing these kinds of things. They weren't writing the best code. They started to see that they were hiring more people to do this. And then they said, “We need a person to do this as a full time job.” Then they opened this position and they invented a title, which wasn’t analytics engineer originally. It was something else along the lines of ‘data specialist’. That's how they hired their first analytics engineer. The guy that was hired as the first Spotify analytics engineer said at the beginning, “Hey, this title that you gave me is crap.” Then they maybe found a blog or something else and changed it to ‘analytics engineer’. (19:05)
Alexey: Okay, so it started at Spotify, right? And then other companies noticed it and also decided to follow? (20:11)
Victoria: I don't know. They (Spotify) did it in early 2018, and every blog seems to say that the role started over there. I wouldn’t pinpoint that “Yes, it was Spotify.” But people started noticing this missing role ‘in between’. One that allows analysts to actually analyze data, and for data engineers to actually just care about the infrastructure and the pipelines. That’s how someone came up with the analytics engineer role. (20:19)
Alexey: Yeah, I actually thought that data engineers should take care of data quality. But I never thought that they wouldn’t need to have this domain knowledge that maybe is difficult to acquire, while analysts have it. Yeah. Before this interview, I wanted to check a couple of positions (job postings) for an analytics engineer and see what kind of requirements there are. I didn't check Spotify, but I found a job posting from Airbnb. (20:52)
Alexey: They have these requirements: the first is “understand data needs by interacting with data scientists and data engineers.” The second one is “architect and build data pipelines with data engineers.” The third one is “be a data expert and own data quality.” I think we already talked about all these things – data pipelines and taking care of data quality. The fourth one is “build and improve data tools for auditing, error logging, and so on.” And the fifth one is “design and build dashboards to enable self-service.”
Alexey: Do you think this is an accurate description of the requirements that analytics engineers have, in general?
Victoria: Yeah, I would say so. It goes from the pipeline to the BI tool. But since it's such a new position, it's going to change from company to company. For example, Spotify also talks about something quite similar. They talk more about being the “data owner” and checking on data quality. But in other companies – I've seen it at Trade Republic – they don't even mention data pipelines. The role seems to be more on the business side, so it’s going to lean a little bit more towards the data analysts. (22:25)
Alexey: So there is a wide spectrum. On one side you have the data analysts, on the other side you have the data engineer, and there is a whole spectrum of how you can mix the two to arrive at the analytics engineer. You can take 50/50 – I think in the case of Airbnb, my impression is that it leans more towards the data engineer, because there is a lot of work with data pipelines and data tools. But still they have this “design and build dashboards” part, which is more of what analysts would typically do. So they have maybe 70% data engineering and 30% data analysts. Then you said Trade Republic, is more maybe on the other side of the spectrum. So maybe 70% analysts and 30% engineers. (23:13)
Victoria: They write a lot about this in the Netflix blog. I really like the Netflix blog on Medium. They have an article, which I also put in the links, where they talk about this. It looks like it also varies from team to team – some teams are going to need analytics engineers who are more technical than others. There's also another link that I put about Nubank. They also use DBT, while Airbnb doesn't seem to use DBT – even the tooling changes from company to company. Nubank even has a comparison of what they expect in the analytics engineer profile versus the data engineer and the data analyst. In the case of the analytics engineer, it leans more towards analytics and reporting, data pipelines, data modeling, and the whole data quality and data sharing part. (24:10)
Alexey: The last comment in the live chat that I just noticed is the comment from Lufassa is “Data engineers are mainly focused on infrastructure. They don't leverage insight and curate data. Analytics engineers would take the data and carefully curate it, so analysts can streamline and use the data easier.” Do you agree with that? (25:18)
Victoria: Yeah, I would agree. I didn't hear the name properly – I was thinking it was my coworker. (25:41)
Alexey: Lufassa. But maybe I’m mispronouncing it. What's the right way of pronouncing? (25:50)
Victoria: No, it's not. It looked a little bit like it. (25:57)
Alexey: Yeah, but there’s Alan who says, “Victoria rocks,” maybe he’s your colleague? (26:00)
Victoria: No, I don't know any Alan. Maybe he just realized I rock. (26:04)
Alexey: Yeah, I do agree with him. I'm continuing with the same position. We talked about requirements. After requirements, we usually have the skill section. (26:10)
Alexey: In that position, the skills that Airbnb requires from analytics engineers are, not surprisingly, SQL. The second category of skills is “distributed systems for data processing” – Spark, Presto, Hadoop and Hive. Then they have programming – Python, R – plus schema design and dimensional data modeling. And the last thing, which I liked – “an eye for design”.
Alexey: In my opinion, most of these things are technical, maybe apart from the last one. Most of these skills, apart from “an eye for design,” are what I would typically see in data engineering positions. Dimensional data modeling and schema design are also things that data engineers tend to do when designing data warehouses, or data lakes, or whatever. To me, they look quite typical for data engineers. Is this a typical skill set for an analytics engineer, or?
Victoria: If you want to apply for an analytics engineer role at Tier – and we do have a lot of openings, by the way – I wouldn't say you need all of this to apply. You definitely need to know SQL. You need to know data modeling – things like what a dim table is and what a fact table is. Basically, you need to have read Kimball’s data warehouse book. Then I would also expect something like Snowflake, but definitely not Spark, Presto, Hadoop, Hive. That looks way more data engineer-focused to me. As for programming, Python or R – our analytics engineers don't use that at the moment. It could be something that we eventually start using. We do have some Python scripts, so it's not like I've never seen Python code in the last seven months that I've been working at Tier. But I wouldn't say I needed it. To me, Python seems like something that’s easy to pick up if you know coding already – it's not so data engineering-focused. But again, at Airbnb the analytics engineer role definitely looks way more technical. (27:31)
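For readers unfamiliar with the dim/fact vocabulary Victoria mentions: in Kimball-style modeling, fact tables hold measurable events and dimension tables hold the descriptive attributes you slice them by. A tiny made-up example (table and column names invented), again using sqlite as a stand-in warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: descriptive attributes you group and filter by.
    CREATE TABLE dim_city (city_id INTEGER PRIMARY KEY, city_name TEXT);
    -- Fact: one row per event, with measures and dim foreign keys.
    CREATE TABLE fct_rides (ride_id INTEGER, city_id INTEGER, distance_km REAL);
    INSERT INTO dim_city VALUES (1, 'Berlin'), (2, 'Paris');
    INSERT INTO fct_rides VALUES (10, 1, 2.5), (11, 1, 4.0), (12, 2, 1.0);
""")

rows = conn.execute("""
    SELECT d.city_name, SUM(f.distance_km)
    FROM fct_rides f JOIN dim_city d USING (city_id)
    GROUP BY d.city_name ORDER BY d.city_name
""").fetchall()
print(rows)  # → [('Berlin', 6.5), ('Paris', 1.0)]
```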
Alexey: You also mentioned Spotify and Trade Republic? Do you remember what kind of skills they require for their positions? (29:05)
Victoria: I know Trade Republic has a very similar tech stack to the one we have. They also have Snowflake, or are about to have it. They use DBT as well. And some ETL tool as well – something around that would probably be necessary. I don't remember what it was in the case of Spotify. For example, Nubank is very similar to what we have as well. They use DBT a lot. They've been featured in the DBT blogs and everything. You would need something like SQL for sure. That is going to be what you need to be familiar with, at least with everything like data modeling. (29:15)
Alexey: When I see DBT mentioned, I often see the analytics engineer role mentioned as well. I think this is a pretty typical tool that analytics engineers use, right? (30:06)
Victoria: Yeah. (30:18)
Alexey: I think even DBT has an article about the analytics engineer role. The company itself wrote an article about the role of the analytics engineer, right? (30:19)
Victoria: Yeah. It's also in the links. DBT also has useful resources for learning and more. They're one of the main companies that started the analytics engineer movement, for sure – by writing blog posts, talking about the analytics engineer role, and being at conferences talking about analytics engineers. They've definitely started everything – the role is heavily related to DBT. When you think about an analytics engineer – or at least when I do – I immediately think of DBT. And I think it goes the other way around as well – DBT immediately brings to mind the analytics engineer role. (30:30)
Alexey: I work as a data scientist. When it comes to data science, the description of roles is a little bit different for every company. We have some data scientists at our company, OLX, who are more actually data analysts – they do the kind of work that analysts do. But in some cases, data scientists are doing the engineering work. The other end of the spectrum is ML engineers, who are sometimes also called ‘data scientists’. From what I see, when it comes to the analytics engineer role, it's very similar. There is an analyst, there is a data engineer, and then you have the whole spectrum of things in between, while every company has its own interpretation of the role, right? (31:09)
Victoria: Yeah. But I also think new data teams that include analytics engineers have a more defined data analyst or a data scientist role. As a data analyst, you usually only take care of analyzing the data, right? Because then, analytics engineers are going to take care of the rest. Same for data scientists. Data scientists complain all the time about cleaning the data. Well, they don't need to, because we are there for them. (32:08)
Alexey: Okay, so I should ask for analytics engineers at work to help me clean the data? (32:42)
Victoria: You should ask for an analytics engineer, definitely. You need one. (32:54)
Alexey: Okay, so this is who will help us clean the data? (32:58)
Victoria: Yeah. (33:01)
Alexey: You mentioned that analytics engineers help analysts and data scientists with cleaning data. They help data engineers to maybe understand the business domain better. In general, how do they work with others in the team? With product managers, for example? With backend engineers? With other people in the team? (33:02)
Victoria: In my case, my stakeholders are my coworkers most of the time – the data analysts and the data scientists. But it varies a lot. I don't have much exposure to product managers or to business stakeholders. Some of my coworkers have this exposure – some analytics engineers have more. But the idea is that the analysts are still going to be doing their work: talking to these business stakeholders to understand what they need, and then also talking to us. That doesn't mean the analytics engineers are never going to talk with business stakeholders – it could be that you also need to go directly to a business stakeholder as well. (33:31)
Victoria: In the case of the backend engineers, say there's a new event that has to come in because of new data coming in. You're probably going to have to talk with the backend engineers as well to see how they set up that event, because at the end, you're going to have to consume that. Of course you have to get involved in this case.
Victoria: Also, one thing about analytics engineers in a team and how it works. In my company, we have the analytics platform team, which is part of the data platform. Then we have other teams – operations analytics, commercial analytics – and there's one for the data scientists. We're in the platform team, together with the data engineers, and on the platform team we are all analytics engineers. But now these other teams are also each getting an analytics engineer, so it's getting decentralized. Those analytics engineers are going to be more exposed to the business stakeholders, because they're going to be inside, say, the operations analytics team. This is also in the links – I left a link where Nubank talks about how they're doing that: how they're scaling and making it so that these teams don't depend on the platform's analytics engineers, because they have their own analytics engineers in their teams. This is also a way to collaborate.
Alexey: To summarize and to make sure I understood you correctly. There is a platform team, where there are a lot of analytics engineers, and then you have separate teams, and each team would have one analytics engineer who would work with the rest of the people in the team, right? (36:00)
Victoria: Yeah, exactly. The commercial analytics engineer will only take care of the commercial topics, for example. Whereas in the platform, I'm going to be working more on the base data in general – more source client data or things like that. (36:21)
Alexey: Actually, speaking of sources – we have a question about that. “In your role, how do you deal with bad data coming from different sources, or from changing schemas?” (36:44)
Victoria: In the case of changing schemas, the tool that we have at the moment adapts to that, so we don't have much of an issue there. Of course, we have models that rely a lot on the schemas, but in that sense I would say it's not such a big issue. In terms of data quality, that's something that we're still working on a lot. I don't think that's something you ever finish – you never get to the point of “my data is 100% the best data.” But there's also a lot going on in the company, and a little bit outside of us as well. A lot of this data is coming from the backend engineers – the data has to come clean from there as well, right? Or it has to be good quality when it comes from there. (36:57)
Victoria: What we are doing right now is cleaning a lot of that in DBT, and that gives us a lot of control. We can see the code and what it's doing with the records – whether it's deduplicating, excluding, or transforming them. We do a lot of things like that. DBT also lets you create what they call macros, which are like the UDFs I would create in SQL Server, Snowflake, or whatever. We can use those, for example, to normalize the names of the cities, so we have the same names for all the cities in all of our tables – things like that. And there are also tests. We check that the cities are defined, and that if a number has to be between one and five, it doesn't go beyond that.
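The macro idea – one shared rule reused across all models instead of copy-pasted logic – can be sketched in plain Python. The rules below are invented examples for illustration, not Tier's actual cleaning logic:

```python
def normalize_city(name):
    """Shared 'macro': one spelling rule for city names, used by every
    model, so 'berlin', ' BERLIN ' and 'Berlin' all end up identical."""
    return name.strip().title()

def rating_in_range(value, low=1, high=5):
    """Shared range check: a score that must stay between 1 and 5."""
    return low <= value <= high

print(normalize_city(" BERLIN "))  # → Berlin
print(rating_in_range(6))          # → False
```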
Alexey: These tests apply to each incoming row? So for each row that DBT sees, there is a test. And if some record doesn't pass this test, you get an alert. (38:53)
Victoria: Sort of. The test is basically a query. Let's say you create your model and then you define the columns. Then you define that for this column, the test that I'm going to apply is “not null”, for example. At the end, when you run the test, it's going to select from this table where this field is null. If it returns nothing, the test passes. If it returns something, then it's going to give you either a warning or an error. The nice thing about DBT is that we can do that before building the models. What we do is check that the sources don't have errors. If they do have errors, then the models that use those sources are not going to be built. We do this so that we don't build things on wrong data. (39:04)
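The test mechanism Victoria describes – a query for the failing rows, where an empty result means success – is easy to sketch. This is not dbt's generated SQL, just the same idea in miniature, with sqlite and a made-up table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO dim_users VALUES (?, ?)",
                 [(1, "a@example.com"), (2, None)])

def not_null_test(conn, table, column):
    """Select the *bad* rows; zero rows back means the test passes."""
    bad = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()[0]
    return bad == 0

print(not_null_test(conn, "dim_users", "id"))     # → True
print(not_null_test(conn, "dim_users", "email"))  # → False
```

Gating the model builds on source tests, as Victoria describes, then amounts to running checks like this first and skipping the build when any of them fail.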
Alexey: That's how you control quality. Then I guess at some point, somebody comes to you – probably an analyst – and says, “Hey, something's wrong with this data.” And you start looking and maybe realize that you missed something in your test. Then you would add an extra test. Right? (40:01)
Victoria: Yeah. Unfortunately, we never get past the point where someone says, “Hey, these numbers don't match.” I would like it if we had a test for everything, so that we are always ahead. But I don't know if you can ever get to that point. (40:19)
Alexey: Thanks. We have another question. “How is this position related to BI analyst and BI developer? And how is it different from these two?” (40:42)
Victoria: I don't know exactly what a BI analyst would be. I would imagine something more like a data analyst. Then it would be the same as before, right? As an analytics engineer, you sit a little bit earlier in the process. You don't really do the analysis per se – you just make the data available so that the data analysts can do the analysis. And with the BI… (40:53)
Alexey: BI developer. I suspect that maybe in these BI tools, instead of having a data engineer and data analysts, you would have a BI analyst who could do the analysis. Then you would have a data warehouse developer, who would actually build the data warehouse. Which is probably synonymous to the role of a data engineer these days. (41:17)
Victoria: Yes. I would say maybe the BI developer is similar. I think it would also depend on what the company calls a BI developer. It's quite close to maybe the analytics engineer. Because I would imagine that they write SQL as well and things like that. (41:43)
Alexey: Let's say I am an analyst and I want to become an analytics engineer. How can I do this? How can I make this transition? What kind of things do I need to do to become an analytics engineer? (42:05)
Victoria: First, learn about good software development practices. What does good code look like? What are good practices to implement? Definitely learn about data modeling – read books like Kimball and maybe Inmon. There are also two links that I left. One is DBT's learning material, which is free. Of course, it's built around DBT, but they do talk about dim tables, fact tables, what those are, and things like that. There's also another repository that someone very generously built, and it has a lot of links to readings about pipelines, readings about good practices, readings about SQL. Definitely write good SQL. And learn tools like DBT. (42:27)
Alexey: You mentioned this repository with information. Is it something you'll also put in the links? (43:30)
Victoria: Yeah. It's in the learning resources. This one is very good. It’s called Analytics readings. (43:39)
Alexey: Analytics readings. Speaking of the links – there is a question: “I'm not seeing the links.” They are not in the chat, they are in the description. If you go to the description, the first link there – you click on it, and there's a Notion document with all the links that Victoria prepared for today. (43:48)
Alexey: These are good resources. You said an analyst would need to pick up some good software engineering practices, learn data modeling, and then learn DBT – the most important tool. But I think for analysts, SQL is not a problem – they already know it, right? More or less. For analysts, SQL is the tool they use maybe 80% of the time, so they should be pretty good at it already. (44:10)
Victoria: So probably not good enough to be the data analyst of Spotify. Apparently. (44:43)
Alexey: Let’s say I really like what I hear. Suppose I'm an analyst or data engineer, or somebody who works with ERP systems, and I think “This is really cool. I want to try it.” How do I make sure that this is something for me? How do I make sure that I love this kind of work? (44:52)
Victoria: If you like data modeling – figuring out how to model tables and make the data available – then the role is likely for you. Let's say you're a data analyst and you enjoy that part more – then you're getting close to the analytics engineer role. It also helps if you care a lot about data quality and about good practices, as I said before. For example, with my team, we build a lot of guidelines for everyone – for the data analysts and data scientists as well. We write about how to write code, what the good practices around it are, how to take care of tests (things like DBT tests), how to do a proper peer review, and all these kinds of things. That's something we have in common with the other analytics engineers – we care a lot about that, more than maybe the data analysts do. They care a bit more about answering questions, which is okay, because that's what they're meant to be doing. (45:16)
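The kinds of checks such team guidelines codify can be sketched in plain Python. This is only an illustration – the table and column names are made up – but the logic mirrors what DBT's built-in `not_null` and `unique` tests verify:

```python
# A minimal sketch of two common data-quality checks, in the spirit of
# dbt's built-in not_null and unique tests. Table/column names are
# hypothetical.

def check_not_null(rows, column):
    """Return the rows where `column` is missing (ideally an empty list)."""
    return [r for r in rows if r.get(column) is None]

def check_unique(rows, column):
    """Return the set of values that appear more than once in `column`."""
    seen, dupes = set(), set()
    for r in rows:
        value = r.get(column)
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes

orders = [
    {"order_id": 1, "customer_id": "a"},
    {"order_id": 2, "customer_id": "b"},
    {"order_id": 2, "customer_id": None},  # duplicate id, null customer
]

null_rows = check_not_null(orders, "customer_id")
duplicate_ids = check_unique(orders, "order_id")
print(len(null_rows), duplicate_ids)  # prints: 1 {2}
```

In a DBT project you would declare these tests in a YAML file next to the model rather than write them by hand; the point is that the checks become versioned code that runs on every build.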
Alexey: I'm also wondering if there are some annoying parts in your work. For me, as a data scientist, I have to clean a lot of data. For many people, this is “Sigh. Again, I have to clean the data.” Now I know the solution – we just need to hire an analytics engineer. Are there such things for analytics engineers? Things that are annoying, but you have to do them anyway? For example, if I don't clean the data, then my models will be bad, so I have to do it. It's annoying, but inevitable. Is there anything like that in the job of an analytics engineer? (46:28)
Victoria: Yeah. I would say the most annoying part – I wouldn't call it extreme – is that we define a lot of guidelines, but we can't really make people follow them. Another thing that's not annoying, but not as rewarding as jobs I had in the past: before, I would automate a process or a report, and when my stakeholders were accountants, the month-end close would suddenly take one day instead of one week thanks to my work. It was like a party, right? I was a superhero. Now my stakeholders are technical people – they know a bit more, so they don't get surprised. That was a boost to your ego, maybe. But I also asked this question to my coworkers. One of them said the annoying thing is “working with another analytics engineer,” but he was joking. (47:11)
Alexey: How many analytics engineers do you have in your company? More than two? (48:29)
Victoria: Nine. (48:36)
Alexey: Nine. Okay. (48:38)
Victoria: Well, I can be very pushy, so I don't know. But my coworkers also talk a lot about their stakeholders, management, QA and QC – everything around that kind of thing. Sometimes you have to deal with things you don't expect from the backend – backend events. Even with stakeholders – suddenly you have to jump in to stop something that should have been planned. Also, we don't have much control over the raw data. We are very limited by the tool we have in that sense – we don't build custom pipelines, because we use this tool. (48:42)
Alexey: You have less control over the raw data than data engineers, right? Maybe data engineers can do more, and you're limited to the tools. (49:30)
Victoria: Another thing one coworker said: data quality issues are not our fault in most cases – unless we made a mistake, of course – but we are the ones most affected by them. We're the ones that take care of that kind of thing most of the time. When the quality of the data isn't good, we're the ones that immediately have to jump in and take on all this ad hoc work. (49:42)
Alexey: I guess that happens for many people, not just analytics engineers. Analysts also need to do ad hoc dashboards for something important. Data engineers will also need to take care of some data quality issue that comes up – they need to run and fix it. (50:11)
Alexey: Do you know what a data profile is? Because there is a question about data profiles. I don't know, maybe I can just read the question and you tell me if it makes sense for you. (50:30)
Victoria: I don’t know what a data profile is. But yeah, you can read it and let's see if we can answer it. (50:43)
Alexey: This person is curious about data documentation. “I've been planning to use DBT, but I noticed that it doesn't have a great data profile. How do you show the data profile?” (50:46)
Victoria: Oh yeah, I know what a data profile is. You see what the values look like – averages, distributions, things like that. Yeah, it's true, DBT doesn't deal with that very well. There are other tools that take care of it for now, until DBT catches up. For example, Datafold has something for data profiling, and Monte Carlo – I think they also have data profiling. There are others as well. Unfortunately DBT doesn't have that – at least not yet, I think. But that's also because it's not DBT's role; DBT supports you in the workflow. Documentation is not the same as data profiling – profiling helps you understand the data, so it could be considered part of documentation, but I wouldn't say it's the same thing. On documentation, DBT actually has a lot. It could have more, but it already has a lot. There's a schema YAML associated with every model and every source. There, you can write descriptions for your model, and also for every field if you want to. You can have tags on your models, and you can even add custom metadata fields. (50:57)
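The schema YAML Victoria describes looks roughly like this – a hypothetical sketch with made-up model, column, and tag names, following the documented DBT `schema.yml` format:

```yaml
# Hypothetical schema.yml: descriptions, tags, custom meta, and tests
# for one dbt model. All names are illustrative.
version: 2

models:
  - name: fct_orders
    description: "One row per completed order."
    tags: ["finance"]
    meta:
      owner: "analytics-engineering"
    columns:
      - name: order_id
        description: "Primary key of the orders fact table."
        tests:
          - unique
          - not_null
```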
Victoria: Let's say I want a metadata field for the area a model belongs to, or something like that. All of this goes into the DBT docs, which you can host yourself or use as part of DBT Cloud. There you can see everything and filter by model names, field names, etc. So it has all of that. It has a very small profiling part – the number of rows a table has, the table size, and maybe one more little thing – but it's not super detailed. It also shows the code and the dependencies, so it's very easy to see what else you're going to affect if you touch something.
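The per-column statistics that profiling tools like Datafold or Monte Carlo compute (and that DBT docs mostly lack) can be sketched in a few lines of Python. The sample data is invented for illustration:

```python
# A minimal sketch of column-level data profiling: row count, null
# count, distinct count, and min/max/average for numeric columns.
# Real tools compute far more (histograms, drift, freshness, etc.).

def profile_column(rows, column):
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    profile = {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }
    if numeric:  # numeric summary only when the column has numbers
        profile.update(
            min=min(numeric),
            max=max(numeric),
            avg=sum(numeric) / len(numeric),
        )
    return profile

rides = [{"price": 2.5}, {"price": 4.0}, {"price": None}, {"price": 2.5}]
print(profile_column(rides, "price"))
# {'rows': 4, 'nulls': 1, 'distinct': 2, 'min': 2.5, 'max': 4.0, 'avg': 3.0}
```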
Subscribe to our weekly newsletter and join our Slack.
We'll keep you informed about our events, articles, courses, and everything else happening in the Club.