
Big Data Engineer vs Data Scientist

Season 4, episode 3 of the DataTalks.Club podcast with Roksolana Diachuk


Transcript

The transcripts are edited for clarity, sometimes with AI. If you notice any incorrect information, let us know.

Alexey: Today we will talk about the difference between big data engineers and data scientists. We have a special guest today, Roksolana. Roksolana works as a big data engineer at Captify. Today, she will talk about the roles of data engineer and data scientist. Welcome. (1:52)

Roksolana: Thank you. (2:16)

Roksolana’s background

Alexey: Before we start with our main topic, let's talk a bit about your background. Can you tell us a bit about your career journey so far? (2:19)

Roksolana: Yeah, sure. I have a software engineering degree from one of Kiev’s universities. I have both Bachelor’s and Master's degrees. After that, I worked for some time as a backend engineer using mostly Java. At some point, I learned about data science and big data engineering. I had some time to make a decision about what exactly I wanted to do. Backend engineering just got a bit boring for me at some point, so I switched into big data engineering and learned the programming language Scala, which is currently my main programming language. For a few years, I worked at a company called Ciklum. It's quite famous in Ukraine, because it's quite a big company – one of the top five. It's an outsourcing company, which works with clients and does some research. I was working at the R&D department there. We had some internal research stuff and we also worked on some client projects. (2:28)

Roksolana: About two years ago, I joined Captify, which is a product company. They build products for the advertising industry. Actually the company is British – it's based in the UK. But a part of the engineering team is located in Kiev, where I'm based. I mostly work on the product parts, specifically in the big data engineering team.

Alexey: I’ve heard that in advertising there is so much data that big data engineering becomes really important. These companies get terabytes of data every day and they need to effectively process this data. Is that an accurate assessment? (3:54)

Roksolana: Yes. What makes Captify’s solutions unique is the data insights, which are obviously delivered through the help of the big data engineering team, which has different sources of data and different ways to transform this data and deliver it to the client. (4:10)

A Big Data Engineer’s typical day at work

Alexey: What do you do at work? What do you usually do as a big data engineer? (4:26)

Roksolana: My main responsibility is building data pipelines. Usually, it's in ETL format – extract, transform, load – this entails reading the data from some source, building transformations, doing some aggregations, and uploading the result somewhere so that it's available for other users. This can be a relational database, or often just data storage like HDFS or S3. Users can query the data using query engines like Impala. Usually we serve the data in such a way that the analyst team can run queries and build reports with it. That's my main responsibility. (4:32)
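For a concrete picture of the ETL flow described above, here is a minimal Spark/Scala sketch. The paths, column names, and aggregation are invented for illustration – this is not Captify's actual pipeline.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// A minimal ETL job: extract raw events, transform/aggregate, load the result
// where analysts can query it (e.g. through Impala). All names are hypothetical.
object DailyImpressionsEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-impressions-etl")
      .getOrCreate()

    // Extract: read raw events from object storage (S3 or HDFS)
    val raw = spark.read.parquet("s3a://raw-bucket/events/date=2021-05-01/")

    // Transform: basic cleaning and aggregation
    val daily = raw
      .filter(col("userId").isNotNull)
      .groupBy(col("campaignId"), col("country"))
      .agg(
        count("*").as("impressions"),
        countDistinct("userId").as("uniqueUsers")
      )

    // Load: write the result back to storage for downstream users
    daily.write
      .mode("overwrite")
      .parquet("s3a://curated-bucket/daily_impressions/date=2021-05-01/")

    spark.stop()
  }
}
```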

Roksolana: Aside from that, we do some internal library development and research. For example, right now I'm working a bit on introducing Delta Lake from Databricks. We also work on some features or do some small fixes because there's quite a bit of legacy code already. However, the product is quite stable.

Roksolana: Our department also creates new pipelines or upgrades the existing infrastructure. Sometimes we just rewrite some parts to increase their performance. Optimizing the performance of existing pipelines is also a big part of the job, especially during production incidents, or in cases where something is not performing as well as it's supposed to.

Alexey: So basically, your main job is to take some raw data that is coming from your product – data that users of your product generate. Your job is to take this data, convert it, and build data pipelines using ETLs so that analysts or other users who need to analyze the data can use tools like Impala to access this data with SQL queries and get some insights. Is that right? (6:03)

Roksolana: Yeah. (6:31)

Alexey: You also maintain this Impala to make it possible for them to do queries? (6:33)

Roksolana: That's partially the work of the infrastructure team. We mostly work on optimizing queries and jobs in such a way that other users will be able to run their queries. Otherwise our jobs would take up way too many resources – it would be impossible to work with them. But we set up our own Spark jobs and optimize them in terms of resources, in order to answer questions like “How many nodes in the cluster do we need? How do we want to scale that?” We decide whether we want to request fewer or more resources for our own jobs so that their performance is optimal, or whether we want to optimize them using Spark capabilities, which would just be improving the code base. (6:38)
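Resource tuning like this usually comes down to a handful of Spark settings. Below is a rough sketch – the numbers are placeholders rather than recommendations, and in practice they are often passed through spark-submit instead of being hard-coded:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative resource settings for a heavy aggregation job.
// Values are made up; real numbers depend on cluster size and data volume.
val spark = SparkSession.builder()
  .appName("tuned-aggregation-job")
  .config("spark.executor.instances", "10")       // how many executors to request
  .config("spark.executor.cores", "4")            // cores per executor
  .config("spark.executor.memory", "8g")          // memory per executor
  .config("spark.sql.shuffle.partitions", "400")  // parallelism of shuffles and joins
  .getOrCreate()
```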

Big Data Engineer’s tools

Alexey: So what kind of tools do you use? You mentioned Spark, Impala, and you also mentioned HDFS and S3. What else do you use for creating these pipelines? (7:18)

Roksolana: The other main ones, aside from AWS services like S3, are some services for the Spark setup – Spark is built using Lazarus for now. We have a bit of Kubernetes, which we mostly use for metrics. We also use monitoring and alerting systems like Prometheus and Grafana. We have some backend libraries for the data processing part, like Play in Scala, and testing libraries like ScalaTest. (7:30)

Alice discovers Kubernetes

Alexey: I remember you had an interesting talk with a funny name. Something about Alice discovering Kubernetes. What's the name of this talk? (8:04)

Roksolana: I have a series of those. The first one was about Alice learning the difference between functional programming and Kubernetes. (I'm blushing. [laughs]) (8:16)

Alexey: How did you come up with this idea about Alice? (8:28)

Roksolana: It’s hard for me to say. It's kind of a cumulative experience because before I started to speak at conferences, I visited tons of local and European conferences. I noticed that it's much more interesting to get invested into the story aside from the technical side. That's how I tried this idea with the first talk. After some time, I decided that I probably need to build more of those. That sparked popularity because people enjoyed the story side quite a lot as well. (8:31)

Alexey: It became like a brand. (9:03)

Roksolana: Sort of. (9:05)

The difference between big data engineers and usual data engineers

Alexey: Sometimes I get a bit confused. We have data engineers, we have big data engineers – is there any difference between these two? Or are they mostly synonyms – big data engineers and usual data engineers? (9:12)

Roksolana: If you're going ‘by the book’ or the way it's supposed to be – there's a difference in terms of how you process the data. Big data engineering requires somewhat different tools and skills, like heavy-load optimizations, distributed data processing, and a bit of software engineering on the backend. But in reality, I would say that a lot of companies conflate big data engineers with data engineers. It's possibly a bit confusing because of that. (9:27)

Alexey: So many companies just drop the ‘big’ part and just go with ‘data engineer’. Is there any difference in the tools they use? For example, you mentioned Spark, Impala, and things like that. From what I see, data engineers use tools that are maybe a bit different. Most of them use Spark as well, but I see more cloud-based things like Streams, Lambda functions and things like that. Or does it just depend on the company and there actually isn’t a big difference? (9:52)

Roksolana: I would say that it depends on the company. Some companies choose multiple cloud services and some companies build custom solutions. Because of that, data engineers might be more on the side of parsing data, which is considered to be more of a backend thing, or maybe they read from and write to the database. A more big data thing would be working with so-called big-data-specific formats like Avro, Parquet, and ProtoBuf, whereas backend engineers usually work with JSON or maybe a little bit with CSVs. (10:29)
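To make the format difference tangible, here is a small hypothetical example: reading a CSV (plain text, no embedded schema) and rewriting it as Parquet, which is columnar and carries its schema with it. The paths are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("formats-demo").getOrCreate()

// CSV/JSON: the schema has to be inferred or declared by hand
val events = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3a://raw-bucket/events.csv")

// Parquet: columnar, compressed, and the schema is stored with the data
events.write.mode("overwrite").parquet("s3a://curated-bucket/events_parquet/")

// Reading it back needs no schema inference
val reloaded = spark.read.parquet("s3a://curated-bucket/events_parquet/")
reloaded.printSchema()
```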

Alexey: Okay, so in the end, there usually isn’t much of a difference. Would you agree? (11:02)

Roksolana: Yeah. (11:05)

Data Engineers’ skills

Alexey: OK. We talked a bit about tools – we talked about Spark, Cloud, Scala. I think Python is quite popular in data engineering as well. If we're not talking about specific tools, but more fundamental skills, what kind of skills do people need to be able to do their job? (11:07)

Roksolana: I would say the most important one is coding skills. Often senior level engineers get into data engineering, which is quite logical, because they already have some experience on the backend side. Then they learn the Big Data stack and just understand how it works behind the scenes. So I would say it’s definitely important to have a great level of coding skills. (11:32)

Roksolana: Another one is working with databases, like writing queries and being able to optimize them. Usually, it's SQL databases, but sometimes No-SQL as well. So they need to be able to switch from one to the other. In my experience, when you join different companies, you sometimes have a totally different stack, or even different projects, so you just have to switch really quickly. But the skills I mentioned would be the main ones.

Roksolana: Something I also consider important, but is sometimes missing, is the infrastructure-side skills. This would be understanding the networking side. For example, being able to know what the racks are supposed to look like, how it works on the hardware side, why it's important to optimize some of our applications in certain ways. So they have to not be afraid of getting into the infrastructure side and setting something up because we do need to do that sometimes.

Alexey: What about distributed computing and things like this? Or is this included in ‘databases’? (12:51)

Roksolana: I would say that currently, frameworks are on such a high level that sometimes people don't need to understand that. But I would say that it’s quite a basic thing – university stuff. (13:01)

Alexey: I remember at university, we studied MapReduce. I don’t think I really used Hadoop outside of university. But it's still important to know these concepts, right? (13:13)

Roksolana: In the current environment, I would say it's important to understand them. But in terms of Hadoop, it's getting really outdated lately. When I was starting out, it was important to understand how it works – HDFS specifically. But right now a lot of people and a lot of companies just switch to something else instead of that. It's easier to maintain some cloud services, for example. (13:31)

Data scientist role from the perspective of a data engineer

Alexey: We talked about big data engineers – they take care of data preparation. Some data is generated by the users of our product and the data engineers take care of processing it and making it possible for analysts and other people to run their queries on it. What do data scientists do? (13:56)

Roksolana: The main part would be actually building machine learning models, but that's only one part of the machine learning model cycle, as it's called. They need to clean the data, prepare it to build features, create the models, and deploy them. Sometimes deployment is in their role, sometimes it's in someone else's, but they still need to have a good understanding of how it's supposed to be deployed so they are able to evaluate it further. It's called a cycle because it can repeat – if we need to fix some features, tune hyperparameters, or do something bigger, like switching to another solution if a particular one doesn't suit our needs. (14:20)
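As a rough illustration of that cycle (clean, build features, train, evaluate), here is a compressed sketch using Spark ML in Scala, just to keep one language across these examples. The column names, data path, and the choice of logistic regression are assumptions made for the sketch, not anything from the episode.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ml-cycle-sketch").getOrCreate()

// Clean: drop rows with missing values
val data = spark.read.parquet("s3a://curated-bucket/training_data/").na.drop()

// Features: assemble raw columns into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "sessions", "avgWatchTime"))
  .setOutputCol("features")

// Model: a simple classifier standing in for whatever the team actually uses
val lr = new LogisticRegression().setLabelCol("converted")

val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)
val model = new Pipeline().setStages(Array(assembler, lr)).fit(train)

// Evaluate: if the metric is poor, the cycle repeats - new features,
// tuned hyperparameters, or a different model altogether
val auc = new BinaryClassificationEvaluator()
  .setLabelCol("converted")
  .evaluate(model.transform(test))
println(s"AUC = $auc")
```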

Alexey: So cleaning the data is something that data scientists do? You cannot expect big data engineers to clean data for data scientists, right? (15:01)

Roksolana: I would say that it's a bit controversial because sometimes data engineers do take care of that. Sometimes data scientists need to still take care of this pre-processing step. It depends on the company as well, or just the way the pipeline is built. (15:09)

Alexey: In the end, you could say they both do cleaning, just different kinds of cleaning. Is that an accurate assessment? (15:23)

Roksolana: Yeah. (15:30)

Big data engineers’ tools vs data scientists’ tools

Alexey: I think I have an understanding in terms of tools. I think data scientists use quite a different set of tools compared to big data engineers, right? (15:32)

Roksolana: Yeah, sometimes the tools overlap – Spark, for example, because Spark is getting more and more popular for machine learning as well. Python is also heavily used by some big data engineers, so they overlap there too. Mostly, machine learning engineers or data scientists use specific libraries for specific model cases, whether it's a recommendation system, deployment, or computer vision. They may not get involved that much with infrastructure, however. The databases would be practically the same: the data engineers push the data there and the data scientists take it from there. (15:46)

Communication between big data engineers and data scientists

Alexey: What about in terms of the programming language? You use Scala. Do the data scientists you work with also use Scala, or do they use Python? (16:26)

Roksolana: They use Python. We communicate through files or data that we write to the database. Therefore, they don't need to go into our source code. But they also have software engineers / machine learning engineers who work with both Scala and Python, depending on the task. (16:38)

Alexey: Okay, so you produce a file – a parquet file, for example – and then the data scientists know how to read this file using Python, for example, and this is how you collaborate. Correct? (16:55)

Roksolana: Yes. (17:04)

Alexey: Actually, my next question was, “How do you work together?” I think we've partly answered that. The interface for your communication is the files that you create, right? You create some files and the data scientists consume these files. But how do you work together in general? Do you work in the same team? Do you work in different teams? What does the process look like? (17:07)

Roksolana: In my company, we work in different teams. We don't even have that much of a connection, because we don't really know what the data scientists do with the data later. We just deliver it the way they need it. For example, they can ask us to add some field or build some transformation, because it's easier to build it on our side, which reduces the heaviness of the load. In other projects that I've seen, or that my colleagues work on in different companies, sometimes there's a specific team with a dedicated data engineer, or multiple data engineers, and they work closely on each step of the pipeline. So it's different from the way my company works, although sometimes the workflow is the same as in my company. Throughout my career, I haven't really worked closely with data scientists – I learned about data science outside of work, just because I was interested in understanding how it works. Meanwhile, some of my colleagues from a previous workplace changed jobs and now constantly work closely with data scientists. So there can be different workflows in different companies. (17:28)

Alexey: So if a data scientist needs a new field, would they just go and create a JIRA ticket for you that says “Hey, I need this field.” (18:28)

Roksolana: Yeah. (18:36)

Alexey: Something like that, okay. So this is how you interact with data scientists. You sit maybe in different rooms, or, I mean… Now we don't have rooms – but you’re in different teams, right? And you communicate occasionally through JIRA, or maybe some common meetings that you have very frequently. (18:36)

Roksolana: Yeah. (18:53)

Example project walkthrough

Alexey: Okay. Maybe we can do a sort of walkthrough of a project. Let's say you want to start a new project and it involves machine learning. But you still need to process some data before the data scientists can work with it. Do you have some ideas for a project that we can discuss? (18:54)

Roksolana: We can take a recommendation system case. For example, if we have a Netflix-type website where there are different types of data – users’ information, history of their ratings of movies, or just their search history. We need to recommend some movies to these users. The part of data engineering would be to extract this data. They could probably build two pipelines – one streaming and one static/batch processing. With streaming, some new data is constantly coming in to update the model later. The batch pipeline would be used to store the history and it can be in different formats. Then we would build some kind of transformations to write it to either files like parquet or CSV for better processing for data scientists. Or we would write some part of the information. For example, we could store the data about movies or users in the database and have the streaming information of all their ratings. (19:18)

Roksolana: Then we could combine this together and the data scientists would be able to go through this data, which would be cleaned of duplicates and null values, which can get a bit confusing later. They can define which features will play the biggest role and build the machine learning model. The deployment part depends on who is responsible for it. Sometimes it's machine learning engineers who do it, sometimes it's the data scientists themselves, or it can even be the data engineers. So someone would deploy the model to production and then the data scientists could evaluate whether the solution works or not and deliver some results. Once it's deployed, there's a library that can connect the results to Tableau, or the recommendations can be written to the database. Then we can display them in Tableau in such a way that other users, like data analysts or business users, can visualize the data as charts and graphs. This allows the data scientists to present the solution and its results in a more graphic way.
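For the modeling step of this walkthrough, a common choice is collaborative filtering with ALS from Spark MLlib. The sketch below assumes a ratings table with userId, movieId, and rating columns prepared by the engineering side – the schema and paths are invented for illustration.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("movie-recommender-sketch").getOrCreate()

// Historical ratings prepared by the data engineering side: deduplicated, no nulls
val ratings = spark.read.parquet("s3a://curated-bucket/ratings/")
  .na.drop()
  .dropDuplicates("userId", "movieId")

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setColdStartStrategy("drop") // skip users/movies unseen at training time

val model = als.fit(ratings)

// Top-10 movie recommendations per user; these could be written to a database
// and visualized in Tableau, as described above
val recommendations = model.recommendForAllUsers(10)
recommendations.write.mode("overwrite").parquet("s3a://curated-bucket/recommendations/")
```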

Alexey: Okay, so we have a Netflix-like website where users can watch movies. Every time a user watches a movie, or leaves a rating about the movie, this data ends up in your data pipeline immediately? You said you have a stream of data – this would be the kind of data you mean – right? You also have some processing jobs that take this information, or this event, process this, and then put it into some sort of storage, like parquet or CSV. Then the data scientists take the data that you prepared from these events, and they can train a model with that. I assume if this thing runs for maybe a month, they can then take this one month of data and train their model. You said the data scientists also take care of deploying the model and there are some analysts who can look at the results. (21:33)

Alexey: So what kind of tools would you use here? For the streaming part, for example – would you use something like Kafka or Kinesis? Or what kind of tools would you use here?

Roksolana: I would probably use Flink. It's good for streaming – Spark is not as good for that – and it's easy to connect Spark and Flink. For example, you'd have a Spark pipeline for batch processing if you already have some history about the users – let's say the website was created some time ago, so we already have some historical data – and you'd stream the new data using Flink. Then Flink can write files in Parquet to S3 storage, while the Spark pipeline stores the historical data about users and movies in a database. Then we can combine the two in a library, or the data scientists could read from both and rely on the historical data to build better predictions. (22:51)
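Roksolana suggests Flink for the streaming leg. To keep these sketches in one language and framework, here is the same idea expressed with Spark Structured Streaming instead: reading rating events from Kafka and landing them as Parquet on S3. The topic name, schema, and paths are assumptions, and the job needs the spark-sql-kafka connector on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("ratings-stream").getOrCreate()

// Expected shape of a rating event arriving as JSON in Kafka
val ratingSchema = new StructType()
  .add("userId", LongType)
  .add("movieId", LongType)
  .add("rating", DoubleType)
  .add("timestamp", TimestampType)

val ratings = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("subscribe", "movie-ratings")
  .load()
  .select(from_json(col("value").cast("string"), ratingSchema).as("r"))
  .select("r.*")

// Continuously land the stream as Parquet files on S3 for the batch side to pick up
val query = ratings.writeStream
  .format("parquet")
  .option("path", "s3a://raw-bucket/ratings/")
  .option("checkpointLocation", "s3a://raw-bucket/checkpoints/ratings/")
  .start()

query.awaitTermination()
```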

Deployment

Alexey: Then they use their data science tools. At the end, you have a model which they deploy themselves? Or you can also help them deploy it? I imagine that if you have this stream of data, this is probably where you can put this model. Or not really? (23:40)

Roksolana: Yeah, actually data engineers can help with deployment. For example, there are tools like MLflow and Kubeflow, which I know other teams in my company are using. They have a dedicated person for that – a data engineer or machine learning engineer, depending on what the company calls this person. They work on the more infrastructural side of things: building an API, or working with Kubernetes or cloud services to provision the model service – probably building some container for it. It's more backend work than data science work. (23:57)

Alexey: And this person is in the data science team? They help the data scientists with this engineering work, right? (24:38)

Roksolana: Yeah, that's kind of the way it works in my company. (24:46)

How much should data scientists know about data engineering?

Alexey: Okay. We have a question and it's quite interesting. It’s related to what we're talking about now. “How much should data scientists know about data engineering and what kind of skills do they need to have?” You mentioned that you have a dedicated engineer in the team. Does this engineer take care of all the engineering stuff or do data scientists still need to have basic knowledge of data engineering? (24:49)

Roksolana: In my company, they're called machine learning engineers because they sometimes deploy their own models, and they work with data. In other teams, or in other companies, the data scientists only work with machine learning models – but I would say that's more of an ideal case, because startups or new projects usually have this so-called ‘full-stack data scientist’, where one person does everything, or at least both the model and deployment parts. I would say it's important for data scientists to at least understand why some pipelines are built a particular way. It helps with estimating time – understanding how long it might take to get the data, or how issues on the data engineering side can influence them. But in general, I would say these skills are only necessary if the person has to do the whole pipeline, which also happens sometimes. (25:20)

Alexey: Okay, so your data scientists are engineers, basically. If they need to, they can go and figure out how your pipelines work. Is that right? (26:21)

Roksolana: Yeah. (26:31)

How can data scientists acquire knowledge about data pipelines?

Alexey: But, in general, you said that it's a good idea for data scientists to know how the pipelines are built. Do you know how they can acquire this knowledge? How can they learn about how data pipelines are built if you're working in different teams? The data scientists work in one team and you work in another team – how can data scientists gain this knowledge? (26:32)

Roksolana: I would say that there could be some knowledge-sharing sessions inside the company. We have, for example, internal engineering meetups, where each team can just talk about the technology that we use and how we build some solutions. Aside from that, if they are interested in the topic, they can also learn through resources like books, courses, or lectures. I think that there are quite a lot of resources lately on big data engineering. It's becoming more popular. (26:57)

Should data engineers become more like data scientists?

Alexey: You mentioned that in your free time, you work a bit with data science, just to learn a bit of machine learning. We have a question from Prem. The question is “Should data engineers gradually try to transition to become more like data scientists?” What are your thoughts on this? Should data engineers get to know more data science to be better data engineers? (27:30)

Roksolana: I would say that it's good to know more about how the machine learning cycle works. For example, I don't really work with the internals of machine learning models. I don't get into that, but I know each step. Recently, I started learning how to deploy models – how to build the whole pipeline, so that the data scientists would only need to build the machine learning model. I would say it's important to understand each step and all the inputs and outputs of those steps. It's not very important for data engineers to understand how the model actually works. We're more on the software engineering side here, like “What is the input? What is the output?” and “What happens next?” That kind of thing. (28:02)

Alexey: So you don't need to go into the details of the exact mathematics or even what kind of model they use inside. But you do need to know, “Okay, this is a model. This is the kind of input it receives. This is the kind of output it produces.” And you need to know what to actually do with this. “How do you package this thing? How do you deploy this thing?” And things like this, right? (28:48)

Roksolana: Yeah, I don't get much into algorithms. But it's important to understand what the model actually does, because it influences the pipeline – this knowledge actually helps you predict what can happen inside the pipeline, or in cases where something unexpected happens. (29:13)

Alexey: Do you think it would be beneficial for data engineers to actually learn these internals? Maybe not in great detail, but at least how random forest works, for example, or how logistic regression works. Or for data engineers, it's not really that important? (29:29)

Roksolana: I would say that it's not really important. It's more a matter of the person's interests. I know some data engineers who are interested in machine learning and they try it out. I think it's helpful if they work closely with data scientists. It's also kind of nice to have a different perspective: while a lot of data scientists have more mathematical backgrounds and analytical skills, data engineers come from more software engineering backgrounds and have the corresponding point of view, which can also be helpful when people collaborate on building such solutions. (29:48)

Alexey: I guess, if we reverse it, the same is true for data scientists. It would also be beneficial for them to know the data engineering part of things. Maybe not just “Okay, this is how I use this function in PySpark.” But maybe a bit more detail, right? It's not required for them to do this, but it would be beneficial for them to get some idea of how it works underneath, right? (30:21)

Roksolana: Yeah. I agree with you here. (30:51)

Advice for analysts and scientists for transitioning into engineering

Alexey: Now, we have a related question from Steve, “What advice would you give to analysts or data scientists who would like to transition into data engineering?” (30:53)

Roksolana: I would say that the most important thing is to work on having a great level of coding skills. When I was mentoring a data analyst, I noticed that she didn't have the same background in software engineering, and therefore she sometimes couldn't understand how certain algorithms work – things that seemed quite basic to me. From my perspective, I didn't really understand that at first, but she had more of a mathematical background. So I think it's very important to get more experience and learn on that side. Aside from that, I'd say databases, and maybe a bit of the infrastructure side as well, if the person would like to get involved in deployment and setting up jobs. (31:05)

Alexey: “Algorithms” is quite broad. What kind of algorithms do you think will be most beneficial to know? Just basic data structures and algorithms, or? (31:53)

Roksolana: Yeah, exactly. Some basic data structures. It even comes up in code reviews – it can turn out that someone could have written something in a more performant way just by applying this knowledge of algorithms and data structures. (32:04)

Alexey: It's like you said – at least know how a hashmap works, and things like this. Are there any other things that would be beneficial to know from these algorithms and data structures? Would they need to go into graphs, for example? Or trees? Or not at the beginning? (32:17)

Roksolana: It depends on what they're working with. For example, I got into graphs because I was working with graph databases – NoSQL databases. Before that, you wouldn't need to know much about that. Also, it's not used that often – I haven't noticed other companies using graph databases much. Mostly it's about the complexity of algorithms. It's not just knowing them by heart, but rather understanding why it's better to use this data structure or that algorithm. It also depends a lot on how the programming language works. In some programming languages it's different – some data structures are named differently, some are implemented differently. It's important to know which one is better to use. (32:43)

Alexey: Off the top of your head, do you have a list of data structures that you “must know” to get started? In addition to the lists, sets, and hashmaps that we mentioned. (33:30)

Roksolana: Coming from Scala – we have sequences at least, and arrays. But in functional programming we don't use arrays much, so it's also different depending on the programming language you use. I would say it's specific to the language. In Python they have different names for the data structures. (33:51)

Alexey: I guess the sequence you mentioned is more like a linked list, right? (34:11)

Roksolana: Yeah. (34:16)

Alexey: Okay. Then you have a different type of list, which is an array, right? When it comes to sets, there are different ways of implementing a set. You can have a tree-based set, or you can have a hash-based set. Do you think that this is also good knowledge to have? Or not really – you just need to know that a set is faster than a list? (34:18)

Roksolana: I would say that just knowing that a set is faster than a list is enough. In some cases, instead of searching through a list, you could just use a set. Sometimes people do that. (34:43)
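A tiny Scala illustration of the point: membership checks on a Set are effectively constant time, while a List has to be scanned element by element.

```scala
// Build the same data as a List and as a Set
val watchedList: List[Int] = (1 to 1000000).toList
val watchedSet: Set[Int]   = watchedList.toSet

// O(n): walks the list until it finds the element (or reaches the end)
val inList = watchedList.contains(999999)

// Effectively O(1): a hash lookup
val inSet = watchedSet.contains(999999)

println(s"list: $inList, set: $inSet")
```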

Database recommendations

Alexey: When it comes to databases – what kind of databases would you recommend to learn? Just pick any, or? (34:53)

Roksolana: I think that PostgreSQL is probably the easiest one to start with. Or MySQL – there's not that much difference between the two. I would say that it's also nice to know at least one NoSQL database, just to understand how it's different from SQL databases. It then becomes easier to switch to something else. (35:03)

Alexey: Which one would you recommend? Mongo or? (35:21)

Roksolana: I guess Mongo is easier to start. I had a lot of fun with Neo4j, but it's very specific. Not that many people use it. (35:26)

Alexey: Neo4j is a graph database, right? Then there's Mongo – I think that type is called a document database. There are so many NoSQL databases – document databases like Mongo and CouchDB, and you also have key-value databases. Maybe it's a good idea to try to play with at least one, or? (35:32)

Roksolana: Yeah. Just to understand why it's called no-SQL. Because when you just hear about them, it's hard to understand and map it to reality. (35:55)

Data engineering and infrastructure

Alexey: Now from the infrastructure side of things. I think you mentioned that data engineers need to know networking and things like that. But when it comes to data scientists who want to transition, they maybe don't need to get into the nitty-gritty of networking and know the famous “OSI stack”? Maybe they don't need to know all the seven layers by heart. (36:07)

Alexey: But what kind of things do they need to know from the infrastructure point of view? Things that they can learn to know a bit of data engineering and to be successful in that.

Roksolana: I think Docker is a must, because data scientists also use Docker, as far as I know, at least for testing or deploying. Also cloud services, which help you understand this abstraction on top of the hardware – cloud services like AWS Lambda or something like that. I would also recommend Kubernetes, but it has a relatively steep learning curve, so it's not for everyone when starting out. But Kubernetes is necessary to know – it's becoming more and more popular – so I would recommend trying to learn it now. (36:45)

Alexey: What's the easiest way to try out Kubernetes? (37:25)

Roksolana: I think just installing it and trying it out. Even having a small cluster with one node, setting something up, and seeing how it works. For me, it's most interesting if it's connected to my work. When I was just setting up databases, it was kind of interesting, but it's not really practical if you don't have to use that in your work. But in data science or data engineering, sometimes you have to deploy pipelines using Kubernetes. Trying that out can help you actually learn both the framework and the pipeline, and it gives a better understanding of how it works behind the scenes. For example, Spark on Kubernetes is quite useful for understanding later how Spark would work with other providers. (37:30)

Alexey: If a data scientist or analyst wants to transition to data engineering – first, they need to know the basic data structures: lists, sets, maps, etc. Then they need to pick up one relational database, PostgreSQL or MySQL. Then they need to pick one no-SQL database, for example, Mongo – and play around with it. Also, they need to try to understand what kind of categories there are in no-SQL databases. We mentioned document, key value, etc. Then from the infrastructure side, they need Docker, Cloud, and maybe to play around with Kubernetes. Even though there is a high learning curve, it's still useful to try it. As you said, you can just install it locally and play with it. Right? (38:18)

Roksolana: Yeah. (39:08)

Monitoring and alerts

Alexey: Another question we have is, “How do you deal with data quality checks? What kind of monitoring do you recommend setting up?” (39:09)

Roksolana: I would say you need to set up monitoring of how the data flows through the pipelines. For example, monitoring that “Today, there is no data in this pipeline” or “There is too much data – more than expected.” In other words, monitoring spikes in the data. This is also important because it helps to optimize pipelines in the future. It would also be good to monitor schema changes – at least those, because monitoring changes in the data itself is way too complicated. Monitoring schema changes, though, is quite helpful. (39:28)

Alexey: Schema changes? What’s that? (40:02)

Roksolana: For example, some data formats – like Avro, Parquet, or ProtoBuf – either have separate files with the schema, or are defined in such a way that you can get the schema separately from the data and track whether it changed or not. (40:04)
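One way to monitor this, sketched roughly: compare the schema Spark sees today against an expected schema stored alongside the pipeline config, and alert on any difference. The paths are hypothetical; a real setup might use a schema registry instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, StructType}

val spark = SparkSession.builder().appName("schema-check").getOrCreate()

// Expected schema, stored for example as Spark's JSON schema representation
val expectedJson = spark.read
  .textFile("s3a://config-bucket/events_schema.json")
  .collect()
  .mkString
val expected = DataType.fromJson(expectedJson).asInstanceOf[StructType]

// Actual schema of today's data
val actual = spark.read.parquet("s3a://raw-bucket/events/date=2021-05-01/").schema

if (actual != expected) {
  // In practice this would fire an alert (e.g. via Prometheus) rather than print
  println(s"Schema drift detected!\nExpected:\n${expected.treeString}\nActual:\n${actual.treeString}")
}
```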

Alexey: Like if someone renamed a field, for example, or removed it? (40:22)

Roksolana: Yeah. Or if someone renamed something, which also happens. We had a case like that and it was simply not cool. (40:25)

Alexey: If that happens – if the schema changes – if somebody renamed something, what do you do? Do you just send an alert and abort your job? (40:33)

Roksolana: In that case, we didn't have such a setup, because those were CSV files, which just have a header. We only noticed that the pipeline wasn't returning the same results as it used to – it wasn't the same amount of data and there were some empty values. It turned out the bug was because some field had been renamed. (40:41)

Do data scientists need to set up monitoring?

Alexey: I guess this kind of monitoring is more on the data engineers. Do data scientists also need to do a bit of monitoring for their jobs? (41:00)

Roksolana: I think they can just do monitoring in terms of whether the model returns results and whether those results are what's expected. Another part would be monitoring in terms of resources – something like that would be more on the data engineers or machine learning engineers. At least that's the way it works in the team where the data scientists are in my company. (41:11)

Alexey: Do you think the tools you would use are the same? That data scientists and data engineers would use the same tools? Or are they different – the tools for monitoring quality? (41:34)

Roksolana: I think it could be the same. For example, some alerts or metrics gathering, and something visual, like dashboards with Grafana or other tools, or Tableau. I think it's quite universal in this case. (41:45)

Do data engineers depend on data scientists for something?

Alexey: We have an interesting question. As you said, “Data scientists depend on data engineers.” Let's say a data scientist needs a new field – they would open a Jira ticket for data engineers to do this. Is there anything for which data engineers depend on data scientists? (42:04)

Roksolana: I think it depends on the workflow. For example, if we need to get the results of the models in some way, we will definitely depend on data scientists. Right now we depend either on the source of data, which is third-party clients, or on the database that another team pushes the data through – so we already depend on another team. But on the data scientists' side? Definitely, if we need to get data from them. For example, in the recommendations case we described – if we want to build some reports from it, we'd first need to do some transformations for other people, such as analysts, and that would depend on the result of the model. (42:26)

Alexey: But I guess what happens more often is that data scientists depend on data engineers. Because to train a model, you need to have some data and if you don't have data engineers, you don't have data. Otherwise data scientists have to somehow get this data. Because of this dependence on data, it happens more often that the data scientists depend on data engineers. Would you agree? (43:08)

Roksolana: Yeah, exactly. (43:34)

Data documentation

Alexey: We have a question about data documentation, “How do you maintain data documentation? Do you think it's important to have it and what would be your recommendation there?” (43:37)

Roksolana: I would say it's important in general to have it. But a lot of people don't like to create documentation, and I've noticed it's often missing in many companies, especially outsourcing companies, where you have a fixed-time project or just don't have time for documentation – which can also happen. In general, I would say that it's important to have at least the schemas documented. We do that in terms of the fields, like “What is this field? What does this table contain? What kind of data is stored here?” There are example values for each field, which is helpful if you see anomalies there, and it's also helpful for testing and having expectations for these values. So I would say it's important to document the schema, plus documentation of what happens inside the pipeline. (43:50)

Documentation tools

Alexey: What tools do you use for that? Would you just create an Excel spreadsheet with that or maybe use a specialized tool for this? (44:41)

Roksolana: For schema documentation, we use Hive SQL file descriptions. So this would be things like the description of each field, the name, and the type. As for the pipeline documentation, it's in Confluence – just documents with some graphs or schemas, where necessary. (44:54)
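Assuming these are Hive-style table definitions, field-level documentation can live right in the DDL as COMMENT clauses, so the schema doubles as documentation. The table and field names below are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("schema-docs")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

spark.sql("""
  CREATE TABLE IF NOT EXISTS analytics.daily_impressions (
    campaign_id   BIGINT COMMENT 'Internal campaign identifier, e.g. 1042',
    country       STRING COMMENT 'ISO 3166-1 alpha-2 country code, e.g. GB',
    impressions   BIGINT COMMENT 'Number of ad impressions for the day',
    unique_users  BIGINT COMMENT 'Distinct users who saw the campaign'
  )
  COMMENT 'Daily aggregated impressions produced by the ETL pipeline'
  STORED AS PARQUET
""")

// The comments come back with DESCRIBE, so analysts can read them from Hive or Impala
spark.sql("DESCRIBE analytics.daily_impressions").show(truncate = false)
```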

Alexey: In these Hive SQL files, I think this documentation is optional. But you say, “OK, it's our standard that we put documentation there.” Is that right? And then you generate documentation from these files. (45:13)

Roksolana: Yeah, it's our decision to build this data governance solution so that we have a better understanding of what kind of data we process and what kind of data we have – instead of multiple people having multiple tables and no one knowing what's in them. (45:28)

Alexey: I assume the data scientists also produce a lot of data, or rather their models produce data. They also need to document this data, right? Do they use the same tools for that? (45:46)

Roksolana: Yeah. They just use the same type description or documentation in Confluence. (45:59)

Alexey: So you have a central place for documentation for your data and the data that data scientists produce? (46:05)

Roksolana: Yeah. (46:13)

How much data engineering should a data scientist know?

Alexey: We have a question from Akshot. “How much data engineering should one ideally know?” I think we've covered that a bit – we talked about how much data engineering data scientists ideally need to know. But we were talking more about the case where a data scientist wants to transition into data engineering. To be able to work successfully as data scientists, what kind of data engineering skills should they have? (46:14)

Roksolana: I would say that good coding skills are also important. I know that some data scientists are more on the mathematical side, and they are more interested in building algorithms than in writing code. It actually influences the quality of what they create. They can either build everything in one notebook, which would be hard to deploy in any meaningful way, or they can build the whole solution with libraries and classes in Python, maybe with object-oriented programming – which would be more of a software-engineering way to do things. I think that data scientists need to know databases as well, because they need to read the data or write some results there. (46:58)

Alexey: Okay, so basically, “improve your software engineering skills.” (47:39)

Roksolana: Yeah. (47:43)

Alexey: I think the trend that I see now in most companies is that data scientists are required to be good developers as well. Maybe they don't need to be as good as software engineers, but they need to be decent with coding. (47:46)

Alexey: There's a comment about getting started with Kubernetes. There is a good resource for that called Katacoda. I think I saw it before. Have you seen it? Katacoda?

Roksolana: Yeah, I tried it once. In the beginning, it's quite useful. You can just try out different commands and see what happens. (48:18)

Trying out Kubernetes

Alexey: I think I saw one with Kubernetes and I think one with Kubeflow. It's pretty cool. They just set up a local Kubernetes for you in a browser. And then in this browser, you get a terminal – like a Linux terminal. You have kubectl there with this Kubernetes cluster and they basically tell you what to do, like “Execute this command. Execute that command.” And then you get this feeling of satisfaction from it. (48:26)

Roksolana: Also, Google Cloud sometimes has some codelabs. They have documentation, for example, for Kubeflow, where you can try things out at each step. Some of them are free, so it's possible for anyone to use them. Databricks has trainings as well, but they are usually paid. Sometimes they allow people to use some of them for free with conference attendance – lately, the Spark Summit has been free, and people who attend the summit can use some of these trainings for free. (48:57)

Choosing a path for graduates – engineering or science?

Alexey: There is a question that I wanted to ask you – I just remembered it. Let's say I'm graduating from a university where I studied computer science. Now I need to make a choice: what kind of position do I want to take? Do I want to do data science or data engineering? Which path would you recommend and why? Maybe you can also suggest how people can find out what they're more interested in? (49:29)

Roksolana: Yeah, exactly. I would suggest first finding out what's more interesting to the person. I personally chose big data engineering because, for me, data science was a bit ‘non-deterministic’ – I think software engineers will understand me. It's sometimes hard to explain why a machine learning model does what it does and why the results come out the way they do. Big data engineering is more like software engineering – it's more deterministic. I was just interested in software engineering in general; I wanted to work with different tools and different tasks. So big data engineering is a great path for people who are getting bored of backend engineering, or people can go straight from university, learn more about big data, and go into big data engineering. If the person is more interested in building algorithms, in mathematics, or in machine learning – or just in building “fashionable” things like computer vision or deep learning – then it would be a good idea to start with data science. (50:13)

Project recommendations to see if you like data engineering

Alexey: Is there any project you would recommend that people try in order to understand if they like data engineering or not? It sounds cool, right? If you look, people say, “Hey, data engineers are so important. Data scientists really depend on them.” And many people think, “Okay, it's cool. I want to become a data engineer.” But maybe they don't really understand what this work requires. Do you have any ideas about maybe a small project that they can do to try to find out if they like doing this kind of stuff or not? (51:16)

Roksolana: I would say even building a simple word count and doing some transformations on it would be a good start. You could build it in a more complicated way, trying to improve it with each step and building a pipeline around it. There are so many ways to do that – either using HDFS with standard MapReduce, or Spark – and then try to optimize some of the algorithms. There's a project that I enjoyed working on when I was at university, working with streaming data. For example, Twitter has an open API. You could just read the data and build some analytics on top of that – for example, “How many users tweet about this?” “How many users tweet from this location or another one?” It's quite fun in terms of seeing the results and getting to know how some frameworks work. You can deploy it or write the results somewhere, which is a great practical way to see how it works. (51:49)
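The word count starter she mentions looks roughly like this in Spark/Scala – the input path is a placeholder, and any large text file will do.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("word-count")
      .master("local[*]") // run locally while experimenting
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile("data/big_article.txt")      // placeholder input
      .flatMap(_.toLowerCase.split("\\W+"))  // split lines into words
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)                    // count occurrences of each word
      .sortBy(_._2, ascending = false)

    counts.take(20).foreach { case (word, n) => println(s"$word: $n") }
    spark.stop()
  }
}
```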

Alexey: About this Twitter project – you have a stream of data from Twitter, you build a streaming pipeline to process this data, and then display it in real time on some sort of dashboard? (52:43)

Roksolana: Yeah, I used Elasticsearch, which was quite easy to connect to Spark. You just dump the data into Elasticsearch and have it displayed in Kibana. This is easy to try and also to visualize results of work quickly. (52:56)

Alexey: That’s a cool project. For the word count, you just take a text document and you count how many times each word appears, right? Do you know any documents that people can take to do this? (53:11)

Dataset sources

Roksolana: I think there are plenty of data banks with some texts. A typical example is some Shakespeare works, but maybe it's more interesting to take some big article, so that you have more data to process. Or you could even take some scientific article as well – get some words from there and analyze that. (53:28)

Alexey: Well, Shakespeare, I think, is the opposite of big data, right? How many kilobytes is that? Even though he might have been a prolific writer, it's not really big data scale. One thing I know people could do is process Wikipedia. It's also a bit challenging because Wikipedia gives you just one big XML file that you need to figure out how to process. But I guess this is the kind of thing that a big data engineer would need to enjoy doing. If you enjoy figuring out “Okay, I have this big XML file. How do I actually read it?” then you're probably into data engineering. (53:52)

Alexey: Then there is a data set – maybe you’ve heard of it – it’s called CommonCrawl. I think they just get a copy of the internet. There is a crawler and then they go on the internet and save all the pages they see. You can just download these pages and every month they generate terabytes, or maybe even petabytes, of data.

Roksolana: Also, there are sometimes scientific open APIs. I found a NASA API about their discoveries and some results from their research. You can even find some data from Mars, as a topic. (55:05)

Alexey: Data from Mars? So they publish all the data from Mars? (55:23)

Roksolana: I think partially. Not all of the data, but they publish some data from various researches. (55:27)

Alexey: Wow, that's cool. Do you know if they publish it in real time? No? (55:34)

Roksolana: Yeah, I think some of them are streaming, because at the time I was looking into streaming APIs. (55:39)

Alexey: How cool would it be to build a streaming pipeline to actually get data from Mars and process it? Maybe it's not super high volume, but it’s quite cool. (55:45)

Roksolana: Also, it's possible to just parse some social media sites. Instagram, for example, has an API. So you could parse some information about the posts or from Facebook. (55:57)

Pre-built tools vs hiring a data engineer

Alexey: Right. So in social media, there is so much data to take advantage of. We still have some time and we have quite a few questions. (56:08)

Alexey: One question is, “What are your thoughts on companies that use tools for ETL instead of hiring data engineers? What are the advantages of outsourcing this to such companies and pre-built tools, versus hiring data engineers and building your own data pipelines?”

Roksolana: I can say that I have some opinions about that. I think it's just one way to solve an issue. I've noticed this approach with startups or projects that have a fixed amount of time and need to deliver something really quickly and get results really quickly as well. In such cases the pre-built approach works really well, because you can just assemble an AWS or Google pipeline out of all their services, like a constructor kit. That's it. I would say that it doesn't really scale well, however. After some time, there's always something that needs to be customizable in each product. Sometimes you have new features of a product, new parts of a product, and then you can't just rely on the cloud services. So this approach usually works at smaller scales. That's my opinion. (56:42)

Alexey: I remember there's a tool from Microsoft that has this drag-and-drop feature. You just drag-and-drop these boxes, which are components, and you connect them with arrows. To add a deduplication step, for example, you just drag and drop ‘deduplication’. Then you also can do this with fuzzy lookup. You can build pipelines that are quite complex there. I guess at some point they become less flexible, right? (57:33)

Roksolana: Yeah. (58:03)

Challenges in the work of a data engineer

Alexey: A question from Less, “What are the most challenging tasks for data engineers in their daily work?” (58:05)

Roksolana: One of those is deduplication of data. Sometimes the conditions for how we want to track duplicates can be quite complex. Another one, which I often have to do, is historical data reprocessing. It's usually for batch jobs. For example, there was some mistake in the data one month ago, or something changed, and you need to go back in time, remove that data, and reprocess it. It's complicated because it's always very custom compared to just running a job. You have to set up date limits, decide how far back you want to go, and account for how the solution has changed. Then you also have to check that it actually works well and that you won't need to do it again. It's very resource-consuming because it usually covers big periods of time – at least a week, sometimes months or more. (58:16)

Alexey: Yeah. This historical data reprocessing – do you have a way of dealing with this? Like a standard approach? Or you do something different every time? (59:15)

Roksolana: Yeah, there is a standard approach, but I would say that it currently has a lot of manual steps. We are on the path of trying to eliminate that. Some of those steps require work from the infrastructure side, because there are some systems where you need to delete the data. Your infrastructure team hates when this happens, because it's quite risky to delete data in production systems and rewrite it. (59:29)

Alexey: Yeah, we had a talk recently in DataTalks.Club about that. I think it was called Data Historization, which allows us to actually do this. For those who are interested, you can check it out. But it's indeed a complex topic. I remember there was one slide with a diagram that describes the process, and it's quite complicated. (59:57)

Roksolana: Delta Lake now helps with Spark for tracking versions of data, which we're also trying to use to automate this and see how the data changes. You can kind of travel back in time to different versions of the data there. So it's really useful. (1:00:25)

Alexey: Maybe one last question for you – “Do you know any course about data engineering that could be useful for data scientists?” (1:00:40)

Roksolana: I would say that there is one specialization I would recommend – Big Data on Coursera – it's just called the Big Data Specialization. That's the one I started from. It was really great because the first course in the specialization is purely theoretical. So if a person doesn't want to go into specific practical tasks, they can just go through the first course, where all the tools and the kinds of tasks that data engineers do are explained – they explain everything. It gives a great understanding and perspective of how everything works. The next courses are more practical, with specific tools and specific tasks to solve. (1:00:52)

Alexey: Thank you. Do you have any last words? (1:01:31)

Roksolana: Just in general, I also would suggest reading books as well. I find that it gives a deeper understanding sometimes than courses or lectures. There are a lot of books on Spark or Big Data in general, which cover quite a lot. (1:01:36)

Alexey: Any specific book that you would recommend? (1:01:54)

Roksolana: Specifically, I enjoyed Spark books, like High Performance Spark or Learning Spark. On big data there’s this book by… I don’t remember the name – it's called Data Intensive Workloads, something like that. (1:01:56)

Alexey: Building Data Intensive Applications, right? (1:02:14)

Roksolana: Yeah. Something like that. Yeah. (1:02:15)

Alexey: It has a pig on the cover. (1:02:16)

Roksolana: Yeah. There are a lot of great books on that. (1:02:19)

Alexey: The author is Martin Kleppmann. Designing Data-Intensive Applications. That's a good book. (1:02:22)

Alexey: How can people find you?

Roksolana: I am on Twitter, LinkedIn. Also, you can find my talks on YouTube, because most of them are recorded. They’re about big data or about Kubernetes as well. (1:02:34)

Alexey: And Alice. (1:02:44)

Roksolana: Yeah. [laughs] (1:02:45)

Alexey: Alice and Kubernetes. (1:02:47)

Alexey: Yeah. Thanks a lot for joining us today and for sharing your knowledge with us. And thanks everyone for joining us and asking questions. I just want to remind you that tomorrow we have the last day of this conference – we will talk about building a machine learning startup, so don't forget to check it out. I guess that's all. Thanks! (1:02:48)

Roksolana: Thank you. I enjoyed it. (1:02:17)

Alexey: Yes, me too. Have a great evening. (1:03:19)
