DataTalks.Club

The ABC’s of Data Science

Season 2, episode 7 of the DataTalks.Club podcast with Danny Ma

Transcript

The transcripts are edited for clarity, sometimes with AI. If you notice any incorrect information, let us know.

Becoming a meme-generating machine

Alexey: Before we start, I would like to ask you about your background. Anyone who follows you on LinkedIn knows that you're a meme-generating machine. Every time I open my LinkedIn feed, I see a meme from you. And it's awesome. I want to know, how do you get inspiration for them? It must be difficult to create so many memes. How do you do that? (49.0)

Danny: The inspiration for my memes mostly comes from things that I've experienced in real life. Whether it's someone doing some dodgy practices with data science, or they're looking to try and trick some management by telling them something, or whatever it is. I try and draw on my own experiences to poke a little bit of fun. Everyone knows that the funniest memes are always based on truth. Usually, I need to think about things that I've done in the past, or I've seen people do incorrectly in data science. And things just come to my brain. (1:21)

Alexey: You must have seen quite a few things that people did incorrectly. (1:58)

Danny: I believe so. (2:02)

Alexey: If you came here because you saw a post from Danny, you probably saw the video that he made. I still cannot unsee this video. I will include a link in the comments. Be careful. Once you look at this, this image will stay in your head. It will not go away for a couple of days. Believe me. How did you learn to do these things? This is not something that an average data scientist can do. (2:04)

Danny: I'd call it “creative video editing”. I spent too much time on social media consuming lots of different content. I feel that I've developed a strong intuition for what is interesting, or at least what is different. I just try and do the video editing to make it happen. I'm still very junior when it comes to video editing. I have a friend who actually works for the Australian Broadcasting Company. He always tells me “I should give you some lessons on how to edit properly.” I have no idea what I'm doing. (2:45)

Alexey: This is amazing. Was it Sesame Street? This is pretty much on topic to our discussion today. (3:21)

Danny’s background

Alexey: On a more serious note, maybe you can tell us about your career, apart from the most useful skills you have that we already talked about. It would be quite interesting to know, what was your career journey so far? (3:33)

Danny: My name is Danny Ma. I would call myself a “recovering data scientist”. I've spent the past five years working in data science. But I actually started off my career working in data analytics, specifically in campaign analytics. I was the one trying to identify which customers to send different emails, based on their shopping behavior. I did that for two years. It was my first job out of university. For university, I took an undergrad in commerce, majoring in actuarial studies. That's the maths and the statistics behind the insurance models and financial models. I don't think I actually remember much from my university days. I learned almost everything on the job instead. When I was at university, I was too busy partying or doing something crazy. (3:51)

Danny: As part of my studies, I managed to get a few internships working for insurance companies. This was in the actuarial department where they would have lots of crazy Excel. They used a lot of SAS — statistical and analysis software — to model the claims and different things like that within the insurance world. They didn't let me anywhere near it. I didn’t learn much when I was doing my internships. But I got exposure to some of the data stuff there. I also realized that I didn't want to work within the insurance space. I wanted to do something a little bit different, maybe like a non-traditional use of data. (4:56)

Danny: That drew me towards my first role outside of university, which was with Quantium, a data consulting firm within Australia. There I did campaign and retail analytics as well. Think of consumer packaged goods and different things that you can buy in a supermarket. I did that for two years. I stagnated a little bit, and I wanted to find a more challenging thing to do. Around a year and a half into my role there, I started waking up at 5am to learn Python. I had very little programming experience before that. I dabbled with a little bit of R when I was at university.

Alexey: What did you use as an analyst? SQL? (6:32)

Danny: Most of it was SQL and Excel. I used SAS because that was the only thing they gave us in that company. I was automating a lot of SQL jobs through whatever tools I could get my hands on. But my SQL code back then was very, very bad. It must have been not optimized at all. I had no idea what I was doing. (6:35)

Danny: After I spent my time learning a bit of Python, I asked some of my mentors, “What do I do? I don't feel like doing this campaign stuff much longer”. They told me “You should do data science”. Back then I had no idea what data science was. I googled “data science”. And I started with Kaggle, trying to do different things. I made almost every mistake that every new data scientist would have made on their career and their journey.

Danny: Then I applied for a few data science roles. I didn't have any experience and I got rejected for all of them. I eventually made it into a role with the bank — one of Australia's largest banks within the digital area. I was doing more along the same lines of what I was doing before but in a different space. Think of all the campaign targeting and looking at the different click funnels and different things like that on their website.

Alexey: The same thing, but this time with Python? (8:10)

Danny: This time I tried using Python, but it was not very good. (8:13)

Alexey: So, SQL again? (8:17)

Danny: Mostly SQL. But we had an opportunity to apply data science to some of the problems within that digital space. I got to work hand-in-hand with the data science team. After that project, I got absorbed into the data science team. That was around when I became a data scientist — by title only. (8:19)

Alexey: Which year was it? (8:44)

Danny: It must have been 2016. (8:46)

Alexey: Still around the time when companies had no clue what data science is? (8:53)

Danny: I think so. Things haven't really changed much... (8:58)

Alexey: Yes, exactly. (9:01)

Danny: After I moved into the data science team, I started working on more traditional machine learning projects. Think of all the customer propensity modeling, looking at time series, forecasting. We actually did a lot of experimentation as well. It was still within the campaign and the marketing space, but it was much more rigorous in terms of how you would run the experiments and how you examine the results — with all the statistical testing and target/control groups. I spent a lot of time doing that. (9:06)
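
To make the target/control comparison concrete, here is a minimal sketch of a two-proportion z-test in Python. The function name and the conversion counts are hypothetical, and this is a textbook test chosen for illustration, not necessarily the method the team at the bank used.

```python
import math


def two_proportion_z_test(conv_target, n_target, conv_control, n_control):
    """Return (uplift, z, two-sided p-value) for target vs control conversion."""
    p_target = conv_target / n_target
    p_control = conv_control / n_control
    # Pooled rate under the null hypothesis that both groups convert equally.
    p_pool = (conv_target + conv_control) / (n_target + n_control)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_target + 1 / n_control))
    z = (p_target - p_control) / std_err
    # Two-sided p-value of a standard normal, written with the error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_target - p_control, z, p_value


# Hypothetical campaign: 1,000 customers received the email, 1,000 did not.
uplift, z, p_value = two_proportion_z_test(120, 1000, 90, 1000)
print(f"uplift={uplift:.3f} z={z:.2f} p={p_value:.3f}")
```

A small p-value suggests the difference between the groups is unlikely to be chance alone. In practice the group sizes and the success metric would be fixed before the campaign runs.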

Danny: Towards the end of my three years in the data science team at the bank, I started working more in data assets. It was more of an exploratory data role. But I would also be implementing the final data products that we were using across the different teams. One of my big things was to try and capture customer interactions around the bank. Imagine someone takes out some money at an ATM. They use their card to pay for a coffee down the road. Then they make a complaint on the website, because something happened with their transaction. I'd be able to track all of that at a customer level and draw it on a timeline so people can start analyzing the journeys. We were using all of those events to create massive feature sets and machine learning training data. We were using that quite often. (9:45)
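
The customer timeline idea can be sketched in a few lines of Python. This is only an illustration under assumed names: the event tuples, channel labels, and feature names below are hypothetical and are not the bank's actual schema.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical raw events: (customer_id, ISO timestamp, channel, event_type).
EVENTS = [
    ("c1", "2020-03-02T18:30:00", "web", "complaint"),
    ("c1", "2020-03-01T09:05:00", "atm", "withdrawal"),
    ("c2", "2020-03-01T10:00:00", "card", "purchase"),
    ("c1", "2020-03-01T09:40:00", "card", "purchase"),
]


def build_timelines(events):
    """Group events by customer and sort each group chronologically."""
    timelines = defaultdict(list)
    for customer_id, ts, channel, event_type in events:
        timelines[customer_id].append(
            (datetime.fromisoformat(ts), channel, event_type)
        )
    for timeline in timelines.values():
        timeline.sort(key=lambda event: event[0])
    return dict(timelines)


def timeline_features(timeline):
    """Flatten one customer's timeline into a feature dict for model training."""
    features = dict(
        Counter(f"n_{channel}_{event_type}" for _, channel, event_type in timeline)
    )
    features["n_events"] = len(timeline)
    # Hours between the customer's first and last recorded event.
    span = timeline[-1][0] - timeline[0][0]
    features["span_hours"] = span.total_seconds() / 3600
    return features


timelines = build_timelines(EVENTS)
features = {cid: timeline_features(tl) for cid, tl in timelines.items()}
print(features["c1"])
```

Keeping the sorted timeline separate from the derived features means the same events can feed both journey analysis and training data.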

Danny: Soon after I built that product, I got an opportunity to do the same thing, but at a retailer. This time I was doing it on the cloud with more digital data sources — Google ads, Facebook ads, Instagram as well as all their shopping data and whatever else they had at a customer level. We got to build that in GCP. It was a really great project. It was quite challenging. The team used that data to build multi-touch multi-channel digital attribution models. My last projects were around that space, where it was all about combining data, automating machine learning pipelines and doing everything on the cloud.

ABC of data science

Alexey: We will talk about these ABCs a bit later, but you captured at least two of them — A and B. Let's take a step back and talk about these different types of data scientists. (11:29)

Alexey: The topic today is the ABC of data science. What are these ABCs? I'm also curious how you came up with this idea? (12:00)

Danny: There's an article written by a data science team at Airbnb. They defined two types of data scientists: Type A, the analyst type, and Type B, the building type. Last year, when I was doing my live streams on YouTube, I added an extra element of Type C. It’s a combination of both, but more within the management realm. The A and the B were already there. I took them, talked about them a little bit deeper, and then added the C. I haven't seen other people talk about the C as of yet, but we'll see. Someone might claim it. (12:18)

Type A — the Analyst

Alexey: You can say, “I was actually first. Here's my video”. Let's talk about these A and B and C. What are those? (13:07)

Danny: Type A stands for Analyst, B stands for Builder. When we think of the analyst type, it’s something similar to the experience I had. I came in doing a lot of data analysis, being able to look at insights, do data visualization, make some dashboards, and try and figure out what's going on with the data. That's a very large component of data science. We can't really be doing machine learning if we don't know what’s the problem that we're solving. A lot of the upfront work and the investment of time and energy for many of the projects will be done on analyzing data and making sure that we're solving the right problem. (13:17)

Danny: The analyzing type comes from a data analytics background, maybe with a statistics background, and is very research-heavy as well. That is the origin of the first few data scientists that ever appeared in Silicon Valley. This was the first time that the term was being used. They brought on PhDs, ex-PhDs, and ex-researchers from all the powerful universities like MIT and Stanford, and put them into their teams at LinkedIn, Facebook, and Google. They were there to do research and try to figure out where they should invest money or how they can use the data to build a new product that will help the customer experience. You ended up getting things like the Facebook Knowledge Graph, the LinkedIn Network Graph. I think Google PageRank was one of the original large scale machine learning products as well. So when we think of the analyst type, they were that type of people. They come from a very heavy research background.

Background for the type A

Alexey: As I understood, data science type A is not necessarily a data analyst. This is a different role. The first data scientists ever who appeared, they were of this type. The other types followed after that. What kind of background do they usually have? PhD, people who are doing research? Is it the only type of background they have? Or are there some other backgrounds? (15:22)

Danny: The originals were mostly coming from that strong technical pedigree of having all the training, doing a lot of research, and writing quite a lot of code. Over time the companies realized that these people were really rare. Most of them were very happy to continue doing research, instead of moving into the industry. A lot of the other people who are working with data originally — your traditional analysts, your quants, maybe some of the database people as well — they started converging onto this role of “data scientist.” Most of them came from that analytical background. They were very good with the data. It made perfect sense for them to slot in and start picking up some of the other skills that those advanced PhD researchers were already using. (16:01)

Danny: This could have been Python, R, all those different languages, which are additional on top of the traditional data analyst skillset of using SQL, Excel, Tableau, and data visualization tools. That was the background of most of these people. Their actual study background didn't matter so much, it was more about their work experience. You still have people who are doing technical degrees (computer science, statistics, and mathematics degrees) that would also slot into the A type. They blend in with the B type, which we'll cover soon.

Alexey: The skills the type A data scientists need: first, how to program with Python, or R. Then they need to know the theory. Usually they were originally coming from these big famous universities. They need to know the theory of machine learning, how to experiment, statistics, designing an experiment and things like this. Is there something else that they also need to know? (17:38)

Danny: The analyst types also have to present a lot of their insights to the business. This also ties in with the Type C data scientists who specialize in communication. When you're doing a lot of analysis, even as a data analyst, you're expected to present your findings to the business, to drive a commercial decision or to drive some direction in the project. So communication is also very important. But to round out the skill sets for the type A data scientist: most of it is essentially a data analyst plus additional bits and pieces that come from the research background, a bit of programming, and more emphasis on communication than before. (18:20)

Learning path for Type A

Alexey: Analysts are already good at this storytelling and visualization. They need to add a bit of coding on top of that. And if they don't know theory, they also need to study more statistics. Usually, data analysts already know a bit of statistics, they need to learn a bit more, right? And also learn a bit of machine learning. (19:11)

Alexey: Let's say we have a typical data analyst who knows how to write SQL queries, who knows how to create dashboards, who knows how to do nice data visualization. They want to become the type A data scientist. What will the learning path look like for them?

Danny: That was exactly my path. When I started my journey into data science, I could do SQL (very badly), Excel, dashboarding, a little bit of R. But I had no idea what the other skills were. A lot of my journey was essentially just trying to figure out what I should learn. I got that direction when I started working with people who were in a data science role and had data science experience. They could guide me on what I should be learning. That was when I saw my own growth and development really pick up like a hockey stick. I felt that before I had that guidance, I was fumbling around in the dark, reading a lot of different articles and being stretched everywhere. You hear like, “You need to understand all the theory before you even start doing any programming.” And then someone else would say, “No, you implement machine learning things, and then you pick up the theory.” (20:01)

Alexey: That's what I say usually. (21:06)

Danny: That’s what I would say as well. Go and try a few things. You'll know when you need to learn the theory — when you'd have no idea what's going on. That was the same advice given to me. I would say for anyone who's working with data, visualizing and whatever, there's definitely a path into data science. You have to be curious. Keep working on it. Work with strong mentors who can give you that level of guidance to push you in the right direction. (21:08)

Becoming curious

Alexey: How can we formalize this “being curious”? Let's say you're trying to put together a learning plan. You put Python there, you put machine learning theory there, you put some other things. And then the last point is “I have to be curious”. How can somebody become curious? Or is this something intrinsic that you cannot control? (21:54)

Danny: There's a spectrum of curiosity. A lot of people need to know what's going on to the Nth degree. I've met people who wanted to do data science. They started learning Python. Then they wanted to figure out how the Python code gets compiled and translated into binary code, something crazy like that. That's probably on one end of the spectrum. You need to know everything, every single thing. (22:23)

Danny: On the other end of the spectrum, people want to accept orders. They don't take the initiative to push themselves more. It's not a bad thing. They just prioritize different things for their career, or in life in general. My parents are like that. They're just not very curious about many things. They do their nine-to-five, and then they spend time with their family. I think it's very honorable.

Danny: Everyone lies on the spectrum of curiosity. The thing about curiosity is that you can literally just tell yourself, “I want to be more curious. I want to know this more.” For anyone who wants to improve their curiosity and keep learning more different things, you just have to make the decision that you want to learn more things. It's as simple as that. I feel that there's no need to define curiosity. It's different for everyone. But everyone can do their own version of curiosity.

Alexey: I'm working with Python code. This is the other end of the spectrum you mentioned. I execute a line of code. It works. But I start wondering, “Why does it work?” I start digging deep, “In Scikit-Learn, we do this ‘.fit()’. What does ‘.fit()’ do?” Then I go, “I have no idea what this code does. Let me learn how logistic regression works”. Then I learn that, I come back and say “How does what I just learned translate to code?” And try to unwrap this thing. Did I understand correctly? (24:09)

Danny: That would be on the higher end of curiosity. That was how I did it. The first time I was starting to use the machine learning packages, I read the underlying papers within the documentation. But I'm different to many people as well. Many of my mentors did the same thing and that's what they told me to do. For anyone out there: if you need to learn machine learning theory, there's no better place than the papers which are implemented by Scikit-Learn, or LightGBM or XGBoost. The papers are all there as part of the documentation. It's an amazing resource. You can learn a lot by reading through the math and trying to figure out what's going on. (24:49)
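
As an illustration of what a “.fit()” call can hide, here is a minimal sketch of logistic regression trained with plain batch gradient descent. `TinyLogisticRegression` is a hypothetical teaching class, not Scikit-Learn's implementation: Scikit-Learn's `LogisticRegression` uses more sophisticated solvers and applies regularization by default, so its coefficients will differ. The toy data is made up and already centered around zero.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


class TinyLogisticRegression:
    """Hypothetical teaching class: unregularized logistic regression."""

    def __init__(self, lr=0.1, n_iter=2000):
        self.lr = lr          # learning rate (step size)
        self.n_iter = n_iter  # number of full passes over the data
        self.weights = []
        self.bias = 0.0

    def fit(self, X, y):
        n, d = len(X), len(X[0])
        self.weights = [0.0] * d
        self.bias = 0.0
        for _ in range(self.n_iter):
            grad_w = [0.0] * d
            grad_b = 0.0
            for xi, yi in zip(X, y):
                # Gradient of the log-loss for one row: (prediction - label) * x.
                err = self._proba_one(xi) - yi
                for j in range(d):
                    grad_w[j] += err * xi[j]
                grad_b += err
            # Step against the average gradient.
            self.weights = [
                w - self.lr * g / n for w, g in zip(self.weights, grad_w)
            ]
            self.bias -= self.lr * grad_b / n
        return self

    def _proba_one(self, xi):
        score = sum(w * x for w, x in zip(self.weights, xi)) + self.bias
        return sigmoid(score)

    def predict_proba(self, X):
        return [self._proba_one(xi) for xi in X]

    def predict(self, X):
        return [1 if p >= 0.5 else 0 for p in self.predict_proba(X)]


# Made-up, centered one-feature data: negative values are class 0.
X = [[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]]
y = [0, 0, 0, 1, 1, 1]
model = TinyLogisticRegression().fit(X, y)
print(model.predict(X))
```

Reading a loop like this next to the paper's math is one way to connect the theory back to the library call.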

Type B — the Builder

Alexey: Let's talk about the B type – Builder. What do they do? (25:36)

Danny: They're essentially software engineers, focused in the realm of data science applications. You do stuff in the cloud, set up the infrastructure, manage workflows, work with the data engineering team to make sure all the pipelines are solid, force other members of your team to write clean code, unit tests, functional tests and different things to make sure that the code will run in production. All of these things are hallmark traits of a Type B data scientist. They're software engineers. You can also think of them as the “new age machine learning engineers”. They build those systems and the machine learning ops, automated machine learning programs and different things like that. (25:53)

Danny: One of the key differences between the Building type and the Analyst type is that the Building type always wants to work in the production environment. There's almost a different mindset. The type A data scientist will be very happy to play around in Jupyter Notebooks: take data from a CSV file, explore and see what's there, try to apply some machine learning to it straight away — after figuring out what the problem is, of course.

Danny: But the Type B data scientist doesn't want to do that. They think of all of that stuff as a waste of time. They're going to refactor it, to change it somehow. People always talk about it as technical debt. There's a famous Google paper about the technical debt of machine learning systems. That's a really good paper written for type B data scientists. It's a very good thing to be aware of, even as you're starting in the field. You can make really bad decisions upfront, which will cost you a lot of money. If you haven't experienced that before, you will not know how much money it might potentially cost you or your company in the future.

Alexey: You mentioned technical debt. I create technical debt, and type B fixes it. (28:15)

Danny: Almost. The perfect middle ground is where you have a Type B data scientist doing the exploration. They can put in all the rigor and the infrastructure from day one. Then you start exploring and testing models in production. In my journey, I managed to get from the crazy type A data scientist, who had no idea what they were doing, to working in a production environment with software and data engineers. I was forced to see how they work. That's why sometimes I refer to myself as a “recovering data scientist”. Once you get exposure to the way things should be done in a very formal software engineering type of space, you can't unsee it — just like the video for my promotion of this talk. These things are essentially threshold concepts. Someone teaches you something that’s hard to forget about the way that you view the world. (28:26)

Danny: There's a lot of value in becoming a Type B data scientist. When we think of value for machine learning, a lot of it is automated value. It shouldn't be “Someone has to put in 10 hours of work to get back X amount of dollars.” It should be “Okay. I built this software. I built this model. It's going to continuously generate me additional revenue.” Because you're a good analyst, you'll be able to measure what your incremental value is from this model. But if you're a Type B data scientist as well, you can build and automate that whole system and do the measurement. It gives you more buying power as an employee as well. I think it's really important to cover both skills.

Going from Type A to Type B

Alexey: But you mentioned that these profiles have quite different mindsets. One is more exploratory, and the other focuses more on production. You also mentioned that it's possible (you are living proof) to go from type A to type B. How can this type A data scientist change their mindset and go to type B? It must be quite difficult. You're used to playing with Jupyter, executing cells in a different order. Take a look at these Kaggle kernels. They're probably created by the type A data scientists. They have a lot of knowledge. But they don't necessarily have this mindset and these engineering skills. How do they go from being researchers to being engineers? (30:26)

Danny: There are at least two ways that I can talk about. One way is that, as a type A data scientist, you get forced to work on a project in production. You have no other choice. You'll have your ass kicked by whatever's going on: your model won't run, your data is missing, something goes wrong, the package that you're using doesn't exist on some server. Things will go wrong, and things will go horribly bad. You're learning how to be a Type B data scientist the hard way. (31:24)

Danny: I had a little bit of that experience, but I reached out to people who knew what they were doing. They told me what to do. That was part of my experience at the bank. We didn't have many systems, and we were building a lot of the systems from scratch. Auto machine learning, deployment… We were some of the first teams to put our models into Docker containers and deploy them somewhere, have a Git-version-thing to actually control all the different model versions. Things like that. To some people, these things now sound basic. We have all these new MLOps things coming in to do that for us. But back in the day, we were doing all of that manually.

Skills needed for the transition

Alexey: You mentioned Git, Docker, and cloud. You use Google Cloud, right? What are the other things one needs to learn to transition from type A to type B, in addition to these things? (32:51)

Danny: That’s probably another cliché. You have to be curious enough to improve your programming skills. As a data scientist, you use machine learning, write a lot of Python code. It's very easy to trick yourself into thinking that you're a good software developer. I did that all the time. I thought I was amazing. Until I worked with people who were actually amazing. (33:12)

Danny: There's the saying, “If you're the smartest person in the room, then you're in the wrong room”. I don't always believe in it, but there's always something to learn from other people. If you can learn from the best people out there, that's ideal. You want to challenge yourself by being outside of your comfort zone. The more time you spend being the “top dog”, the one setting all the standards in the team, the less likely you are to be learning as much as you could.

Danny: One of the major things for me happened when I moved from team to team throughout my career. I was lucky to be paired up with someone amazing. When I joined the bank, I was paired up with someone who was amazing at SQL. I learned how to write SQL. Now I read back my old code and think to myself, “What the hell is this? It doesn't make any sense.” I do that all the time. The code I'm writing now, in six months I'll look at it and just won’t know what's going on.

Alexey: Better not to look at it. (34:51)

Danny: Yes, better not to look at it. I would say Git is an important skill. Even for type A data scientists, being good with Git or any other version control is mandatory these days. The cloud is not mandatory now, but it's popular already. In the future, everything will move to the cloud. Cloud skills are definitely going to be a thing, if they're not already. We would do everything on the cloud. All the databases are on the cloud. We’ll grab the data from the database and dump it out on the cloud, into storage, like Google Storage, or things like that. Then we would spin up a virtual machine on the cloud, load the data in from the storage area. Everything is quite seamless from that perspective. (34:53)

Danny: There are two camps of people who are learning cloud stuff. At one end it's “this is too difficult, I need a DevOps guy to help me”. At the other end, you overestimate your ability to do cloud stuff and end up costing your company a lot of money. Those are the two ends of the spectrum. But in general, the cloud is not too difficult to learn. It's one of those things which you learn by doing. The more you expose yourself to these different technologies, the faster you will be able to pick them up.

Danny: The beauty of it is you can learn for free or virtually for free. You can sign up for a Google account, or an AWS account or a Microsoft Azure account, and you get a few hundred dollars in credit to learn. There's a lot of different online resources as well.

Learning the skills if they aren’t needed at work

Alexey: You got lucky. You had to implement these things. You also were lucky that there were people around who could help you to learn these skills and to master them. I imagine there are many type A data analysts whose work does not require that. They are not pushed by the company to do these things. Even if they are pushed to do these things, they don’t have somebody to talk to. Maybe let's start with the first one. If you don't need to do it at work, how do you develop the skills? How do you learn cloud and Docker if all you do at work is SQL? (36:46)

Danny: This is a challenging one. The challenge is that it will be difficult without a very strong incentive for you to learn. Maybe we can give some incentives here. In the future, data scientists who know cloud computing will always get first round preference, even if other people have more experience. Unless you're well known and you have a good reputation, cloud computing will be a thing. When people are trying to push themselves to learn different things, having good incentives to force you to learn it at all costs is very good. (37:45)

Danny: Another thing: get rid of the lifeboats. You tell your team, “We're moving to Docker.” And then you Dockerize every single job that you have. That would be extreme. You need to have a lot of trust from your company to let you do that. But it's possible. Take little steps at a time. At this stage, you might not be very familiar with Docker, or with any other tools that you might want to progress in to become type B.

Danny: Reach out to friends who are skilled in that area, who might be working in the same company. They understand the systems. Ask them for advice. “I want to do this, how do I do it better?” The engineers will have some idea about how to do things better, because engineers usually do. You'll definitely be able to learn something which you wouldn’t have been exposed to by talking to other people within the company. If there are no engineers in that company, you can ask a friend or reach out to someone on LinkedIn — somebody who might have some more experience on the thing and solicit them for advice.

Danny: Take things step by step. There's not always a need to cut the lifeboats and aim for the moon all the time. But it's good to have that mindset and to keep improving. I think that's more important than aiming for perfection from the beginning.

Alexey: I think you answered the second part of that question: how to get people to help you. If there's no one in the company, you reach out to people. There is a question, “How do I find a mentor?” This is the answer: you can reach out to people on LinkedIn. Different communities and other teams in the company can also be an option. (40:06)

Alexey: You also mentioned one thing. You’re at work. All you do is SQL. You can try to challenge the status quo: approach your management and say, “This is how companies are doing this right now. Let me spend some time and dockerize the things we have.” This is how you learn. But management has to have a lot of trust in you to do this.

Danny: You have to build up trust and your reputation for the business to trust you. One really great example comes from one of the people that I've been mentoring on our Slack group. He was working in a role which was advertised as an analytics role, a statistical analyst role. He ended up doing a lot of reporting work in Excel, which is not great. But he showed some of his managers and his bosses, “I've been learning this stuff in the background.” (41:08)

Danny: He was learning how to do machine learning to predict churn, predict lapse, predict different things like that. He said to his management, “We have a lot of data about our customers. I might be able to figure out who are the customers who might not be buying with us again. We can try different things. I'm not sure if it will work. But if you give me a little bit of time, we can see what happens.” He asked and he was very open about his aspirations to apply more techniques. They let him try it. The project was very successful.

Danny: People need to be curious, take the initiative, and put more of their skin on the line. If there are strong incentives for you to do something, you're more likely to do it.

Type C — the Consultant

Alexey: You mentioned persuasion, which is one of the communication skills. Communication is an important part for type A data scientists. There are also questions in chat about data storytelling skills, communication in general. This brings us to the Type C data scientist. I think for the Type C, these skills are more important than for others. Let’s talk about this role. Who is the type C data scientist? What do they do? (42:38)

Danny: Type C stands for “Consultant”. In the past, a data scientist was the intersection of the programmer, mathematician, and the business domain expert. This type C is essentially that person in the middle, but with very strong consulting skills. (43:12)

Danny: When I was working in data science, my team was part of an internal consultancy model. We would go out and try to win projects with different parts of the business. We would work on the projects, deliver value, and show them how much value we delivered. Then we would get another project with a different part of the business, with a different part of the bank, because of the results that we had already delivered.

Danny: It was essentially the same as a traditional consultancy. They are heavy in sales. They have people working and delivering the thing. They always have really strong stakeholder engagement and communication to make sure that the business problems are being solved adequately with the right solutions.

Danny: In the same way, the ideal type C data scientist would have been an analyst and would also have been a builder. They need to know what technical decisions and trade-offs they have to make. Usually, type C's are either people leaders or project managers. Or they're very good at persuading and talking to business. If you're a strong talker, you're more likely to be a people leader. Usually the type C data scientist will be your manager of data science, your head of data science, chief data scientist.

Danny: There's a cautionary tale around the Type C data scientists. It's very rare for someone to actually have gone through the Type A experience and the Type B experience, and then want to go in and lead projects and lead teams. It's a very rare skill set. Not many people want to do all of those things. There's always an incentive for people to make it look like they want to do those things. Alexey is smiling. There are many charlatan data scientists. It would be a senior leader who doesn't have technical experience, but they might be very good with the business. They might be very good at making decisions. But they won't have all the experience to actually make informed decisions themselves. They'll have to rely on their team a lot.

Alexey: To be successful, they don't have to have these skills. As long as there is somebody of type A and type B data scientists around them. They can say, “This is not possible to do. You're selling something that will not work”. (46:12)

Danny: Definitely. Maybe my word “charlatan” is a bit too harsh. Maybe I’ll offend too many people. It's more around the people leader, the business leader, who doesn't have the experience. (46:40)

Danny: There are different types of consultants as well. You can think of them as the true consultants. They aren’t working very deep in the data. They're talking about the process to solve problems, working closely with the stakeholders to figure out what the problems are, and making sure that their concerns and issues are all addressed. They're very important. But often we see that some leaders are not very technical. It's not a bad thing. There are some people who pretend that they're technical, but they're not. That is what comes off as a bit of a charlatan.

Is Type C for me?

Alexey: I imagine that this role pays more than the other ones. You have more responsibility. There’s also the skill of persuading people. Communication is something rare among technical people. This is a pretty rare person, that's why it pays higher — and this is well deserved. That's why many people think, “I probably want to go down that road, I want to become a manager, or do this work more.” (47:36)

Alexey: But it's also a change in the mindset. One day, you're type B and building things, and when you become type C, you rarely do hands-on work. This is quite a shift. Do you know, or maybe you have a recommendation for people who think they will like type C work, but they're not sure? How can they safely test this and see if this is something for them or not?

Danny: It's really interesting. Think about the people who are starting right now. You're coming in as a junior. You don't have too much work experience. You haven't had too much exposure to technical data analysis or things like that. I recommend starting off as an analyst or a builder, depending on your background. If you have software development skills, try and start off as a builder. You're already a few steps ahead on the ladder. If you did an analytics degree or anything with a lot of analysis in Excel or Tableau, start off as an analyst, because that's where your strengths are. Those two are fine. (48:49)

Danny: But when people start thinking about “I'm in it to reach the top. I want to manage business. I want to be the team leader. I want to manage people. I want to make the big project decisions.” That's not the ideal mindset when you're starting. For most people, when they get to that level, they have a ton of experience within the deep technical realms, or they have a lot of experience coming in from the side as well.

Danny: I've seen and I've worked with really successful managers of data science, heads of data science or whatever. Many of them come in from a nontraditional data science background. They might have done something within the data space, like a head of insurance modeling, or they were working in campaign marketing. They're very switched on commercially. They've got quite a few years of experience in either a non-related or semi-related field to data science, and they moved into the data space.

Danny: The most important thing for the Type C data scientist will be a commercial mindset: commercial acumen of being able to make decisions, people leadership, being able to talk to businesses, convince leaders to do something, as well as the storytelling piece. All of those persuasion components are really, really important.

Danny: For anyone who's starting off: if you want to end up becoming a type C, it's definitely a great goal to have. But you really want to try and figure out what you want to focus on in the short to mid-term. It's going to be very difficult if you only focus on the people skills and the project management to get to the head of data science role. It's just not going to happen, unless you move around and are a little bit dodgy. Then it might happen.

From product management to Type C

Alexey: Or a bit of luck also will not hurt. As I understood, a good background for the Type C data scientist is experience working as a Type A or type B, or maybe both. That’s ideal when they become type C. Is there a different type of background that is suitable for this role? For example, product managers? They already have the skills: they are good communicators, they can tell stories. They don't necessarily have this exposure to machine learning. Can they become good type C data scientists? (52:00)

Danny: I've worked in teams where there was a formal role called “Data Science Product Manager”. That would be the product manager for data science products or data science projects. There would be machine learning, data analytics and reporting. For them, it would be a natural transition. (52:45)

Danny: Let's say you're working in a startup. You're dealing with a frontend facing thing. You’re a product manager, a product owner. You have to manage a lot of the different processes with the technical teams. You might not understand what they're doing with the code. In the same way, these data science product managers usually have the same sort of relationship with their teams. The people who I've seen in that role are almost transplanted out from those teams with traditional product managers and they're told, “Manage this data product thing and learn on the fly.”

Danny: It's definitely possible. Whether you would call them a Type C data scientist or not, that would be another point of contention. For the people who do come from that product management background, if they start learning data science and they’re passionate about learning things like Git, Docker, analyzing data, and visualizing in R, then that's fine. But I find that for people who are moving into the data space, it's not time-efficient to learn those other things. They won't be adding as much value as if they just focused on their roles, like a Delivery Manager. It depends on the amount of flexibility you get in different companies, and whether they give you that opportunity or not.

Danny: I do think it's very useful to learn. Data is the future, right?

The perfect data science team

Alexey: It helps to speak the same language with data scientists and engineers. I think they do that. We have three roles: analyst, builder, consultant. Apart from these three roles, who else do we need to build a successful product? We talked about the product manager. In chat, people mentioned the data engineer. Who else do we need on the team? (54:48)

Danny: Let's try and build up the perfect data science team to deliver something into production. I used to think of data scientists as a SWAT team. You just drop them into a business problem. Then they figure out what the hell's going on and they tell the business what to do. Then they make it happen. It would be perfect — the perfect business team. (55:34)

Danny: Let's start with the engagement first. You'd have at least one consultant type, or they could be a product manager. But they're very heavy on the engagement with the stakeholders. They would be the team lead, leading the project, leading the team, interacting with the business. I've seen teams where one person is a business lead. They'll be very hands-on with the business, very strong commercially.

Danny: They'll have a counterpart, which is the tech lead. You would have a tech lead who's also good at the business, but very good on the technical side. They should be good at data science. They don’t necessarily have to be very good with the engineering or the DevOps or anything like that. But they'll be able to talk to those different teams. Those two are critical.

Danny: Then you'd have a domain expert: a business analyst or a data analyst that works directly in the team. They'll be talking with the business. Most of the time, when you're working on a data project, your team won't be familiar with the data you need to work with. You need to go out and find out what's going on. You need to physically get it from some silo somewhere in the business as well. You need people who are really strong at SQL, strong with communication, and strong with exploration skills, to figure out what's going on and read documentation. They need to find out when the data was built, why it was built. What are the processes? What are the ETL stages? All of those things.

Danny: Let’s call this person “a data lead”. They probably work as a data engineer. It'll be a data engineer who’s more on the business side. (57:44)

Alexey: There is a question, maybe we can touch on that a bit? Does a data engineer fit into any of these letters? Or is a data engineer not A, B, or C? (58:14)

Danny: I think a data engineer would be closest to B. I've worked with a lot of data scientists who used to be engineers. They were very firmly in type B. That’s where their background was. They were amazing. They were very smart. If you told them, “You need to apply this mathematical theory or statistical theory to the problem.” They'll just read the source code and be able to implement it. Straight away. Definitely, data engineers as data scientists – very good fit. (58:27)

Alexey: So type B? Irena wrote that she thinks that a data engineer is type B. She's correct. (59:04)

Danny: Correct. 100%. (59:14)

Alexey: So we have a team lead, right? (59:16)

Danny: Back to our team. We have the team lead, the tech lead, the data lead. The data lead may or may not be a data engineer or a data analyst. It doesn't really matter. But they’re the domain experts. (59:18)

Danny: Then you have your technical data science team. You ideally have at least one analyst type. They'd be out trying to explore different things. They have a focus on exploring. They would work hand-in-hand with a Type B. The Type B will be setting up the infrastructure for them, telling them “If you're going to explore like this, let me figure out what data you need. We'll try and fit it into this thing, and then you can start working in Docker from day one.” Or something like that.

Danny: The ideal team. I would say if you have two really strong type A and type B data scientists, you should be sweet. Ideally, you'd have more people learning from them as well. If they were your senior people, you'd have maybe one or two people underneath them, to shadow them and learn. But that should be almost enough for the leanest team that you'd want to try and solve a lot of problems.

Danny: Maybe have one more person, let's call them type A as well, who’s doing a data analytics function. Once the data scientists have been building their models and exploring their models, and using Docker to do all these things at scale, then the results would pop out. They would be analyzing the results with another data scientist who's very focused on doing the measurement piece. Measurement is always left out when we deliver projects. We need to measure the value of what we're doing. How much time are we saving? Are we solving the problem properly? What was the previous rate at which the problem was being solved? How much uplift are we getting? What is the cost to my business, my profit? It's about basing this on business metrics, and translating all of the work that the data science team is doing into time savings and cost savings.
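
To make those measurement questions concrete, here is a minimal Python sketch of two of them: relative uplift over a previous rate, and time savings translated into money. The function names and all the numbers are hypothetical illustrations, not figures from the episode.

```python
def relative_uplift(previous_rate: float, new_rate: float) -> float:
    """Relative improvement of the new rate over the previous rate."""
    if previous_rate <= 0:
        raise ValueError("previous_rate must be positive")
    return (new_rate - previous_rate) / previous_rate


def monthly_cost_saving(hours_saved_per_case: float,
                        cases_per_month: int,
                        hourly_cost: float) -> float:
    """Translate time saved per case into a monthly money figure."""
    return hours_saved_per_case * cases_per_month * hourly_cost


# Hypothetical example: a problem was solved 10% of the time before the
# project and 12% of the time after it.
print(f"{relative_uplift(0.10, 0.12):.1%}")  # prints 20.0%

# Hypothetical example: half an hour saved on each of 400 cases a month,
# at an assumed cost of 60 per hour.
print(f"{monthly_cost_saving(0.5, 400, 60.0):.2f}")  # prints 12000.00
```

The point of keeping it this simple is that the numbers a business cares about are usually a baseline, a new value, and a conversion into time or money.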

Alexey: To show that these people are not just reading papers and having fun, but actually solving problems. (1:01:28)

Danny: The Type C data scientists would be perfect in that role as well. Maybe a lot of that measurement goes back to the actual tech lead, who's leading these data science teams. They should be able to translate what their team is doing into dollars. I think that's really important when we're thinking about enterprise data science. (1:01:35)

Getting into data science without domain expertise

Alexey: We have quite a few questions. I think now is a good time to go through these questions and answer them. How can anyone without any domain expertise make a career in data science? (1:01:56)

Danny: There are consultants. Their whole career is being very good with the technical part but not having to dive deep into the domain expertise. You could be a very successful data scientist working in a consulting company. You're moving from one domain to another. You don't necessarily need to be very strong in a single domain. But you have to have skills where you can skill up quickly, in any domain. That in itself is a very specialized skill. When we think of the big three — McKinsey, Bain, BCG-type consultants — that's what they're doing. They get sent one day to an oil field, the next day, they're going to a farm to do agriculture. It's very, very broad. They've started doing data science within those spaces as well. (1:02:13)

Danny: It definitely can be done. But ideally, you would be very strong in at least one domain. You have to demonstrate that you can learn something quickly. That's probably the best way to do it — to demonstrate your knowledge in one space.

Alexey: And that space can be a technical space. (1:03:22)

Danny: I would say the domain expertise is more around the business domain. You can also have specializations, as opposed to the domain expertise. Let's say, Alexey is good at cloud computing and deep learning. Deep learning on the cloud is maybe his specialization. That would also count as a specialization where you don't need a lot of domain expertise in any specific business domain, as long as you can translate the different problems into your technical specification. (1:03:25)

Alexey: Or have a product manager or business owner who can help you do that. (1:04:05)

Danny: Yes. (1:04:05)

Breaking into data science as a fresher

Alexey: You probably get this question quite often. How do you break into data science as a fresher? Where and how to apply? (1:04:11)

Danny: Let’s tackle the applying one first, since it’s easier. There are two schools of thought. You either apply everywhere, or you apply nowhere. If you apply everywhere, you're going through the public channels. You're going through LinkedIn, the company website, through recruiters. The flip side of that is you don't apply anywhere. You only go through referrals. It's difficult both ways for people who are breaking into the industry. There's so much competition and the area is just so hot. Those are the two different ways that you can apply for these roles. (1:04:24)

Danny: But no one really talks about how you get to the stage where you can start applying or even getting referrals. The most important thing for anyone trying to break into data science is a project portfolio. There's no other way, not in this competitive environment. If you're lucky enough, and your company happens to have an opening for a data scientist, and you're in an adjacent team, they can give you a chance to jump into that data science team. Like I did. You could be lucky. But it’s very, very rare.

Danny: Most of the time, for data science roles, they want people with experience. The way to combat your lack of experience is to demonstrate what you can do. One of the ways to do it in a quantifiable way is to have a project that's public. You can share it with the recruiters, with people who might be in the hiring management team. Or if you know someone in the company who knows other people. This is where your network comes in. But if you don't have a portfolio to show someone, it doesn't matter who you know. They can't vouch for you. I would definitely say: build a project portfolio with the best programming standards possible. Try to avoid the common pitfalls that we see a lot in beginner projects.

Alexey: I can imagine what you're talking about: a Jupyter Notebook without comments, where you need to execute cells in some particular order. Things are dumped there without any comments, without a README, without a requirements.txt file, and things like that. Right? (1:06:57)

Danny: Yeah, maybe we can do a future session where we'll just point out common mistakes made by beginners on their projects, or different things like that. (1:07:12)

Is it easier to start as Type A or Type B?

Alexey: A follow up question from me personally. If somebody wants to break into data science, but they don't know, should they go with Type A or should they go with Type B? Do you think it's easier to go into a type A data science job than type B? Can we even compare them? Can we say that it's easier? Or these are just two separate things? (1:07:22)

Danny: It will be down to luck and availability. What's available for you to apply for, when you're looking for the first job? As bad as it sounds, if you had skills in both, you can apply for both. But not many people have the time investment to learn all of the different skills for type A and type B without work experience. I'd recommend relying on your background. If you have a software engineering background, it’s easier to make it through as a type B. (1:07:48)

Danny: But there's also an alternative side to that as well. Let's say you're a software engineer. You did computer science as your undergrad. If you decide to go after an analyst role, you'll be more competitive because you're a strong programmer, compared to people who are not as strong programmers. That's another perspective to think about.

Danny: I wouldn’t say it's black and white, that you should do this or you should do that. Try and think strategically of how you can use your skills to best compete. Unfortunately, the environment is so competitive. Everyone is looking for jobs here. If you're an analyst type and you're hearing this, you're like, “Damn, I'm not very good at programming. How can I skill up in programming so I can compete against these other people?”

Danny: There’s a catch. To be able to have a very strong programming background and develop it quickly — usually those things don't go hand in hand. It takes a lot of time to become a good developer. You can pick up strong fundamentals and start with good practices. It always helps. But time on the keyboard is time on the keyboard. Someone with two or three years more technical experience will always have stronger skills, compared to someone without any. Play with the cards that you're dealt, essentially.

Danny: You shouldn't feel discouraged about that. When we think of the analytics teams, a lot of people want to go straight into a type A data science role. These type A data science roles are actually more challenging than the Type B ones. You're expected to deliver value sooner. If you're the first data scientist within a company, you're not going to be a type B data scientist. They won't know what they're doing. You'll be tasked with, “Figure out why our sales are down. You've got three weeks.”

Danny: Usually, the best type A data scientists are more senior. They would have had a few more years of experience working with data. Their intuition around the data and the business is usually higher. For people when they're starting off, if you come from an analytics background, not a software engineering background, a data analyst role is not a bad option. You can skill up in a lot of the things, which we traditionally think of as data science skills.

Danny: You don't need to learn them on the job. You can learn them on Kaggle, through online courses, through different things like that. The most important thing when you’re starting off your career is being close to the data. The more you can use data, the better. SQL, Tableau, Excel: it doesn't really matter. A lot of people really feel stressed about that. “If I don't get a data scientist role, I don't know what I'm going to do.” We shouldn't think like that. If I had thought like that when I started my career, I'd probably still be waiting for my first role. A lot of people are thinking the same thing too.

Alexey: So the short answer is – it depends on your background. The long answer is what you just gave. (1:12:03)

Danny: It would be: it depends on your background and your strategy of how you want to do it. I think it also depends on what your interest is, and how dedicated you are to whatever you want to do. (1:12:11)

Bootcamps — yes or no

Alexey: What is your opinion about bootcamps? Is it possible to break into type A or type B data science from a bootcamp? (1:12:26)

Danny: I think it's going to be rarer. There's definitely very high quality online training material out there. You've got the guys at Super Data Science who are doing amazing stuff. They're trying to build out certifications which people can use to validate their skills and use it to get jobs. That's really awesome. (1:12:36)

Danny: A lot of the bootcamps… Be careful. They're always going to have strong marketing. They'll tell you about star people who landed jobs at Facebook, Google, Amazon, after they took a six-week bootcamp with no coding experience before. Those things actually happen, but they are rare. They're probably outliers. It's very difficult to teach all this stuff in six weeks. You can't really time-box learning data science. It should be a long, long journey. I've been learning data science for who knows how long, five or six years. I still don't know what I'm doing. There's always something new to learn.

Danny: Bootcamps are good if you have the time and the money, and you're really willing to focus for those 12 weeks and work your ass off. They're really, really valuable. They're good incentive devices to actually do that. When you do it, do a lot of research. Reach out to people who have taken those bootcamps to get their feedback. Find out how they were doing. See if they have alumni that you can reach out to and learn a bit more. Definitely do as much research as you can. Don't dive into something on the spur of the moment and end up paying too much for something that you might regret.

Serious SQL — learn SQL from Danny

Alexey: It's a perfect segue to talk about your course. You mentioned that it took you a couple of years to learn what you know. It's not possible to cram everything into 6 weeks or 12 weeks. But you see a problem there with bootcamps, with education, with the sheer amount of information that we have around us. And you decided that you want to create a course. Can you tell us a bit more about it? What is it? For which type of data scientist is it? (1:14:37)

Danny: I recently launched the DataWithDanny Virtual Data apprenticeship program. The first part of that program is my Serious SQL course. I'm trying to train everyone in all the different skills that I've picked up over the past few years working in SQL. My vision for the online course business was actually really interesting. When I started doing all of these things, I actually didn't want to sell a course at all. I felt that there were ethical issues with charging people money for something which they can learn for free. Would people even pay for something that I'm creating? Or different things like that. All the regular doubts that people have also. (1:15:19)

Danny: In the end, I figured that the way that I learned my skills was directly through my mentors. It wasn't learning from the documentation… Of course, that helped. But the most important thing was actually to work closely with a mentor. These mentors might have been an expert in certain areas, but they also had things that they didn't know themselves as well. It was a learning journey between the both of us when we're working together.

Danny: In the same way, I wanted to create that same learning experience for anyone who's starting their data journey. My belief is that SQL is hugely important whenever people need to access data.

Danny: The next few things I wanted to teach were more around data visualization, data manipulation, forecasting, time series, machine learning. All of these things. But the first thing that I chose was SQL. I learned how to solve data problems by using SQL only. Having strong SQL skills is very important — just to understand data. It's almost like the natural language of how we use data.
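
As a small illustration of solving a data problem with SQL only, here is a minimal sketch that finds customers with no recent orders, using Python's built-in sqlite3 module so it runs anywhere. The table names, columns, dates, and customer names are all hypothetical and are not taken from the Serious SQL course.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            order_date  TEXT NOT NULL  -- ISO dates compare correctly as text
        );
        INSERT INTO customers (id, name)
        VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Chloe');
        INSERT INTO orders (customer_id, order_date)
        VALUES (1, '2021-01-15'), (2, '2020-06-01');
        """
    )
    return conn


def customers_without_orders_since(conn: sqlite3.Connection, since: str) -> list:
    """Names of customers who have no order on or after the given date."""
    query = """
        SELECT c.name
        FROM customers AS c
        LEFT JOIN orders AS o
               ON o.customer_id = c.id
              AND o.order_date >= ?
        WHERE o.id IS NULL      -- no matching recent order was found
        ORDER BY c.name
    """
    return [row[0] for row in conn.execute(query, (since,))]


conn = build_demo_db()
print(customers_without_orders_since(conn, "2021-01-01"))  # prints ['Ben', 'Chloe']
```

The whole question is answered by one query: the LEFT JOIN keeps every customer, and the IS NULL filter keeps only those without a recent order.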

Danny: I just launched my Serious SQL course. It's in live beta now. Parts of it are still not done. I need to work hard to finish them on time for the real launch. I've had a waitlist before. I've also launched it on LinkedIn. The price is currently $29 US. There is a student price as well. If you're a student and you're tuning in, please reach out to me on Slack or send me an email. The email will be support@datawithdanny.com. I can share the link with you.

Danny: The whole thing with the Serious SQL course is to go through as many of the fundamental skills as possible, but also to use those skills to solve projects that are as close to real life as possible, based on my experience. There'll be a lot of technical projects and lots of case studies, and the focus is more on “How do we solve different problems?”

Danny: I'm going to add interview questions, case studies and things like that to help people apply their skills to get the job or to start building their project portfolio. If anyone's interested, please reach out to me.

Roadmap for becoming a data scientist

Alexey: There is a related question. What is the complete roadmap for becoming a data scientist? If somebody goes to your website, it’s not just this one SQL course. There is a learning path that starts with SQL. You also mentioned data visualization and machine learning. What else is there in this roadmap? What else should people learn to become a data scientist in your opinion? (1:19:05)

Danny: If you're starting from scratch and you have never used data before, I would recommend SQL first. It's a natural way to learn how to use data. From there, start learning one of the popular programming languages for data science. This could be R or Python. The vision for my next course is to teach data visualization, a bit of statistics, forecasting and data manipulation, but in both languages at the same time. It's very useful to use both. The learning experience will be very positive. It will be very challenging, but we'll make it happen. The more you know, the more opportunities will open up to you. It's very good to know both. Those are the core skills. (1:19:35)

Danny: It doesn't matter if you do R or Python or both, but you'll need to have strong data visualization and analysis skills. Hopefully experimentation as well: A/B testing, defining metrics, conversion metrics, and things like that. And a little bit of forecasting, just enough to know what you need to do for 90% of the problems. That is the next step. That was my natural step, when I was going through my journey working towards a data scientist role.
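
For readers who have not run an A/B test before, here is a minimal sketch of comparing two conversion rates with a two-proportion z-test, using only the Python standard library. The technique is a common textbook choice rather than something Danny names in the episode, and the function name and visitor counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a is the conversion rate of variant A, conv_b / n_b of variant B.
    Uses the pooled standard error, which assumes large samples.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in the data, the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 100 of 1000 visitors convert on A, 130 of 1000 on B.
z, p = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value suggests the difference in conversion is unlikely to be chance alone, but choosing the metric and the sample size up front matters just as much as the test.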

Danny: After that, we'll focus on traditional machine learning algorithms: logistic regression, linear regression, decision trees, random forests, boosted trees, before we start going into deep learning. When we think of machine learning algorithms, how we do the measurement, and how we optimize hyperparameters, it's easier when we think of it in the traditional machine learning sense. With some of these algorithms, my plan is to get people to implement them using base Python, or using NumPy. I also wanted to cover Julia. I was hoping the machine learning part would be in Python and Julia, ideally. That'd be fun.
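
To give a feel for what implementing an algorithm in base Python can look like, here is a minimal sketch of simple linear regression fitted by ordinary least squares, with no libraries at all. It is an illustration only, not material from the course, and the function name and data are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x-y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1.
slope, intercept = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # prints 2.0 1.0
```

Writing it out by hand like this shows that the fitted slope is just the covariance of x and y divided by the variance of x.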

Danny: Then, after traditional machine learning, deep learning comes in. You’ll learn about all the basics of deep learning: neural networks and the different layers that you could use in your model. Also, some of the new architectures, like transformers and attention networks. There's a whole lot of stuff and it's always rapidly evolving. By the time I get round to making that course, all of what I've said now is going to be made redundant. But it'll be a very exciting space to work in. I wouldn't recommend people dive into deep learning straight away from day one. It’s very useful, but once you start using it, you shouldn't see it as a hammer for every problem.

Alexey: The next talk will start in five minutes, so we should be wrapping up. You must be tired after talking for an hour and a half? (1:22:48)

Danny: No, I'm fine. (1:23:03)

Importance of masters or PhD for data science

Alexey: A quick question to you. I know what you're going to answer. In your opinion, how important is it for a data science position to have a master's degree or a PhD? (1:23:04)

Danny: Different data science positions have different requirements. If you're working in machine learning research, where you're developing a new algorithm, or you're running some clinical tests with machine learning, or something research-heavy, you'd have a Masters or a PhD. You need a lot of experience doing those things, reading a lot of math, and being able to handle a lot of complexity — that's what you're trained for. If you think of machine learning engineering teams of Amazon, Facebook, Google — any of the big tech companies — they usually prefer people with a Masters or a PhD. (1:23:20)

Danny: But for others in the industry, they are not in that realm. You wouldn’t have to design a whole new layer of a neural network. It's not going to be required that you have a Masters or a PhD. It definitely helps if you've got a Master's in computer science, mathematics or statistics. For people who are making the decision of whether you do further education, it's a trade-off. How much time can you spend doing that education? Are you able to do it part time, full time, while still working on a job as well?

Danny: I haven't got a Master's or a PhD, but I've got a lot of industry experience working on lots of different problems. Most of my mentors actually have PhDs or Master's degrees. Instead of doing further study, they told me, “Go read the books that are on the course recommended page. You might learn more than actually doing the degree.” So I would say it's not always needed. It will help, for sure.

Alexey: That is it. Thanks a lot for sharing all your knowledge. You have shared a lot during this hour and a half. That was an insane amount of information. Thanks everyone for watching and asking questions. We will now have a small break. Danny, you're free to go. Thanks a lot again. (1:25:21)

Danny: Thank you so much for having me on. (1:25:46)
