DataTalks.Club

Mastering Algorithms and Data Structures

Season 5, episode 1 of the DataTalks.Club podcast with Marcello La Rocca


Transcript

Alexey: This week, we'll talk about algorithms. We have a special guest, Marcello. Marcello is a senior software engineer at Tundra. And he is the author of "Advanced Algorithms and Data Structures". I think it was released recently, right? It's out of MEAP now. Congratulations. It’s a book about algorithms. Marcello works with graphs, optimization, algorithms, genetic algorithms, machine learning, and quantum computing. Welcome. (1:51)

Marcello: Hi. Thanks a lot for inviting me. It's a pleasure. (2:52)

Marcello’s background

Alexey: My pleasure as well. Before we go into our main topic of algorithms, let's start with your background. Can you tell us about your career journey so far? (2:59)

Marcello: I've worked as a web developer and on data infrastructure. I started with web development. Then, I worked for five years in a government-owned company in Italy. Then I started working remotely with startups and then moved to Ireland to join Twitter. After that I worked for Microsoft and Apple. And last year, I joined Tundra — an online shop. (3:11)

Alexey: Does Tundra have anything to do with forests in Russia? (4:06)

Marcello: No, I don't know. Nobody knows where the name exactly comes from. But it's a nice company. (4:12)

Learning algorithms and data structures

Alexey: How should people approach learning algorithms? (4:51)

Marcello: Algorithms can be learned at very different levels, depending on how much depth you need. When I started my studies, I was advised that it's important not to focus on details. You need to learn that such an algorithm exists, when to use it — in which situations, and what problems it can solve. Also, perhaps, how efficient it is. By knowing that an algorithm exists and what problem it solves, you can find it when you need it at work — when you should apply it. If you don't remember the algorithm by heart, that's fine. Nobody can remember all the algorithms. But if you know where to look for it, that's perfect. (5:19)

Alexey: I didn't have algorithms during my studies, so I was learning them outside of the university, by myself. Many courses I took focused on derivations and on mathematical proofs. It seemed like that is an important thing — to really understand how the algorithm works and to prove that it's O(N log N) using some difficult mathematical stuff. So this is not something we should focus on, we should focus more on applications, right? (6:47)

Marcello: Yes. There is also a funny story about this. Google was created by Page and Brin. When they started their studies, they talked to an Italian researcher about this idea of an intelligent crawler. It did the searching in a different way than just indexing, like Yahoo or Altavista. This researcher went to his Italian university and proposed the idea. They told him that it would never have a future, because you couldn't prove that it was right. So focusing on the mathematical proof can be tricky and dangerous. It's important, of course, if you are working on a paper, to prove that the algorithm works. But it doesn't mean that if you don't know the proof, or don't work out the math, the algorithm will be useless. (7:30)

Resources for learning algorithms and data structures

Alexey: So when we learn algorithms we should focus more on applications than proofs. Do you know any good references for basic algorithms, like sorting? (9:00)

Marcello: There are a lot of resources online — courses and websites. There is a series of videos from MIT that is very good. There is Tim Roughgarden's course on Coursera. It explains things clearly and it's as simple as it gets. If you prefer books, I can suggest Grokking Algorithms, published by Manning, which is a general introduction to basic algorithms and data structures. (9:23)

Most important data structures

Alexey: In your opinion, what are the most important algorithms and data structures that we should know? By "we", I mean developers, data engineers, and data scientists. So, for anyone who programs, what kinds of algorithms and data structures should they know? (10:08)

Marcello: Importance is relative. It depends on your field and what you're actually doing. The basic data structures can be the most important. They are the ones that can make the greatest impact. Misusing an array or a list can hurt the performance of your application a lot. And you use these data structures all the time, so they are the most important ones. (10:34)

Alexey: So you need to know array, list, when to use them, when to use set, when to use a dictionary, right? (11:20)

Marcello: For example, knowing when you should use an array, or when you should use a list, depending on what you have to do. If you need random access, an array is the best choice. But if you are always adding elements at the front, an array is complicated — you will have to do a lot of copying and moving of memory. It becomes a mess. Besides arrays and lists, the bare minimum for me would be stacks, queues, these kinds of structures. (11:30)
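The trade-off Marcello describes is easy to see in Python. This is a hypothetical micro-benchmark (not from the conversation): a Python list is a dynamic array, so inserting at the front shifts every existing element, while `collections.deque` supports constant-time appends at both ends.

```python
from collections import deque
import timeit

def prepend_list(n=10_000):
    items = []
    for i in range(n):
        items.insert(0, i)   # O(n) per insert: shifts all elements -> O(n^2) total
    return items

def prepend_deque(n=10_000):
    items = deque()
    for i in range(n):
        items.appendleft(i)  # O(1) per insert -> O(n) total
    return items

list_time = timeit.timeit(prepend_list, number=1)
deque_time = timeit.timeit(prepend_deque, number=1)
print(f"list.insert(0, x): {list_time:.4f}s, deque.appendleft: {deque_time:.4f}s")
```

Both functions produce the same sequence; only the cost differs, and the gap grows quickly with `n`.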

Alexey: And sets as well? (12:09)

Marcello: Sets, yes, of course. (12:12)

Learning the abstractions

Alexey: Let's say if we use Python, it comes with a set of different data structures. So knowing — at least having some idea — how they are implemented internally is useful. Like, if you want to add something, how does it work inside? If you want to check if something is in a list or is in a set, how does it work? And things like this. (12:17)

Marcello: Yes, absolutely. With algorithms, you have to distinguish the implementation and the abstract data structures. The first step would be understanding what's the abstraction behind it. And then you can implement it in many ways. For example, you could implement a dictionary with anything — with a list or an array, or a tree, or a hash table. All these implementations have pros and cons. They do well on some operations and perform poorly on other operations. (12:54)

Marcello: If you use a language like Python, you are more interested in understanding the abstraction behind it. What's the API of the dictionary, what are the operations that you do? Then if you delve into the language, if you're performing time-critical operations or memory-critical operations, then you might want to dive into the implementations. Understand how we can leverage those or if they can present any problem or any bottleneck. (13:36)

Alexey: So you can take all the data structures that you should use, from Python or any other language, and you learn their APIs. What are the possible methods? And then try to understand how they work internally, right? (14:13)

Marcello: Yes. For example, you mentioned the set. It's important to know what the contract is that the client has with a data structure like a set — you can add elements, you can remove elements, but there will be no duplicates. You can expect that insertions and membership checks are reasonably fast compared to an array, although this also depends on the implementation. (14:33)

Learning algorithms if they aren’t needed at work

Alexey: You mentioned that you worked as a web developer at some point. I heard this from many web developers, and also from data scientists as well. Let's talk about web developers. These days, they do simple things — they create simple web applications. They say, "I don't actually need algorithms for that. All I need is this library React. It works, and I don't need to use algorithms." (15:09)

Alexey: So how do I learn algorithms if I don't need them at work? It's not unique to web developers. (15:45)

Marcello: First, I'd like to challenge the assumption that you don't need algorithms for that. Whether you are a web developer, a data analyst, or a data scientist, you use algorithms more than you think. Even the basic ones that we mentioned earlier — I cannot believe that you are not using arrays or lists. That can make a big difference. As a web developer, you may have time constraints or resource constraints. Or you may have to handle large data sets as a data scientist. Web developers can be in situations where using the right data structure makes a difference. For example, if you have to provide some spell-checker functionality. If you know what Bloom filters or tries are, then you're in a better position. Otherwise, you might end up reinventing the wheel or providing a suboptimal solution, whether or not you use a third-party library. (15:57)

Marcello: To go back to your question, how do you master algorithms if you don't have the chance to work every day on this in your job? (17:40)

Alexey: Yeah, I don't implement spell checkers every day. How do I learn how to use Bloom filters? (17:51)

Marcello: There are a few things. If you're interested in the topic, there are a lot of resources. You can do some learning on your own, you can set goals. But if you're looking for extra motivation, joining some competition like Google Code Jam or something like that, can be a good push for you. You can get motivated to learn more. And more than that, it gives anyone the chance to learn in the field and have some practical experience with these algorithms — not just knowing the theory but actually learning to use them and to take advantage of these data structures. (18:01)

Alexey: So if you cannot do this at work, try to do this outside of work. (19:07)

Marcello: Well, it's not common that at work you need to implement these algorithms from scratch. But you can learn how to use them at work. One thing you can do — if you see that there is a bottleneck or see some room for improvement when profiling your application, you can try to learn which algorithms can help in similar situations and try to apply them. Especially if you use a mainstream programming language at work, it's easy to find libraries that implement common and advanced algorithms. Then you will see how they can make a difference. (19:14)

Common mistakes when using wrong data structures

Alexey: One mistake I often notice in code is people accidentally use a list for checking for containment instead of using a set. Simple things like that are very common for web developers and for data scientists. This is a very common operation. You have something that comes in, and you want to check if this is something that you already know or not. For that, you check if this X is in a collection of seen things. If we just replace a list with a set, then we see an order of magnitude improvement in speed. (20:14)
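The containment check Alexey describes can be sketched like this (a hypothetical micro-benchmark, not from the conversation): a membership test on a list scans element by element, while a set hashes the key and checks one bucket, so the set version is dramatically faster on large collections.

```python
import timeit

seen_list = list(range(100_000))
seen_set = set(seen_list)
probe = 99_999  # worst case for the list: the element is at the end

# `x in seen_list` is O(n); `x in seen_set` is O(1) on average.
t_list = timeit.timeit(lambda: probe in seen_list, number=100)
t_set = timeit.timeit(lambda: probe in seen_set, number=100)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

The answer is identical either way; only the lookup cost changes, which is exactly the order-of-magnitude improvement mentioned above.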

Marcello: Exactly. Similarly, I have seen that for keeping track of elements, for adding elements to a list, people add it to the wrong end of the list. For example, in Scala, or in Haskell, adding it to the wrong end of the list can cause this simple operation to become slow and time out your server. (20:58)

Importance of data structures for data scientists

Alexey: Coming back to the question, "How important are these data structures for data scientists?" I think we just mentioned this particular use case like checking for containment. And as a data scientist, I do this operation very often. In your opinion, are there other cases where it's very important to know data structures for data scientists? (21:38)

Marcello: Whenever you are working on huge data sets, even the slightest improvement can make a difference in time. And if you have an order of magnitude improvement, that makes a tremendous difference. It can be speeding up searches — using Bloom filters instead of dictionaries to keep track of what you have already seen. It can be nearest neighbour search for searching in huge multi-dimensional data sets. I think they are even more important for data scientists. (22:12)

Marcello’s book - Advanced Algorithms and Data Structures

Alexey: You mentioned Bloom filters and approximate nearest neighbour search. This is actually something I wanted to talk to you about. You cover them in your book. So maybe let's talk a bit about your book. First of all, what is there in your book? Can you tell us a bit about that? (23:10)

Marcello: The idea for writing this book was to provide a bridge between theoretical knowledge on algorithms in textbooks and more practical knowledge from hands-on books. My book covers both the theory and more practical aspects of how to use the algorithms. For each data structure or algorithm in the book, the focus was coming up with a real-life use case where you can make a difference by using the right algorithm. The problem can also be the opposite: you can make a negative difference by using the wrong data structure in the wrong place. If you learn to avoid that, that's already great and you can improve the performance of your applications. (23:39)

Marcello: There are 18 chapters and 3 parts. The first part and the appendices cover the basic data structures — they cover the ground. Then we go into more complicated algorithms. In the second part, we cover nearest neighbour search, machine learning clustering, explain the MapReduce programming model. In the third part, we cover graphs, evolutionary algorithms, and optimization in general — different options for permutation problems from random algorithms and random sampling to gradient descent, simulated annealing, and genetic algorithms. (25:04)

Alexey: It's called "advanced algorithms". Does it require some knowledge of algorithms already? (26:19)

Marcello: We try to cover the basics in the appendices and in the first few chapters. You shouldn't need anything more. Of course, if you had "Algorithms 101", or if you have previous experience with the topic, you're in better shape. (26:31)

Alexey: I guess if somebody watches this MIT course that you recommended, or the Coursera course by Tim Roughgarden, then these will give enough foundation to continue with your book. But even if they don't do these courses, you cover everything in the appendices as well as the first chapters, right? (27:08)

Marcello: Yes, we cover everything you need to start. You don't need complex math or knowledge of linear algebra; you only need some initial knowledge of programming. We don't use a single programming language, we use pseudocode, so anyone with any background can understand how the algorithms work. The only thing that will help is knowing what a for loop or a conditional is. (27:38)

Alexey: I know that you also have a GitHub repo, where all these algorithms are implemented in every possible language. (28:37)

Marcello: Not every possible, but my goal is to have them in as many languages as possible. For now, most of them are implemented in Java, JavaScript, and Python. Soon, in Scala, and I was hoping to add C++ and Rust later. (28:49)

Alexey: So you also had a lot of fun implementing it in all these different languages. (29:11)

Marcello: Yes. It's fun. (29:17)

Alexey: Do you plan to cover Go as well after you cover these ones? (29:20)

Marcello: Yeah, why not? I love Go. (29:27)

Bloom filters

Alexey: When I looked at the table of contents, I got interested in Bloom filters and approximate nearest neighbours — and coincidentally, this is what we already talked about. I thought maybe we could cover these data structures a bit? (29:43)

Alexey: Let's start with Bloom filters. So what problem do they solve? And why do we need them? (30:09)

Marcello: It's not a coincidence — they're very useful for data science. The Bloom filter is quite an interesting data structure. Surprisingly, it's not as widespread as I would expect. Bloom filters solve the traditional dictionary problem. The dictionary is a container. You can save entries there and retrieve them quite fast. There are many different ways you can implement it. For example, you can implement it as a tree — as a fully balanced tree or a binary search tree. Then you get good performance for almost all applications. But what people usually associate with dictionaries are hash tables — they are synonyms in many languages. (30:17)

Marcello: Bloom filters work similarly to hash tables — they leverage hash functions. But they follow a different approach compared to hash tables, which allows them to use limited memory. If you have a large data set, you might not have enough memory or disk space to use a hash table. This happens especially when you store variable-size data such as strings in your containers. In that case, you can store each entry in a Bloom filter regardless of how much space they require. You can store them with the same amount of space. And we need a fixed number of lookups to find those elements. (31:42)

Marcello: Of course, you have to pay a price for this. The price can be performance, because each time you look up or store an entry, you have to hash it many times. The other big disadvantage is that you can have false positives with Bloom filters. If you look up an entry, the Bloom filter can tell you that it was stored although it actually wasn't. This is caused by the way it works internally. I explain this in chapter eight. I don't know if we have time to explain it. (33:12)

Alexey: Probably not. I just wanted to ask, what are they used for? To summarise you said: we need to use this data structure when we have a limited amount of memory. It uses hashes to look things up. We use it to check if something is in our Bloom filter or not — for containment. But the way it works, sometimes it gives us false positives. It can say "this item is there", but actually it's not true. (34:03)
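A minimal sketch can make the summary above concrete. This is an illustrative toy, not the book's implementation: each entry sets k bit positions derived from k salted hashes, and a lookup reports "present" only if all k bits are set. Since different entries can set overlapping bits, a lookup can find all its bits set by accident — that is exactly where false positives come from, while false negatives are impossible.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter sketch (illustrative only, not production-ready)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different salts.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False -> definitely absent; True -> probably present
        # (false positives possible, false negatives never).
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(num_bits=1024, num_hashes=3)
for url in ["https://example.com", "https://example.org"]:
    bf.add(url)
```

Note that the filter stores only bits, never the entries themselves, which is why its memory use is fixed regardless of how large the stored strings are.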

Where Bloom filters are useful

Marcello: Sometimes it's not a big deal. Bloom filters are used in many, many places. For example, in crawlers, to check if a page was already visited — by looking at the URL, or even at the content of the page. They were used in spell checkers, though now they have been replaced by tries. But for a long time, they were used for that. They are used a lot in routing tables to check if an IP address was already seen or not. In all these cases, if you have a false positive, it's not a big deal. In a crawler, you will just process the page again. With Bloom filters, you can balance the amount of memory you use against the false-positive ratio. You can control how often false positives happen and how often you pay this penalty. (34:43)

Alexey: Maybe I can also tell about a use case I had a couple of years ago at my previous company. It's an adtech company. They're selling advertisements on mobile devices — all these annoying ads that you see when playing games, we contributed to that. (35:59)

Alexey: Every phone has some ID — the device ID. Let's say I am a returning user of an app. I have used the app already, and the owners of the app want to bring me back. For example, I played 10 levels and stopped. They want to show me an ad saying, "Hey, come back, finish your game". So they have the device IDs of everyone who played the game, but stopped — and there are hundreds of thousands of device IDs if a game is popular. (36:25)

Alexey: When I open a different app, it's sending a request... There is an auction happening under the hood, but it doesn't matter now. But we want to check if we know this person or not — is it a returning user or not? Imagine that for everyone in the world who's holding a phone right now, we want to show an ad. We only want a subset of those people — only the returning users. For that, we use a Bloom filter to check if we know this user or not. Because it's impossible to store everything in memory. If it turns out that we actually don't know this user, even though we think we do, it's not a big deal. We just show that person an ad. We lose a fraction of a cent, but the world doesn't stop because of this. (37:25)

Alexey: It's used a lot in industries like marketing. Every time you want to bring back a user, you need to store all these users. This is when I learned why Bloom filters exist and why we actually need them. Before that, I had no idea. Previously, I just watched this course on Coursera by Tim Roughgarden. It looked complex and I had no idea why this thing is actually needed. (38:32)

Marcello: It is a perfect use case. (39:04)

Approximate nearest neighbours

Alexey: What about search trees? You have another part of your book where you talk about approximate nearest neighbours. Maybe we can talk about this use case as well. Why do we need approximate search trees for approximate nearest neighbours? (39:10)

Marcello: We need nearest neighbour search in many fields, especially in data science — when we need to search in multi-dimensional data. A binary search tree is a fast way to search in a static or slowly changing set. You can also use self-balancing search trees like red-black trees to tackle more dynamic sets. But they work for one-dimensional data sets, and we often have to deal with multi-dimensional data — even geographical data with geolocations. And there are also data sets with hundreds of features. (39:33)

Marcello: Binary search trees don't generalise well to multi-dimensional sets. It's possible to use them, and it's still faster than going through all the data points, but for multi-dimensional sets, it may become costly even to compare a single data point with the rest of the data. (40:59)

Marcello: The way we solve this is nearest neighbour search. There are different data structures for that. The first one invented to deal with this particular problem was the KD tree. It's 40-50 years old, and for a long time, that was the best solution. However, now there are even better structures — KD trees have some problems. They work well up to a certain dimensionality of the data, but they don't work well with high-dimensional data sets. And they also have a problem with dynamic sets. (41:37)

Marcello: In the book, we go through a credit risk example to whet the appetite, to explain the basics and explain why nearest neighbour search is important. Then we go through a real case of using geolocation for a delivery system of an online shop — to handle millions of orders and for each of them, find the closest warehouse from where the goods can be shipped. We then go through R-trees and SS-trees (similarity search trees), which handle high-dimensional spaces better, and allow this "approximate nearest neighbour" search. (42:44)

Marcello: Sometimes we don't need the actual best possible result — we can be fine with a close-to-optimal result. For example, if two warehouses are approximately 10 km away from the destination, it doesn't matter if one is 100 meters closer. If we can perform this search faster and find a sub-optimal solution that is only 0.1% or 1% further away than the best possible one, then for many, many problems, in many, many areas, it's pretty much the same. (43:44)
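To make the KD-tree idea concrete, here is a toy sketch for 2-D points, loosely echoing the warehouse scenario above (the coordinates are made up; real implementations like the book's handle higher dimensions, balancing, and k-nearest queries). The tree splits alternately on x and y; a query walks down to the likely region first, then backtracks into the other side only when the splitting plane is closer than the best match found so far.

```python
import math

def build(points, depth=0):
    """Build a KD tree by splitting on alternating axes at the median."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    """Return the point in the tree closest to `target` (Euclidean)."""
    if node is None:
        return best
    point, axis = node["point"], node["axis"]
    if best is None or math.dist(point, target) < math.dist(best, target):
        best = point
    diff = target[axis] - point[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Prune: search the far side only if the splitting plane is closer
    # than the current best distance.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best

# Hypothetical warehouse coordinates and an order location.
warehouses = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(warehouses)
print(nearest(tree, (9, 2)))  # -> (8, 1)
```

The pruning step is what makes the structure faster than brute force on average — and relaxing it (accepting a slightly farther plane) is one way "approximate" variants trade accuracy for speed.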

Searching for most similar vectors

Alexey: I have an example in my mind, but I'm not sure if this is a great example for search trees. I work at OLX. OLX is an online marketplace and we have a recommender system there. In the recommender system, we want to recommend to a person items that they might be interested in. Think of Amazon as well — based on what you saw previously, we want to recommend something that the user might be interested in. For that, we represent each item with a 16-dimensional vector — an array with 16 numbers. Then we do a similar thing with the users — we represent each user as a 16-dimensional array. You have an array for a user, and you have an array for each item. Then, for each user, we want to find the closest possible item vector to the user vector. We look at all the items, and we try to find the closest one. Often, we don't need the closest one. We just need something that is close enough. Is it a good use case for that? (44:46)

Marcello: Yes, it's a perfect use case. This, or finding similar images, if the images are translated to feature vectors. For example, we may want to find similar images to a product that the user — or similar users — already saw. Even finding not just the closest one, but the five closest profiles to some users or five closest images. It's a perfect use case. (46:16)

Alexey: We use a library for that. It's called "faiss", it's from Facebook. To be honest, I don't know what it actually uses inside. I just know that it works faster than brute-force search. It probably uses one of those data structures inside. (46:50)
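The baseline that libraries like faiss improve on is the brute-force scan Alexey describes: compare the user vector against every item vector. Here is a minimal sketch with made-up random data (the 16 dimensions follow the example above; an approximate index would avoid touching every item, at the cost of occasionally returning a close-enough rather than closest match).

```python
import math
import random

random.seed(42)
DIM = 16  # dimensionality from the example above

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical embeddings: one vector per item, one for the user.
items = [[random.random() for _ in range(DIM)] for _ in range(1_000)]
user = [random.random() for _ in range(DIM)]

# Brute force: O(number of items) distance computations per query.
best_item = min(items, key=lambda item: euclidean(user, item))
```

With millions of items and many queries per second, this linear scan is exactly the cost that approximate nearest neighbour indexes are built to avoid.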

Marcello: It's possible. These data structures are used a lot in machine learning. For example, in clustering, K-means or other clustering algorithms can use this nearest neighbour search to speed up the algorithm. (47:11)

Knowing frameworks vs knowing internals of data structures

Alexey: We have a question that may be quite related to the point I just brought up — about using a library and not necessarily knowing what is inside. It's a question from WingCode. Is it necessary to know data structures? Or knowing how to use a framework is more important? For example, we can just take an off-the-shelf implementation of Bloom filters. Do we need to know how these things work inside? (47:47)

Marcello: The most important thing is to know how they work on the outside. What you can expect? What is the contract that you have with this structure? What are the guarantees that you have from them? Most of the time you can be fine not knowing the internals. You need that only if you have to improve your performance or if you run into problems. The other case where you might want to know how things work is when you have to do customization — when you cannot use something off-the-shelf and you have to write your own. Another possible case is when you're using a new programming language for which there is no such library yet. So you have to write your own. You have to be the first one. But it's more common that you have to implement a customised solution yourself. (48:23)

Serializing Bloom filters

Alexey: I was talking about this use case of an adtech company. We ended up implementing Bloom filters ourselves. We needed to have exactly the same implementation for multiple languages — for Go, for Java, and for JavaScript. And for Python as well — because we are data scientists, and the data scientists work in Python. If we create a Bloom filter, we need to make sure that this Bloom filter can be used by whatever other language we were running. We ended up implementing Bloom filters ourselves. I did that. I remember that I took the implementation from somewhere and just re-implemented it. I cannot claim I actually know how it works. But it seems to work. (49:52)

Alexey: In Bloom filters, you have false positives. So, you need to know at least a little bit about the internals of Bloom filters — to understand that you can control false positives based on the size of your set and the false-positive error rate you want. How can you make sure that you minimise this error rate? You need to know a little bit about that to use Bloom filters. But maybe for a first use case, you can just go ahead and use something like Google Guava. It's a library in Java with a pretty good preset configuration. You don't need to care about what is inside. It just gives you an okay Bloom filter. Then, if performance is not good, you can try to understand what's going on inside and tune it. (50:44)
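The tuning Alexey mentions boils down to two standard sizing formulas (the usual textbook approximation, not specific to any library): for n expected items and a target false-positive rate p, the optimal bit count is m = -n ln p / (ln 2)^2, and the optimal number of hash functions is k = (m / n) ln 2. A quick sketch:

```python
import math

def bloom_parameters(n, p):
    """Bits (m) and hash count (k) for n items at false-positive rate p,
    using the standard optimal-sizing approximation."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

# One million items at a 1% false-positive rate:
m, k = bloom_parameters(n=1_000_000, p=0.01)
print(m, k)  # ~9.6 million bits (about 1.1 MB) and 7 hash functions
```

For comparison, storing a million device IDs as strings would take tens of megabytes, which is why the adtech scenario above fits a Bloom filter so well.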

Marcello: Yes. This use case is an ideal example. You needed a serialisation of this Bloom filter and you needed to have the same seeds for the hash functions. That's another case where you might want to have control of these things. (51:56)

Alexey: We were producing Bloom filters in a Python job — we're data scientists, that's the only language we know. But then they were used by production systems written in Java, Go, and, for some reason, JavaScript. And they all needed to read these Bloom filters. It was fun. I liked doing that. (52:23)

Algorithmic problems in job interviews

Alexey: What do you think about job interviews? In job interviews, companies seem to be really obsessed with algorithms. You worked at Twitter, at Microsoft as well. I have an impression that if you want to get into these companies, you really need to know all the "Algorithms 101", and you also need to know more advanced algorithms. You need to know trees, graphs, and so on. What do you think about this? Is it reasonable that these companies expect everyone to know these things? (52:55)

Marcello: These companies have a lot of experience interviewing people. It's hard for me to say. When I was at Twitter, we worked actively on changing the way interviews were done. When you focus only on challenges and algorithms, you are not testing whether candidates have the right knowledge and the right skills. Maybe they will use them. Certainly, it's good to know if a candidate knows about performance bottlenecks and how they can screw everything up by misusing an array. But it's not the only part of the job. There is much more. I've seen candidates who did a stellar job on the algorithms interview and then could not use Git. They had to catch up a lot in their first month on the job. I think it's too much. It's good to have some questions on these topics, but it's also good to test a different set of skills in the interview. I like mixing different kinds of interviews: doing some pair programming, or even some debugging in the interviews. You can see how it actually is to work with this person, and what they can do on the job. (53:47)

Alexey: Like debugging Bloom filters? (55:37)

Marcello: That's maybe a tough one. (55:42)

Alexey: Could be. I remember I had an interview with Facebook. I don't remember the exact questions. But you need to solve two problems in 35 minutes. Two, not just one. If you spent 30 minutes solving one, then you have just five minutes for solving the second one. It's too cruel. (55:45)

Marcello: When you have limited time, maybe you don't have the right idea immediately. Also because of the pressure. (56:20)

Alexey: For me, they had two such interviews in a row. So, two times like that. That was too much. (56:32)

Important data structures for data scientists and data engineers

Alexey: I almost missed one interesting question. There are quite a lot of algorithms and data structures. For data engineers, data scientists, and everyone else who works with machine learning, what are the most needed ones? We talked about arrays and sets. Is there anything else that every data scientist and data engineer should keep in mind? (56:43)

Marcello: For sure, the basics. At least knowing what binary search is — it's a must. (57:20)

Alexey: Especially if you want to get to Facebook. (57:33)

Marcello: If you're going to interview for these companies, I have to share that I had similar experiences as you. You need to know all the basics and graphs as well — like DFS, BFS, even Dijkstra. But probably not much more than that. Sometimes I got questions that could be solved with interval trees or more exotic ones. (57:40)

Alexey: There are programming contests. You need to use some very smart data structures to solve problems there, like interval trees. These people will have no problems getting into Facebook. (58:17)

Marcello: Actually, I once was interviewed by one of the former champions. (58:34)

Learning by doing

Alexey: Maybe the last one. Can you suggest good resources for building projects to learn data structures and algorithms? To learn them by doing — by using them? I think you already suggested taking part in online competitions, like TopCoder, and there are many of them. But maybe we can build our own pet project to learn data structures and algorithms? (58:53)

Marcello: There are sites like LeetCode. You can find problems there and try to solve them. You can also see other people's solutions and learn new techniques for solving the same problems. Another suggestion is working on an open-source project. You can join one or start your own. You can also get feedback, especially if you join an existing project. If you start your own, you'll need to put yourself out there to get some feedback. (59:44)

Importance of compiled languages for data scientists

Alexey: Thank you. Another question popped up. Do you recommend to data scientists, interested in data structures and algorithms, to go into compiled languages like C++ or Java, rather than use Python? Is there any advantage going this way? (1:00:39)

Marcello: I think Python is perfect for data scientists. It has the best libraries for data. It's like Esperanto for data science. But if you need to implement a production model, then you might need C++ more than Java. It will allow you to write more performant code, with multi-threading and greater control over the low-level details. Maybe this is sometimes the difference between data scientists and data engineers — data engineers could be more on the C++ side. It can be useful. But I wouldn't suggest switching one for the other. Rather, learn about it. (1:01:09)

Alexey: And there is a thing called “Cython”. You don't need to ditch Python completely, you can use Cython — it's almost C. You can have quite performant and typed code for number crunching. You can still stay within the Python realm, but you also get the benefits of writing in native code. (1:02:25)

Wrapping up

Alexey: I guess that's all for today. How can people find you? (1:03:01)

Marcello: On Twitter and also on Slack. (1:03:07)

Alexey: Thanks for the chat. Thanks for joining us today, for sharing your experience with us. It was nice chatting with you. (1:03:32)

Marcello: Thank you. Thank you again for having me. (1:03:41)

Alexey: I also wanted to thank everyone for joining us today and asking questions. That's it. Have a great weekend. (1:03:42)
