DataTalks.Club

Working in Open Source - Probabl.ai and sklearn

Season 18, episode 4 of the DataTalks.Club podcast with Vincent Warmerdam

Transcript

The transcripts are edited for clarity, sometimes with AI. If you notice any incorrect information, let us know.

Alexey: This week, we'll talk about open source again. We have a very special guest, Vincent, for the second time. This is not Vincent's first appearance. You were one of the first guests on this podcast more than three years ago. (1:40)

Vincent: Three to four years ago. Yeah, something like that. (2:00)

Alexey: I was checking the previous podcast episode before we started, and it was already season two. Season one had only five episodes, and you were one of the first recordings of season two. We didn't have transcriptions back then, so I had no idea what we talked about. But the topic was getting started with open source. Today, we'll talk about open source again. Vincent, you come to mind when I think about open source because of the numerous small libraries you've created and discussed. (2:03)

Vincent: It's not thousands, just to be clear. (2:49)

Alexey: Hundreds? (2:52)

Vincent: It's a small dozen, for sure. It is a bunch, but thousands is a lot. (2:52)

Alexey: Compared to the average person in the industry... (3:02)

Vincent: It's probably above average. (3:09)

Alexey: Maybe the 99th percentile? (3:11)

Vincent: I did some research and found out I'm in the top 10 for open source contributions on GitHub in the Netherlands. I knew three other people in that top 10. I'm kind of up there, but not thousands. (3:14)

Alexey: Fair enough. That was the bio. I promised to improvise since we don't have the bio in the show notes. You will tell us more. (3:46)

Vincent: Sure. Thanks for the lovely intro. (3:57)

Vincent’s Background

Alexey: Thanks for being here. Before we start, I want to give a shout-out to Johanna Bayer for preparing today's interview questions. Thanks, Johanna. The reason we're speaking today is that she met you at a conference, correct? (4:00)

Vincent: Yes, at a PyLadies Code Sprint. There were many projects, and I was there on behalf of Scikit-Learn to help people get their first PR in. Johanna was in my Scikit-Learn bubble doing docs work and then asked if I wanted to come on this podcast. I recognized the logo and agreed to return. A few years ago, we talked about Scikit Lego, one of the projects we discussed. (4:19)

Vincent: That project played a role in getting me my current job. Scikit Lego consists of Lego-like bricks I built using Scikit-Learn tricks. It started as a side project but now has around 30,000 downloads a month and is in production in many places. It's one reason the Scikit-Learn core maintainers thought having me at the company would be useful. I initially thought it was just a cute plugin, but it turned out to be significant during my job interview. (4:19)

Alexey: We usually start our interviews with your background. Can you talk about your career journey so far? (5:48)

Vincent: Sure. I studied econometrics and operations research, which is quite math-heavy. Around graduation, I discovered machine learning and wanted to try it out. I decided to backpack while taking some client programming work with me. I found programming as enjoyable as clubbing, which was my signal to pivot my career towards tech instead of being a consultant. (6:03)

Vincent: I did tech consulting for a while before getting a job offer from Rasa as a Developer Advocate. I hadn't heard of that role before, but it sounded interesting as a gateway to NLP. The consultancy I was at didn't offer NLP opportunities, so I switched. (6:03)

Vincent: I was a big fan of spaCy, and two years after working at Rasa, Explosion AI (the creators of spaCy) hired me as a Developer Advocate and Core Developer on their Prodigy product. I did Developer Relations and core development there for two years. Eventually, I joined some Scikit-Learn core maintainers starting a company called :probabl., and that ball got rolling quickly. (6:03)

Vincent: Now I do open source work and Developer Relations at :probabl. That's a quick summary of my career. Along the way, I've organized conferences and meetups and built several open source projects. (6:03)

Alexey: How what? (8:01)

Vincent: That's an American saying, 'That's how the cookie crumbles.' (8:03)

Alexey: What does that mean? (8:06)

Vincent: It's similar to 'Bob's your uncle,' meaning 'That's the short story.' A cookie crumbles in one way; the crumbs fall down and never split in other directions. It's an American saying. Sayings are weird. [chuckles] (8:08)

Scikit-Learn’s History and Company Formation

Alexey: There are companies behind open source products that typically don't share the same name. For example, Explosion AI created spaCy, and :probabl. is associated with Scikit-Learn. Why didn't they just name it Scikit-Learn? (8:33)

Vincent: Rasa has 'Rasa' in their product name. Some companies do that. (8:53)

Alexey: They do. So, why not name it Scikit-Learn instead of :probabl.? (8:56)

Vincent: In :probabl.'s case, Scikit-Learn is a huge project with a vast community. Some core maintainers work at :probabl., but the company is not Scikit-Learn. There's a distinction. (9:04)

Vincent: You could say :probabl. acts as a brand operator. We intend to hire many open source maintainers to work on this, but it's a larger community. Claiming the name for a company wouldn't make sense. (9:22)

Vincent: And Explosion does spaCy but also other stuff. (9:40)

Alexey: Prodigy, right? (9:45)

Vincent: Yes. Naming your company after your open source project limits you to that project. :probabl. will likely do more than just Scikit-Learn. We might offer training, consultancy, and possibly other products. There are many reasons not to call ourselves Scikit-Learn. (9:46)

Vincent: Also, the legal aspect. Scikit-Learn is a name that already exists. It's probably registered and trademarked, so we can't use it. (9:46)

Alexey: I'm checking the Scikit-Learn website. I don't see any trademark, but it probably belongs to NumFOCUS or another organization. (10:28)

Vincent: To my knowledge, it's its own entity. NumFOCUS is an umbrella for funding some open source projects. There are NumFOCUS projects and 'associated' projects. I believe Scikit-Learn is an associated project. (10:39)

Alexey: So NumPy is a NumFOCUS project, and Scikit-Learn is only associated? (11:04)

Vincent: That's my understanding. These distinctions matter because Scikit-Learn has been a large community for decades. It's not something a company can just claim. (11:10)

Alexey: Before that, the creator originally... It originated from Inria, the research lab in France. What's the story there? (11:35)

Vincent: I know parts of it. The original version may have started as a Google Summer of Code project. Inria played a significant role in its maintenance, with several people working on Scikit-Learn and writing papers. Companies also sponsored developers. For example, Andreas Müller was supported by NYU for his work on Scikit-Learn. (11:50)

Vincent: Microsoft has a similar arrangement now. Companies like Quansight Labs provide consulting and contribute PRs. It's beneficial for companies to have core developers on their team. (11:50)

Alexey: Like Scikit Lego. (13:24)

Vincent: Yes, and projects like UMAP. If you want to use the UMAP clustering visualization algorithm, you need to install it as a Scikit-Learn plugin. Many projects like it are valuable to the ecosystem. (13:26)

Maintaining and Transitioning Open Source Projects

Alexey: Contributing something new to Scikit-Learn is difficult because maintainers are cautious about maintaining new methods. It's easier to create a plugin that follows the API and is maintained separately. (14:01)

Vincent: There are concerns beyond maintenance, such as benchmarking and quality. Scikit-Learn is seen as an example to follow, so what's included should be high quality. Some core algorithms are kept for historical reasons, but not every new paper can be included, or it would become unmanageable. (14:35)

Vincent: UMAP isn't in Scikit-Learn because it relies on Numba, a just-in-time compiler built on LLVM that speeds up Python code. Introducing new dependencies can be an issue. Scikit Lego, which I helped maintain, looks at Scikit-Learn issues and implements fun and useful features that Scikit-Learn can't include. (14:35)

Vincent: Scikit Lego is fun and useful for maintainers. It's not taken as seriously as Scikit-Learn, but it allows for experimentation and implementation of features that Scikit-Learn can't include. (14:35)

Alexey: Can you tell us more about Scikit Lego? How did maintaining this library lead to working at :probabl.? (16:43)

Vincent: As a consultant, I noticed the need for reusable components for tasks like selecting columns from pandas. Instead of re-implementing the same thing repeatedly, I created a collection of Lego bricks. A colleague and I used these components for training and teaching open source, making it a utility for corporate training. (16:53)
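The kind of reusable "Lego brick" described here can be sketched roughly as follows. This is a minimal, dependency-free illustration, not code from Scikit Lego: the class name `ColumnSelector` and the list-of-dicts input are assumptions made for the example, and only the fit/transform convention is borrowed from Scikit-Learn.

```python
class ColumnSelector:
    """Hypothetical 'Lego brick' that keeps only the requested columns.

    It mimics the fit/transform convention of Scikit-Learn transformers,
    but works on plain lists of dicts so the sketch has no dependencies.
    """

    def __init__(self, columns):
        self.columns = list(columns)

    def fit(self, rows, y=None):
        # Nothing is learned here; fit only checks that the columns exist.
        for row in rows:
            missing = [c for c in self.columns if c not in row]
            if missing:
                raise KeyError(f"missing columns: {missing}")
        return self

    def transform(self, rows):
        # Keep the requested columns, in the requested order.
        return [{c: row[c] for c in self.columns} for row in rows]

    def fit_transform(self, rows, y=None):
        return self.fit(rows, y).transform(rows)


rows = [
    {"age": 31, "city": "Amsterdam", "income": 52000},
    {"age": 45, "city": "Berlin", "income": 61000},
]
selected = ColumnSelector(["age", "income"]).fit_transform(rows)
```

Because the brick follows the same method names as other transformers, it can be written once and reused across projects instead of being re-implemented each time.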

Vincent: Using Scikit Lego for corporate training helped me get contributors. Offering students the chance to commit to open source as part of their lesson was a win-win, making the library better while helping them learn. (16:53)

Alexey: That's smart. (18:09)

Vincent: It's enjoyable and beneficial. As time went on, I used the library less, so we looked for a new maintainer. Francesco volunteered, and his fresh perspective and enthusiasm improved the project. (18:11)

Vincent: At PyData Amsterdam, I told people we were looking for a new maintainer who actually uses the library. Francesco had already contributed and expressed interest, so we discussed it further. (19:02)

Vincent: Francesco uses Scikit Lego at work, which makes him a great maintainer. He adds features and enjoys maintaining the library, which is crucial for its sustainability. (19:13)

Alexey: You approached people at the conference, asking if they wanted to maintain the library? (20:02)

Vincent: Yes, but it was more about finding someone who uses it. Francesco had already shown interest and made contributions. Meeting in person at the conference solidified it. (20:17)

Alexey: And he uses Scikit Lego at work, right? (21:03)

Vincent: Yes, though some details are private. He finds the library useful and fun, which aligns with our goal of maintaining it as a fun project. (21:08)

Teaching and Learning Through Open Source

Alexey: How do you make a library fun to maintain? (21:49)

Vincent: We celebrate that it's volunteer work. We encourage implementing features if they're in someone's domain or sound fun. We require benchmarks to confirm improvements. If it's not fun, it won't be maintained. (21:51)

Vincent: I have a child now, so I won't spend my evenings on uninteresting tasks. Francesco's fresh perspective and ideas keep the project enjoyable for both of us. He would make a great podcast guest to share his perspective. (21:51)

Role of Developer Relations and Content Creation

Alexey: You told us about Scikit Lego. How did it lead to your current job? (23:29)

Vincent: The Scikit-Learn maintainer group was looking for someone who could clearly explain data science and machine learning without the hype. My resume, including Scikit Lego, showed I took testing and quality seriously, which helped in the interview process. (23:44)

Vincent: The technical interview was lightweight because they could see my work with Scikit Lego. Other factors, like conference talks and keynotes, also helped. Maintaining Scikit Lego showcased my dedication to open source and quality. (23:44)

Alexey: I was talking about thousands of small open source libraries and you said, 'It's not true.' But when it comes to your talks, I searched your name on YouTube... (25:27)

Vincent: Yeah? (25:41)

Alexey: I couldn't finish scrolling. (25:42)

Teaching Through Calm Code and The Importance of Content Creation

Vincent: I'm a frequent speaker at PyData. During COVID, I started Calm Code, a tutorial website as an alternative to DataCamp. I didn't like their approach of pushing future-proof skills. Calm Code focuses on learning tricks to make your day-to-day work easier. (25:46)

Alexey: That’s why it’s 'calm' – no pressure to learn things. You can learn whatever you want. (26:37)

Vincent: Yes, just useful tips to improve your workflow. Making over 700 videos for Calm Code helped me practice clear communication. Creating videos is now like writing a for loop for me. (26:44)

Alexey: Thousands is actually not too far from the truth. (27:20)

Vincent: Yes, counting my work at Rasa and Explosion, it's close to 1000 videos. At Explosion, we focused on quality over quantity. (27:24)

Alexey: Also, counting all your work as a Dev Advocate at Rasa, Explosion, right? (27:30)

Vincent: Yes. At Rasa, I made around 100 videos. At Explosion, I made fewer but more polished videos. With experience, recording videos became easier. The challenge now is coming up with good examples and insights. (27:39)

Alexey: Coming up with examples is the most difficult part, right? (28:28)

Vincent: Yes. Having access to Scikit-Learn core maintainers helps. I can ask them about annoying issues they've seen on GitHub. There are always interesting experiments and benchmarks to explore. (28:32)

Current Projects and Future Plans for Calm Code

Alexey: Do you still actively put things out on Calm Code? (29:25)

Vincent: Yes, but I've realized that collaboration is more sustainable. Having collaborators with different expertise makes the platform more effective and enjoyable. We're building a proper platform for Calm Code, moving beyond just markdown files. (29:30)

Vincent: The Django app is live. We're learning how to handle payments to make the project sustainable. The goal is to have a hobby project that provides a small income, allowing us to hire external contributors to create content. (29:30)

Alexey: So right now it’s Django? (30:53)

Vincent: Yes, it's a full Django setup. We're adding features and learning as we go. The hope is to cover more topics like databases, data analytics, cloud, Docker, and Kubernetes. (30:54)

Alexey: I see that there is already some Docker stuff, right? (31:38)

Vincent: Barely. There's much more we want to do. I'm interested in exploring when it makes sense to use a custom runner for GitHub Actions to save on compute costs and optimize performance. (31:42)

Data Processing Tricks and The Importance of Innovation

Alexey: A runner executes the action on your environment, not on GitHub’s? (32:18)

Vincent: Yes. Using a VM you own can offer caching benefits and save costs. There are startups like Leaf.cloud offering carbon-negative compute, making it both economically and environmentally competitive. (32:26)

Vincent: Leaf.cloud places server racks in apartment basements, using the heat to preheat water, saving gas. This setup is carbon negative and cost-effective, as the compute is paid for by another party. (33:09)

Vincent: With such setups, you have control and can still achieve great results. Celebrating intellectual freedom and innovation is something we aim to highlight on Calm Code. (33:09)

Alexey: You mentioned that for experienced Python users, pip install is second nature. For newcomers, Docker is challenging. In our data engineering course, Docker is the most problematic module. (34:29)

Vincent: For beginners, pip, Docker, and Git are major stumbling blocks. It's important to teach the conceptual understanding of these tools, not just the commands. (35:07)

Alexey: You want to cover all three, right? (35:19)

Vincent: Eventually, yes. But designing a good course is a one-time effort. (35:21)

Alexey: You already have. I think I saw a logo of GitHub on your... (35:31)

Learning the Fundamentals and Changing the Way You See a Problem

Vincent: I use GitHub in many courses, but Calm Code assumes you're not a complete beginner. You need some programming experience. (35:36)

Alexey: That's what we do, too. Where do people actually learn all that stuff? (35:48)

Vincent: Sometimes it's about teaching the mindset, not just the tool. For example, how to think about Git conceptually or recognizing common issues. (35:54)

Vincent: For large CSV files, don't use Git for data. Use a different system for versioning. Providing this context helps people understand the best practices. (35:54)

Alexey: I've never taken a course for Git, Docker, or Python. I learned by figuring out how to do things when needed. Maybe assuming people will learn as they go is also a good approach. (36:50)

Vincent: To make learning Docker easy and enjoyable, focus on getting people to a 'minimum viable tinkerability' where they're comfortable experimenting. (37:16)

Vincent: Once people reach that level, encourage them to tinker as much as possible. That's a philosophy I'm exploring. (37:16)

Alexey: That's a nice idea. I realized today's topic is actually not education, even though this is great stuff to talk about. (38:14)

Vincent: Sure, let's segue back to Scikit-Learn stuff. (38:22)

Dev Rel and Core Dev in One

Alexey: You work at :probabl. as a Developer Advocate and also implement core features. How do you manage both roles? (38:26)

Vincent: At Explosion, we experimented with a Dev Rel team but decided everyone should be a machine learning engineer. As a good Dev Rel, you should also be skilled in your domain and comfortable creating content. (39:07)

Vincent: It's like being a full-stack developer with a focus on Kubernetes. I see myself as a machine learning engineer with added skills in creating content. (39:07)

Vincent: At :probabl., I'm bootstrapping the Dev Rel practice. I have a whiteboarding playlist, a live stream, and a podcast. Once these are established, I'll focus more on open-source contributions. (39:07)

Vincent: I'm helping with the Skrub effort, doing benchmarks and sharing ideas. I see myself as a senior person working on what matters to the company. (39:07)

Alexey: Yet your title is Dev Advocate, right? Or what's your title? (41:17)

Vincent: It's Developer Relations Engineer. I joked about preferring 'Senior Person,' but it's about fixing problems for the company. Titles like 'senior' or 'junior' feel counterproductive. I do a lot of Dev Rel stuff, so Developer Relations Engineer is fine. (41:21)

Why :probabl. Needs a Dev Rel

Alexey: Why does Scikit-Learn need a Dev Rel? (42:13)

Vincent: :probabl. hired me, not Scikit-Learn. Scikit-Learn has great documentation, thanks to colleagues like Arturo, who leads the docs effort. But there's always more we can do, like interactive code examples and sharing maintainers' experiences through podcasts. (42:20)

Vincent: Scikit-Learn's docs are amazing, but additional content like YouTube videos explaining algorithmic details adds value. We recently reached 10,000 views on our YouTube channel. (42:49)

Alexey: There is already a ton of content explaining Scikit-Learn. But my job is to promote :probabl. and, through it, Scikit-Learn. While the Scikit-Learn community creates MOOCs and other resources, we aim to add value in different ways. (44:07)

Vincent: Via :probabl., I promote Scikit-Learn. For example, we have scalers in Scikit-Learn to standardize data. The Standard Scaler subtracts the mean and scales to unit variance, but there are many complexities involved. A video explaining these details helps users appreciate the intricacies. (44:30)

Vincent: The Standard Scaler must handle various data types, like sparse matrices and data frames, and support partial fit methods for microbatching. These details are hard to cover in a tutorial but are valuable for users to understand. (44:59)
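To make the partial-fit idea concrete, here is a rough sketch of a scaler for a single numeric feature that can be updated batch by batch. It is not Scikit-Learn's implementation: the class name `RunningScaler` is hypothetical, it uses Welford's running update, and it ignores the sparse-matrix and data-frame handling mentioned above.

```python
import math


class RunningScaler:
    """Hypothetical minimal scaler for one numeric feature."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def partial_fit(self, batch):
        # Welford's update: fold in one value at a time, so earlier
        # batches never have to be kept in memory.
        for x in batch:
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
        return self

    @property
    def std(self):
        if self.n == 0:
            raise ValueError("call partial_fit before using the scaler")
        # Population standard deviation (divide by n, not n - 1).
        return math.sqrt(self.m2 / self.n)

    def transform(self, batch):
        # A constant feature has zero variance; fall back to a scale of 1
        # so the values are centred instead of dividing by zero.
        scale = self.std or 1.0
        return [(x - self.mean) / scale for x in batch]


scaler = RunningScaler()
scaler.partial_fit([1.0, 2.0]).partial_fit([3.0, 4.0])
scaled = scaler.transform([2.5, 4.0])
```

Even this toy version has to decide what to do with a zero-variance feature and which variance estimator to use, which hints at why the real component is less trivial than "subtract the mean, divide by the standard deviation".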

Vincent: This is a video coming out this week. Users don't need to know all these details, but appreciating them can be helpful. (44:59)

Alexey: I've never realized that such a simple thing could be that difficult. (47:20)

Vincent: The Standard Scaler is Not Standard is the title of the video coming out this week. (47:25)

Alexey: I see. It's too mathematical. 'Compute the mean, subtract the mean...' (47:29)

Vincent: On the spectrum of math, this is lightweight, but so much can go wrong. (47:38)

Alexey: Yeah, I never realized that. (47:46)

Vincent: Looking at Scikit-Learn source code helped me understand how to implement things in Scikit Lego. It equips me to know which parts are worth diving into. (47:48)

Exploration of Skrub and Advanced Data Processing

Alexey: What is Skrub? You mentioned it several times. (48:27)

Vincent: Skrub is a Scikit-Learn plugin in an experimental phase. Gaël Varoquaux and others are working on it. The goal is to simplify handling tabular data with components like the table vectorizer, which automatically determines the best way to process different types of data. (48:31)

Alexey: This is how you pronounce his last name? (48:42)

Vincent: I'm not sure. Let's call him Gaël. The table vectorizer is an example of Skrub's goal to handle tabular data efficiently. It applies sensible defaults to different types of data, providing a reasonable benchmark with minimal effort. (48:44)

Vincent: These components are too experimental for Scikit-Learn but offer pragmatic and useful solutions. (50:13)

Alexey: Sounds quite cool. (50:25)

Vincent: One feature in Skrub is the GAP encoder, which handles dirty categories by modeling them as text and clustering similar items. This prevents the explosion of one-hot encoding and offers efficient data processing. (50:27)

Vincent: For example, job titles with typos can be grouped into topics, reducing the complexity of encoding. Skrub aims to provide tools for efficient data processing, making it easier to achieve solid benchmarks. (50:40)
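The intuition of treating dirty categories as text can be illustrated with a much simpler stand-in than the GAP encoder. The sketch below is not Skrub's algorithm: it compares character trigrams with Jaccard similarity and groups values greedily. The function names, the threshold of 0.4, and the job titles are all assumptions made for the example.

```python
def ngrams(text, n=3):
    """Set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b):
    """Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = ngrams(a), ngrams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def group_dirty_categories(values, threshold=0.4):
    """Greedily put each value in the first group whose representative is similar enough."""
    groups = {}  # representative -> list of members
    for value in values:
        for rep in groups:
            if similarity(value, rep) >= threshold:
                groups[rep].append(value)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups[value] = [value]
    return groups


titles = ["Data Scientist", "data scientst", "Data Engineer", "Data Enginer"]
groups = group_dirty_categories(titles)
```

Four raw strings collapse into two groups here, so a downstream one-hot encoding would need two columns instead of four.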

Alexey: In our courses, we address questions about handling large numbers of categories. Skrub offers practical solutions, allowing users to start with minimal effort and then fine-tune their approach. (52:09)

Vincent: Skrub may not be perfect for every use case, but it provides a solid starting point. Users can dunk a data frame in and get a reasonable benchmark, appreciating the tools used under the hood. (52:34)

Alexey: Instead of listing all options, the answer could be, 'Try this library and see what it comes up with.' This approach helps users understand the encoders and appreciate the process. (52:46)

Vincent: While Skrub can't handle every unique case, it provides sensible defaults for common scenarios, like encoding dates and times. (53:06)

Alexey: Scikit-Learn existed without a company behind it for a long time. Why start one now? (53:36)

Vincent: My perspective is that relying on academic funding models is risky for such a central open-source project. Creating a company can provide more stable funding and support for Scikit-Learn. Additionally, there's tremendous value in the project, and some companies might be willing to pay for it. (53:47)

Vincent: Having more European tech companies is also a goal. It would be nice to have more tech companies from Europe, similar to how France has Hugging Face and Mistral. Being a company exposes you to industry problems, which is beneficial for the project. (53:47)

Vincent: A company makes sense for these reasons. The exact business model is still developing, but training and consulting are likely components. (53:47)

Alexey: And the exact business model is yet to be determined? (56:15)

Vincent: There are ideas like training and consulting. Collaborations with cloud providers might happen, but it's still early. You can check the TechCrunch article for more details. (56:19)

Alexey: [The article] was published on February 1st? (56:58)

Vincent: Earlier this year. That's when the official announcements started. (57:03)

Alexey: Yeah. The website mentions 'Open source services – provide training, certification, and expert solutions for enterprise AI challenges.' (57:10)

Vincent: Yes, that covers it. (57:19)

Vincent’s Upcoming Projects

Alexey: We don't have a lot of time left. What's your next personal project? (57:24)

Vincent: Calm Code will have a book about expectations versus reality in the field of data. It will cover overpromised aspects of data science and share anecdotes and stories. (57:34)

Vincent: Back in the day, data science was touted as the sexiest profession. Looking back, many promises were overhyped. The book will address these stories and focus on culture and preventing failures. (57:50)

Vincent: I want to write about the clash between expectations and reality in data science. The book will include anecdotes and stories from the field. (58:11)

Vincent: We also have a live stream at :probabl. where we explore new technologies. For example, I'm looking into converting tree-based models into SQL queries for efficient processing. (58:17)

Vincent: I'm exploring whether converting tree-based models into SQL queries can optimize large batch jobs. It's an experiment we'll figure out on the live stream. (59:31)
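As a rough illustration of the idea, a single decision tree can be rendered as a nested SQL CASE expression. This is only a sketch under stated assumptions: the tree is a hand-written nested dict rather than a fitted Scikit-Learn model, the column names, table name, and thresholds are invented, and the "go left when feature <= threshold" convention is assumed. SQLite from the standard library is used just to check that the generated query runs.

```python
import sqlite3


def tree_to_sql(node):
    """Render a decision tree (nested dicts) as a SQL CASE expression."""
    if "value" in node:  # leaf node: emit the prediction itself
        return repr(node["value"])
    condition = f'{node["feature"]} <= {node["threshold"]}'
    left = tree_to_sql(node["left"])    # branch taken when the condition holds
    right = tree_to_sql(node["right"])  # branch taken otherwise
    return f"CASE WHEN {condition} THEN {left} ELSE {right} END"


# Hypothetical tree with two splits and three leaves.
tree = {
    "feature": "age",
    "threshold": 30,
    "left": {"value": 0.2},
    "right": {
        "feature": "income",
        "threshold": 50000,
        "left": {"value": 0.5},
        "right": {"value": 0.9},
    },
}

query = f"SELECT {tree_to_sql(tree)} AS prediction FROM customers ORDER BY rowid"

# Run the generated query against a tiny in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age REAL, income REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(25, 40000), (40, 40000), (40, 80000)],
)
predictions = [row[0] for row in conn.execute(query)]
conn.close()
```

Whether this is actually faster than scoring in Python for large batch jobs is exactly the open question; the sketch only shows that the translation itself is mechanical.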

Alexey: Do you prepare for the live stream or is it complete exploration? (1:00:24)

Vincent: I prepare, but part of the stream is live coding and sharing insights. It's important to be prepared, especially when demoing other projects. (1:00:27)

Alexey: Okay, that's all we have time for today. Thanks for joining us and sharing your experience and future plans. (1:01:15)

Vincent: Have a good one! (1:01:53)

Alexey: Yeah, you too. Have a great week! (1:01:55)
