
DataTalks.Club

Season 2, Episode 3

Contribute to Open Source ML: scikit-learn Pipelines, PRs, Docs & Rasa Conversational AI | Vincent Warmerdam

Show Notes

How do you start contributing to open source ML projects like scikit-learn pipelines—or move from curious user to confident contributor on Rasa’s conversational AI stack? In this episode, Vincent Warmerdam, Research Advocate at Rasa and creator of The Algorithm Whiteboard and calmcode.io, walks through practical, hands-on advice for contributing to open source ML.

Vincent shares his career pivot from design student to data scientist and highlights projects (evol, clumper, memo, whatlies, scikit-lego) that illustrate small-tools-to-impact workflows. We deep-dive into scikit-learn–compatible pipeline components, design principles for low-maintenance APIs, and common mistakes such as publishing to PyPI too early. You’ll get a documentation checklist (README, guides, API reference, examples), guidance on filing reproducible issues, and step-by-step preparation for pull requests: testing, CI, packaging, and pre-commit hooks.

Listeners will leave with concrete strategies for finding the right project, balancing large vs. small repositories, community stewardship and contribution etiquette, and ways OSS work can boost career visibility through talks, blogs, and meetups. If you want actionable next steps for contributing to open source ML, scikit-learn pipelines, PRs, docs, or Rasa conversational AI, this episode maps the path.

Today we’re talking open source with our guest, Vincent Warmerdam. Vincent is a Research Advocate at Rasa. If you check his LinkedIn, you’ll see a lot: he’s made Reddit’s front page, runs calmcode.io for learning to code, has organized PyData Amsterdam and AI Saturdays Amsterdam, and he’s a data evangelist and open-source enthusiast who’s created and maintains several open-source packages. And—last but not least—he has over 80 LinkedIn endorsements for “awesomeness.” Welcome, Vincent!

Vincent: Hi! About those “awesomeness” endorsements—at a previous company I had a running bet with a CTO: who could get the most “awesomeness” endorsements without asking. So yes, I’m slightly cheating there—but it’s been a years-long joke.

In this interview, we covered:

Vincent’s Journey and Open Source Philosophy

Q: Could you walk us through your career journey and how you ended up in open source?

Vincent: I grew up bilingual—born in the U.S., raised mostly in the Netherlands, with lots of travel back to the U.S. After high school, I wanted to study design, but I got a bad grade for not using a prescribed brainstorming framework—so I pivoted. I discovered econometrics (applied math), and they promised I’d predict stocks… which, spoiler alert, you can’t. I did a master’s in operations research instead. Around then, AI/ML was getting hyped—there was the free Stanford AI course with Norvig and Thrun; I was an early student. I realized I liked programming more than suit-and-tie consulting.

I backpacked through Latin America with a laptop, did some consulting, and told myself: if I prefer coding to going clubbing every night, that’s my sign. Coding won. Back in the Netherlands, I taught at a business school (good money), but wanted data science. I met someone at a meetup who was starting a Hadoop company and needed math for ML—that got me in. I helped bring PyData to the Netherlands and did community work (I’m no longer formally involved). I blogged, built side projects, and Rasa noticed. They asked if I could do my “Vincent stuff” on their stack. It took me a month to realize they meant a full-time role.

Q: As a Research Advocate at Rasa, what does your role actually involve, and how does it differ from a traditional Developer Advocate?

Vincent: The title is new—we’re still defining it, like data science roles were years ago. Think “developer advocate,” but closer to research. I make sure non-English languages are better supported (e.g., right-to-left languages like Arabic need different handling). I listen to practitioners building assistants and surface their needs; I also translate what our researchers and engineers build into practical guides. Sometimes that’s a blog or video; sometimes it’s an open-source tool. I even co-authored a research paper by accident because a tool I made was useful to researchers. I ship a lot of “byproducts”: debugging tools on top of Rasa, word-embedding exploration and bias tools, etc.—not the core product, but helpful for the community.

Q: For those unfamiliar with Rasa, could you explain what the company actually does and how it’s positioned in the market?

Vincent: Think of Elastic: they build search infrastructure; they don’t build every search app. Rasa builds the infrastructure for virtual assistants: standard pipelines, components you can swap, deploy on your own servers (important for healthcare, etc.). Open source and pragmatic—pip install and go.

Q: How do you personally define open source, and what philosophy guides your approach to creating and sharing open source tools?

Vincent: I’m not a license wonk—there are nuances—but my framing is pragmatic. Others shared tools that made my career possible; when I have useful tools, I share them back. Scikit-learn is amazing, but there are experimental tricks they won’t include. Nothing stops me from publishing compatible extras.

One caveat: putting code on GitHub is fine, but publishing on PyPI should wait until it’s mature. I have a project called Brent that I put on PyPI too early—I’d prefer it had stayed GitHub-only until it was ready.

Q: Your pinned tweet emphasizes “preferring common sense over hype.” How does this philosophy apply to open source development and avoiding the trap of solving the wrong problems?

Vincent: It’s easy to optimize a metric by 1% while ignoring a 10% mismatch between the real business problem and the analytical proxy. Keep the real problem front and center.

Building and Managing Open Source Projects

Q: You’ve created and maintain several popular open source packages. Could you walk us through some of your key projects and explain how they came to be?

Vincent: Ideas come from curiosity or wanting a better API. Examples:

  • evol — evolutionary algorithms with a cleaner API (functional style with Population/Evolution objects) so it’s easier to maintain than nested for-loops
  • clumper — “pandas, but for nested JSON” with a friendly API
  • memo — decorators that log function outputs; great for grid searches so you get a JSON trail
  • sklego (scikit-lego) — extra “lego bricks” for scikit-learn
  • human-learn — rule-based systems inside scikit-learn (“if this then that”), making human rules grid-searchable
  • whatlies — explore what lies in word embeddings (visualization + scikit-learn-compatible transformers so you can benchmark quickly)
  • Smaller tools like schedulelord (cron helpers for Raspberry Pi) and make-test-docs (two functions to treat docs/examples as unit tests)

I try to design for low maintenance: small, focused APIs, good docs, and compatibility with existing ecosystems—especially scikit-learn—so people can slot tools into pipelines and compare fairly (“is the 10× slower model also 10× better?”).

Q: Let’s dive deeper into scikit-lego. What specific problems does it solve, and how do these “lego bricks” integrate with existing scikit-learn workflows?

Vincent: They are scikit-learn–compatible pipeline components. Some are fancy; some are simple but practical. Example: if your model expects values in [0, 1] but sees a 1,000,000 at inference, it’ll behave badly. We have a transformer that clips values to a maximum, so that outlier becomes 1. It’s a drop-in component, so you don’t have to reinvent the wheel.
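To make the clipping idea concrete, here is a minimal sketch of such a transformer. The class name `ClipToTrainingRange` and its behavior (clip each column to the range seen during `fit`) are assumptions for illustration; scikit-lego's own component may use a different name and options.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ClipToTrainingRange(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: clip each column to the range seen during fit."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Remember the per-column minimum and maximum of the training data.
        self.min_ = X.min(axis=0)
        self.max_ = X.max(axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Anything outside the training range is pulled back to its edge.
        return np.clip(X, self.min_, self.max_)


X_train = np.array([[0.0], [0.4], [1.0]])
X_new = np.array([[0.5], [1_000_000.0]])

clipper = ClipToTrainingRange().fit(X_train)
X_clipped = clipper.transform(X_new)  # the outlier is clipped to 1.0
```

Because it exposes `fit` and `transform`, a component like this can be placed as a step in front of a model inside a scikit-learn `Pipeline`.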

We also have interesting meta-components. Take classifier thresholds: instead of always using 0.5, you can grid-search the threshold to trade precision/recall. Our components let you tune that inside a pipeline.
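The threshold idea can be sketched as a small meta-estimator that wraps any binary classifier. The name `ThresholdClassifier`, the synthetic dataset, and the choice of F1 as the metric are assumptions for illustration, not scikit-lego's actual API.

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


class ThresholdClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical meta-estimator: a binary classifier plus a tunable cutoff."""

    def __init__(self, model, threshold=0.5):
        self.model = model
        self.threshold = threshold

    def fit(self, X, y):
        # Fit a fresh copy so the wrapped model passed in is left untouched.
        self.model_ = clone(self.model).fit(X, y)
        self.classes_ = self.model_.classes_
        return self

    def predict(self, X):
        # Probability of the second class, treated as the "positive" one.
        proba = self.model_.predict_proba(X)[:, 1]
        return self.classes_[(proba >= self.threshold).astype(int)]


# Synthetic, imbalanced data purely for the demo.
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]

search = GridSearchCV(
    ThresholdClassifier(LogisticRegression(max_iter=1000)),
    param_grid={"threshold": thresholds},
    scoring="f1",
    cv=3,
)
search.fit(X, y)
best_threshold = search.best_params_["threshold"]
```

Swapping `scoring` for `"precision"` or `"recall"` shifts which cutoff wins, which is the trade-off described above.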

Q: Your project names are wonderfully creative and descriptive. What’s your process for naming projects, and why is good naming so important?

Vincent: I actually made makenames.io with a friend—a silly name generator trained on Pokémon and IKEA catalogs. I’ll generate names for 10 minutes, get frustrated, and then blurt out the clear description—that often is the name. I want names that say what they do: scikit-lego (lego bricks for scikit-learn), human-learn (human rules in sklearn), whatlies (what lies in embeddings), clumper (clump JSON), memo (memorize/log). At a former job, we turned naming into a team sport—Steven Sequel for a SQL service; the upgrade was Steven C. Call: The Sequel; a document store named JSON Bourne. It sounds goofy, but it boosted team energy and clarity: a good name forces you to be clear about what the thing does.

Q: What are the key elements that make an open source project successful and maintainable, and how do you handle community contributions while maintaining a positive environment?

Vincent: Not just code—stewardship. Great docs and a clear path from “zero to solving a problem.” My doc checklist:

  • README/Home: what it does, installation, a concise problem statement, maybe a logo, contribution notes
  • Guides: a “Getting Started” and at least one advanced guide
  • API reference: every function/method clearly described (images help)
  • Examples: end-to-end notebooks showing real tasks (e.g., bias analysis with word embeddings, benchmarking embeddings for Arabic)

If there’s no “Getting Started,” adoption dies. Good docs are part of the product.

For contributions, I include a guide—on whatlies, the README has it even if the docs site doesn’t. I try to set expectations: every issue is considered, but that doesn’t mean it’ll be implemented immediately. Stewardship also means nudging for good behavior. Sometimes folks are… unfriendly. Please don’t write in ALL CAPS—it reads like shouting. I’m doing volunteer work; basic courtesy helps. I once had someone insist I must use Bokeh instead of Altair and then refuse to contribute—heated debate, zero code. A contribution guide should discourage that and keep discussions constructive.

How to Start Contributing to Open Source Projects

Q: For someone who wants to start contributing to open source, what would you recommend as the best first steps, and how should they approach contributing to large projects versus smaller ones?

Vincent: It depends on your goal. Shipping your own library is different from contributing to an existing one. A great first contribution is using a tool, hitting a confusing error, and opening a clear GitHub issue with a repro and suggested fix (even just a better error message). That already helps maintainers a lot.

At Rasa, we recognize contributors beyond code—merged PRs, good talks, helpful blog posts. If you want to make your first code PR, invest in ecosystem basics: setup.py/packaging, pytest, flake8, black, pre-commit hooks, Git/GitHub workflows, and CI (e.g., GitHub Actions). Calmcode.io has short tutorials on many of these.

For large projects like scikit-learn, there’s lots of traffic and process—they require algorithms to stand the test of time. Consider smaller projects where discussion is easier. Start by opening an issue to propose your idea—don’t code a big feature before the maintainer is interested. As a maintainer, I worry about (1) general usefulness (vs. niche) and (2) long-term maintenance—will the contributor stick around? Address those in your proposal.

Projects like DION (a checklist for ML risks and unintended side effects) welcome contributions—even anecdotes for docs. It’s impactful and approachable.

Q: How do you manage to find time and energy for so many open source projects while maintaining quality, and what productivity strategies do you use?

Vincent: A lot of my open-source work happens on my employer’s time, with their support, so I learn at work and apply it elsewhere (and vice versa). Personally, I’ve shifted my free time: fewer video games, more tinkering, plus exercise, friends, and family. Productivity tip: don’t start by coding. Use paper or an e-ink tablet to design the solution first. You should know the shape of the solution before typing a character; exploratory notebooks are the exception, not the rule.

Career Advancement Through Open Source Projects

Q: Many developers have internal tools at work that could benefit the broader community. How can someone convince their employer to open source internal libraries, and what are the business benefits for companies?

Vincent: Position it as hiring and brand leverage. Cool OSS attracts attention—people check out the company behind it. Talks about your tools boost morale and serve recruiting. There was even a company that built a whimsical “garbage fire” demo; the blog about it drove real product interest. More broadly, some companies give employees a day a month for OSS. Treat it as training—contributing grows engineering maturity (Git, CI, packaging, reviews). Yes, regulated domains (e.g., finance) may limit this; be respectful of legal constraints.

Q: How can open source work and conference speaking help with career advancement, and what strategies have helped you build visibility in the industry?

Vincent: For conference talks, write your proposal so a reviewer thinks, “I’d attend this.” At PyData Amsterdam, our rule was just that. Entertaining + educational is gold, but either alone is fine. Great talks often come from “simple but insightful” topics: a weird pandas trick that saved your day, how JSON parsing really works, or a fun dataset (e.g., “Which English words are the most metal?”). Many people don’t realize their blog post could be a fantastic talk—submit it!

OSS contributions can de-risk you as a candidate—if you’ve meaningfully contributed to tools a team uses, they know you understand them. It’s not a silver bullet, and lack of OSS (family, time, etc.) shouldn’t disqualify anyone. Talks and blogs work too—my most-watched talk was about winning with simple models (yes, linear regression). Clear thinking > flashy hype.

For building visibility: meetups and teaching. I gave free R trainings to app devs, organized events, and said “yes” a lot. Visibility scales better than trying to meet everyone individually. Luck played a role too: being early in “data science,” building first recommenders, living in Amsterdam—all helped.

Q: As we wrap up, could you reflect on how the Research Advocate role at Rasa differs from traditional Developer Advocate positions, and what you see as the future of this type of role?

Vincent: DevRel is a well-trodden path—there are even DevRel conferences. Research Advocacy at Rasa is similar but closer to research. I explore ideas, prototype, sanity-check with researchers, and bring community feedback back to the team (especially for non-English use cases). Next, I want to build a personal Slack assistant with Rasa—automating my own workflows—then share the journey so developers can replicate it. It’s a two-way bridge: community ↔ research.


Transcript

The transcripts are edited for clarity, sometimes with AI. If you notice any incorrect information, let us know.

Alexey: Hi, welcome everyone. This event is brought to you by DataTalks.Club, a community of people who love data. We have two types of events. On Tuesdays, we usually have more technical events where we talk about technical topics, usually with a presentation with slides, like a webinar. (0:02)

Alexey: We will not have many of those in February because we have a conference, which I’ll talk more about later. We will come back to these events in March, and we already have one planned for March 2nd, about building scalable end-to-end data learning pipelines in the cloud. That’s a long title. (0:22)

Alexey: We also have a different kind of event, usually on Fridays. Today’s event is this type, even though it’s Thursday, where it’s more like a conversation. We talk about different things. Today the topic is open source. Tomorrow we will talk about envelopes. (0:47)

Alexey: On February 2nd, which is a Tuesday, we will talk about feature stores. Again in February, we will not have many events because of the conference. We will come back in March with a topic on public speaking. (1:10)

Alexey: Regarding the conference, it happens every Friday in February, with four tracks each day. First, we will talk about different use cases of machine learning. Then we’ll talk about products and processes. After that, we’ll talk about careers in data. Finally, we’ll cover machine learning in production. (1:29)

Alexey: The link is on our website, datatalks.club, where you can find more information. For questions today, we will use Slido. I will share the link now in the chat. (2:00)

Alexey: Here is the link we can use for asking Vincent any questions today. If you need to tend to a kid for a brief moment, that’s fine. We will start shortly. (2:29)

Alexey: Let’s start. Today we’ll talk about open source. We have a special guest, Vincent Warmerdam. Vincent works as a Research Advocate at Rasa. (3:09)

Alexey: If you check his LinkedIn profile, you will see quite a few things. He has been on the front page of Reddit. He is making a resource for people who want to learn coding, calmcode.io. He also organized the PyData Amsterdam conference. (3:45)

Alexey: He is a data evangelist and an open source enthusiast. He is a creator and maintainer of some open source packages. That’s why we invited him today. Last but not least, he has over 80 endorsements for awesomeness on LinkedIn. (4:03)

Alexey: Welcome, Vincent. Thanks for coming to our event. (4:25)

Vincent: Hi. About the endorsements for awesomeness, at a previous company I had a bet with a CTO. We played a game to see who could get the most points for awesomeness on LinkedIn without asking for them. I might have cheated a little, but it wasn’t about bragging. It was a long-running bet. (4:31)

Alexey: When I was reading your bio on LinkedIn, I thought, “Okay, awesomeness, let me check.” Of course, even one sentence influences the score. (4:53)

Alexey: Before we get into open source, let’s talk a bit about your background. Can you tell us a bit about yourself and your journey so far? (5:08)

Vincent: I don’t have a typical background. I was raised bilingually. I was born in the United States, lived there for a bit, and did most of my youth in the Netherlands, but I traveled to the U.S., which explains my accent. After high school, I wanted to study design, which I was keenly interested in. (5:20)

Vincent: I remember getting a bad grade because the feedback was: “Vincent, you’re very creative, but you’re not using our 12-step program on brainstorming.” I thought, well, I don’t like where this is going. Then I learned about econometrics, an applied math field. They promised that I could predict stocks. That sounded awesome. (5:44)

Vincent: After the bachelor’s, I found out that was mostly marketing; it doesn’t actually predict stocks. So I quickly decided to take a master’s in operations research. Around that time, AI and machine learning were becoming popular. I took a free Stanford course with Peter Norvig and Sebastian Thrun, and I realized I liked programming more than consulting. (6:13)

Vincent: I decided to try a career switch. I backpacked through Latin America with a laptop, doing some consulting along the way. I realized I preferred programming over going to clubs every night. When I came back to the Netherlands, I taught at a business school, but I wanted to focus on data science. A person I met at a meetup was starting a Hadoop company and needed someone who understood the math behind machine learning. (6:48)

Vincent: I got a gig there, and the ball started rolling. I wanted PyData to happen in the Netherlands, so I did some community projects. I’m no longer formally involved with PyData, but I’m still around. I had lots of side projects, which eventually caught the attention of people at Rasa. They asked me to do my Vincent stuff on their stack, because they thought I could help. (7:48)

Alexey: So that’s how you got hired. (8:21)

Vincent: Yes, it took me a while to realize they wanted to hire me permanently. For a long time, I thought it was just a collaborative project. Eventually, they said: “We’re talking about a career here, not a side project.” It took me a month to realize that. I’ve been there almost a year, and I really enjoy the culture. (8:27)

Vincent: It’s a diverse mix of people. It’s fun, and it’s not just about machine learning. One thing I try to pick up is how to work in non-English languages. For example, word embeddings trained left-to-right for English don’t work for Arabic. I get to reach out to communities, say here are some tools, and get feedback. (9:00)

Vincent: This led to a couple of open source projects, which also led to some articles. The main thing I try to do is help people understand tools. Our engineers and researchers develop a lot of features, but having a tool is one thing; understanding it is another. I like teaching, so that’s my focus. (9:35)

Alexey: Your title is Research Advocate, right? (10:06)

Vincent: Yes, that’s the title we came up with. (10:13)

Alexey: What are your responsibilities there? (10:18)

Vincent: It’s hard to explain. The data science field is still figuring out what it means to be a data scientist. Six years ago, being a data scientist meant figuring it out yourself. It involved a bit of programming, a bit of this, a bit of that. (10:23)

Vincent: As a Research Advocate, I’m trying to figure out what it means while teaching and helping others. There’s freedom in shaping it myself. (10:53)

Vincent: The thing is, especially with non-English languages, I really like the attitude that this job allows me. I do my best to make sure that a lot of these non-English tools are well supported in Rasa. I don’t want to give the impression that I know everything; someone fluent in Arabic knows far more than I ever will. (10:59)

Vincent: A lot of what I do is just listening to people: “Hey, you’re trying to make this digital assistant work with open source tools. What are your problems?” Maybe I can help benchmark something, or I can proxy it to the research team. They have really cool tools, and I make sure they’re well understood. (11:18)

Vincent: Sometimes the best thing I can do is make a YouTube video or a blog post. Sometimes it’s contributing to open source tools. That’s the ground I’m trying to cover. It’s a little bit of improvisation, but it’s good improvisation because I don’t have to pretend I know everything. (11:42)

Vincent: Sometimes, as a consultant, you’re supposed to be the authority all the time. Here, I can just say: “Look, if you want to make a virtual assistant in any language community, tell me what I need to do, and I’ll gladly help.” That’s my main focus. (12:02)

Alexey: So basically, you’re someone between the technical team and the user? (12:28)

Vincent: Yes, it’s like a developer advocate with more focus toward research. I’ve actually written a research paper by accident this year. I made an open source package that’s super useful to some researchers. I make a lot of byproducts, which are useful to the community. (12:34)

Vincent: Some of these are machine learning debugging tools that work on top of Rasa. Others help explore word embeddings or investigate bias in them. These aren’t core to what we do, but they help our community, which is great. (12:57)

Alexey: Can you explain a bit about the company, Rasa, and what you’re developing? (13:16)

Vincent: It’s a little different. Elasticsearch helps with search, but the company isn’t implementing it for everyone directly. They have support contracts. (13:22)

Vincent: On our side, we help companies make virtual assistants, but we focus on building the infrastructure. We provide the standard pipeline for a virtual assistant. If you need a very specific component, you can make that. We’re trying to provide the infrastructure layer. (13:38)

Vincent: We aim to make it open source and pragmatic, so you can work with it. For example, you might want to host it on your own servers for healthcare applications. We provide the tools to do this in an open source way. It’s as simple as a pip install. (13:57)

Alexey: Speaking of open source, what does open source mean to you? (14:22)

Vincent: I have to admit, I don’t consider myself super knowledgeable about all the details, like licenses. Some people take licenses very seriously. My approach has always been pragmatic: people made a bag of tools and tricks that are useful to me. I sometimes create my own tricks that others might not have made. (14:30)

Vincent: The best example is scikit-lego, which I started with Matias from PyData Amsterdam. Some tools, like scikit-learn, are amazing, but there are tricks I use often that are experimental. I host those myself. (15:06)

Vincent: This is how I approach open source: I scratch an itch, and others can find edge cases and help improve the code. That wouldn’t happen if I kept it private. (15:39)

Alexey: So everything on GitHub is automatically open source, as long as it’s not private? (15:56)

Vincent: Yes, but I make a distinction. I have an open source project called Brent that I put on PyPI, but it’s not fully ready. There are edge cases that need to be fixed before it’s fully public. Hosting on GitHub is fine, but there’s a maturity step before publishing to a public package index. (16:02)

Alexey: You mentioned scikit-lego and Brent. Can you tell us more about them? (16:44)

Vincent: Open source often starts with curiosity: I want to explore an idea or improve a user interface. For example, the library evol is an evolutionary algorithm library. The name is an anagram of “love evil.” (17:02)

Vincent: I made evol with GoHere. We wanted to simplify genetic algorithms, which are usually a for-loop inside a for-loop inside a for-loop. We created a population object and an evolution object with a functional API. This makes evolutionary algorithms easier to use. (17:43)
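A rough sketch of the population-plus-evolution idea follows. The `Population` and `Evolution` classes here are hypothetical stand-ins written for this note and are not the library's real API; they only show how a chain of named steps can replace nested for-loops.

```python
import random


class Population:
    """Hypothetical container: individuals plus the function that scores them."""

    def __init__(self, individuals, score):
        self.individuals = list(individuals)
        self.score = score

    def survive(self, fraction):
        # Keep the best-scoring fraction (at least two, so breeding can pair them).
        ranked = sorted(self.individuals, key=self.score, reverse=True)
        keep = max(2, int(len(ranked) * fraction))
        return Population(ranked[:keep], self.score)

    def breed(self, combine, size, rng):
        children = [
            combine(*rng.sample(self.individuals, 2))
            for _ in range(size - len(self.individuals))
        ]
        # Parents are kept, so the best individual is never lost.
        return Population(self.individuals + children, self.score)

    @property
    def best(self):
        return max(self.individuals, key=self.score)


class Evolution:
    """Hypothetical recipe: an ordered list of steps applied to a population."""

    def __init__(self):
        self.steps = []

    def survive(self, **kwargs):
        self.steps.append(("survive", kwargs))
        return self

    def breed(self, **kwargs):
        self.steps.append(("breed", kwargs))
        return self

    def run(self, population, generations):
        for _ in range(generations):
            for name, kwargs in self.steps:
                population = getattr(population, name)(**kwargs)
        return population


rng = random.Random(42)
score = lambda x: -(x - 3.0) ** 2  # toy objective, maximised at x = 3
start = Population([rng.uniform(-10, 10) for _ in range(20)], score)

recipe = (
    Evolution()
    .survive(fraction=0.5)
    .breed(combine=lambda a, b: (a + b) / 2 + rng.gauss(0, 0.1), size=20, rng=rng)
)
final = recipe.run(start, generations=30)
```

The recipe is declared once and applied to any population, which is the maintainability gain over hand-written loops.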

Vincent: I haven’t touched the library in years, but it still gets a few hundred downloads a week. People blog about it because the API is simpler. Another library is Clumper, like pandas but for nested JSON objects, to make the API easier. (18:12)
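As an illustration of what a "pandas-like" interface over nested JSON could look like, here is a small sketch. The `Chain` class, its verbs, and the sample records are all made up for this note and should not be read as Clumper's actual API.

```python
class Chain:
    """Hypothetical wrapper around a list of nested dicts with chainable verbs."""

    def __init__(self, items):
        self.items = list(items)

    def keep(self, predicate):
        # Filter records with an arbitrary function of the whole record.
        return Chain(d for d in self.items if predicate(d))

    def mutate(self, **new_fields):
        # Each keyword maps a new key to a function of the whole record.
        return Chain(
            {**d, **{key: func(d) for key, func in new_fields.items()}}
            for d in self.items
        )

    def select(self, *keys):
        return Chain({key: d[key] for key in keys} for d in self.items)

    def collect(self):
        return self.items


# Illustrative records only.
orders = [
    {"id": 1, "customer": {"name": "Ada", "country": "NL"}, "items": ["a", "b"]},
    {"id": 2, "customer": {"name": "Bob", "country": "US"}, "items": ["c"]},
    {"id": 3, "customer": {"name": "Cleo", "country": "NL"}, "items": []},
]

result = (
    Chain(orders)
    .keep(lambda d: d["customer"]["country"] == "NL")
    .mutate(n_items=lambda d: len(d["items"]))
    .select("id", "n_items")
    .collect()
)
```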

Vincent: Memo is another tool, a set of decorators that log function outputs. If you’re doing grid search, Memo saves everything in a JSON file. That’s one branch of packages I make: improving usability. (18:39)
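The decorator pattern described here can be sketched with the standard library alone. The `log_calls` decorator and the `simulate` function are hypothetical names for this example, not Memo's real interface.

```python
import functools
import itertools
import json


def log_calls(store):
    """Hypothetical decorator: append each call's keyword inputs and outputs to `store`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(**kwargs):
            result = func(**kwargs)
            # Merge the settings and the returned metrics into one record.
            store.append({**kwargs, **result})
            return result

        return wrapper

    return decorator


records = []


@log_calls(records)
def simulate(learning_rate, depth):
    # Stand-in for a real experiment; returns a dict of metrics.
    return {"score": round(depth * (1 - learning_rate), 3)}


# A tiny grid search: every combination is logged automatically.
for lr, depth in itertools.product([0.1, 0.5], [2, 4]):
    simulate(learning_rate=lr, depth=depth)

# One JSON object per line, ready to be written to a file.
json_lines = "\n".join(json.dumps(record) for record in records)
```

The experiment code stays free of logging calls; the trail of settings and results comes from the decorator.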

Vincent: The other branch integrates with existing ecosystems. For scikit-learn, I made scikit-lego, human-learn, and other tools. human-learn allows rule-based systems so domain experts can use “if this, then that.” (18:57)
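One way to picture an "if this, then that" rule living inside scikit-learn is the sketch below. `RuleClassifier`, its `cutoff` parameter, and the synthetic data are assumptions for illustration and are not the package's own API.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV


class RuleClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical rule: predict 1 when the first feature exceeds `cutoff`."""

    def __init__(self, cutoff=0.0):
        self.cutoff = cutoff

    def fit(self, X, y):
        # Nothing is learned; the rule itself comes from a domain expert.
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        X = np.asarray(X)
        return (X[:, 0] > self.cutoff).astype(int)


# Synthetic data whose true boundary sits at 6, purely for the demo.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = (X[:, 0] > 6).astype(int)

# Because the rule is an estimator, its parameter can be grid-searched.
search = GridSearchCV(
    RuleClassifier(), {"cutoff": [2, 4, 6, 8]}, scoring="accuracy", cv=4
)
search.fit(X, y)
best_cutoff = search.best_params_["cutoff"]
```

Since the rule follows the estimator interface, it can also be benchmarked against learned models with the same cross-validation setup.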

Vincent: I also made whatlies, which investigates word embeddings. It’s pipeline-compatible with scikit-learn, so you can benchmark models quickly. You can check if a slower model is actually more accurate. (19:22)

Alexey: You mentioned quite a few packages; there were at least ten you listed. (20:07)

Vincent: I also do lots of stuff for Rasa, but that’s my employer, so that’s cheating. I write my packages so they don’t require a lot of maintenance. For example, I have one called ScheduleLord, which helps me maintain my cron jobs on my Raspberry Pis. It has documentation, but in reality, it has five users, and that’s fine. (20:24)

Vincent: I also have another tool called MakeTestDocs. It’s a package with two functions that make it easier to write unit tests for your markdown files. If you have Python examples in your README or MkDocs template, you can just add one or two functions, and your documentation becomes your unit test. It’s easy to maintain two functions. (20:55)
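The "documentation becomes your unit test" idea can be sketched in a few lines. The function names `extract_python_blocks` and `run_markdown_examples` are hypothetical and chosen for this note; the real package's functions may be named and behave differently.

```python
import re

# Built at runtime so this example can itself live inside a fenced block.
FENCE = "`" * 3


def extract_python_blocks(markdown):
    """Return the source of every fenced Python block in a markdown string."""
    pattern = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)
    return pattern.findall(markdown)


def run_markdown_examples(markdown):
    """Execute each block in a fresh namespace; any exception fails the check."""
    for source in extract_python_blocks(markdown):
        exec(source, {})


# A tiny README held in a string; in practice this would be read from a file.
readme = (
    "# Demo\n\n"
    + FENCE + "python\n"
    + "total = sum([1, 2, 3])\n"
    + "assert total == 6\n"
    + FENCE + "\n"
)

blocks = extract_python_blocks(readme)
run_markdown_examples(readme)  # raises if a documented example is wrong
```

Calling `run_markdown_examples` from a test function means a stale README example fails the test suite instead of misleading users.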

Alexey: What caught my attention was the names of your packages. They’re very creative. How do you pick names for your projects? (21:26)

Vincent: My approach is pretty silly. A while ago, a buddy and I made a website called MakeNames.io. The idea is to use Pokémon names as a training dataset to generate new names. We also used the IKEA catalog as a corpus. It’s not meant to be a serious project. (21:37)

Vincent: Usually, I spend ten minutes generating names. If I get frustrated because they’re silly, I turn that into a sentence describing what the package does. For example, scikit-lego is for building Lego-like blocks in a scikit-learn pipeline. human-learn lets humans put rules into scikit-learn. whatlies investigates word embeddings. Clumper helps JSON clump together. Memo helps memorize stuff. (22:14)

Vincent: Naming is often the hardest part. In a past career, I did consultancy for the Dutch Flower Auction. The team culture was great because we named every new service. For example, a SQL database service was called Steven Sequel. When we upgraded it, we called it Steven C. Call: The Sequel. A document store was called JSON Bourne. (23:07)

Vincent: This silly naming competition made every service launch exciting. Communication improved because the names clearly described what the server should do. (23:45)

Alexey: So for scikit-lego, the name reflects building blocks that can be combined into a pipeline? (25:05)

Vincent: Yes. It’s a scikit-learn-compatible pipeline component. Some are fancy, some are simple. For example, if you train a model and get an outlier during prediction, we have a component to clip it. It might not be useful for everyone, but it’s ready if someone needs it. (25:17)

Vincent: We also have meta-components. For instance, a classifier might use a threshold of 0.5. You could grid search to find a threshold that optimizes precision or recall. These tools help you investigate and adjust the model. (25:52)

Alexey: We have a question about getting started with open source: “I’ve been coding for a while and will study computer science at university. How should I begin contributing to open source?” (26:26)

Vincent: It depends on what you want to achieve. If you want to start your own project, ask if it’s easier to maintain it and convince others to use it, or contribute to an existing project. You can talk to maintainers and see if they’re interested in your contribution. (26:44)

Vincent: If you’re a maintainer, you understand the project well, but you might not see what’s unclear to new users. If you hit a confusing error, open a GitHub issue suggesting improvements. That alone is already a contribution. (27:12)

Vincent: The first step is to use a tool, identify a problem, and create an issue. At Rasa, we have a contributor program. You can join via Slack, get PRs merged, give talks, or write blog posts, and we consider you a contributor. (28:00)

Vincent: For your first commit, don’t worry too much. Make sure you understand setup.py, flake8, black, pytest, and pre-commit hooks. Understanding Git and GitHub, including commits, merges, and pull requests, is also important. (28:40)

Vincent: Investing in these skills is sensible. Learn Python, Git, and GitHub basics, plus continuous integration and GitHub Actions. My site, calmcode.io, has tutorials for all these tools. It helps you gain programming maturity beyond Jupyter notebooks. (29:55)
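For readers who have not written tests before, this is roughly what a pytest-style test module looks like. The `clip` function and its tests are a made-up example; the one assumption about pytest itself is its default behavior of collecting functions whose names start with `test_`.

```python
def clip(value, low=0.0, high=1.0):
    """Hypothetical function under test: limit `value` to the range [low, high]."""
    if low > high:
        # A clear error message spares users a confusing failure later on.
        raise ValueError(f"low ({low}) must not exceed high ({high})")
    return max(low, min(high, value))


# pytest collects functions named test_* and reports any failed assertion.
def test_clip_limits_outliers():
    assert clip(1_000_000) == 1.0
    assert clip(-5) == 0.0


def test_clip_keeps_values_in_range():
    assert clip(0.25) == 0.25


def test_clip_rejects_bad_bounds():
    try:
        clip(0.5, low=2, high=1)
    except ValueError as err:
        assert "must not exceed" in str(err)
    else:
        raise AssertionError("expected ValueError")
```

Saved as a file such as `test_clip.py`, the same checks can run locally, in a pre-commit hook, or in CI.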

Alexey: Let’s say I want to contribute to a library like scikit-learn. I know some Python and Git. What are the next steps? (30:13)

Vincent: The path I took is different because I organized events. Most packages I contribute to are maintained by someone I’ve met in real life. That makes it easier to discuss ideas. (30:32)

Vincent: Big projects like pandas have thousands of issues. Maintaining them is tricky. It may be better to start with a smaller library. Reach out on GitHub issues and see if maintainers are interested in your feature before starting work. (30:51)

Vincent: I contributed to a project called Deon. It’s a checklist for starting a machine learning project, checking for unintended side effects before deployment. They also accept anecdotes for documentation. It was a small, manageable way to contribute. (31:31)

Vincent: Deon is, as far as I’m concerned, one of the most impactful projects for preventing a lot of bad outcomes in machine learning. Deon is amazing. If you haven’t checked it out, you really should. We’ll share the links after the event. (32:00)

Alexey: So your suggestion is that big projects like pandas or scikit-learn might be too large to start contributing to. For example, in scikit-learn, there’s a policy where new algorithms must stand the test of time. You can’t just submit something you created yesterday. (32:18)

Vincent: Exactly. Smaller projects are easier to get started with because the codebase is smaller, there are fewer issues, and it’s easier to communicate. The main thing is psychological: maintainers worry about whether a feature is broadly useful and whether contributors will stick around if something breaks. (32:48)

Vincent: It’s easier to have these conversations in a smaller project setting. You can explain your interest in helping maintain the project. No one wants to inherit legacy code without context. (33:35)

Alexey: Another question: how do you find the time and energy to work on so many side projects? You mentioned 10, 20, quite a few, in addition to your work at Rasa. Do you have any productivity tips? (33:47)

Vincent: A lot of my open source work happens on my employer’s time. Even if it’s not personal work, I’m still learning and maintaining products on the job, which translates to my open source work. My employer doesn’t mind, which makes it easier. (34:13)

Vincent: I also realized after turning 30 that some activities, like video games, weren’t as fun as I thought. These days, I prefer hanging out with friends, exercising, and spending time at home with my wife and cats. (34:44)

Vincent: I also use an e-ink drawing tablet. The most productive thing you can do is think through a technical problem before typing a single character. Having the solution conceptually ready prevents wasted effort. (35:15)

Vincent: Sometimes you need to experiment in a notebook, but many people start coding without a plan, which wastes time. Planning first makes you more productive. (36:03)

Alexey: I remember your pinned tweet about this. How does it go? (36:08)

Vincent: It’s basically: “Vincent prefers common sense over hype. Let’s not solve the wrong problem.” I stand by this, especially in data science. People often optimize a numerical metric without considering errors in the problem translation itself. (36:26)

Alexey: Coming back to open source, in your opinion, what makes a good open source project? (37:05)

Vincent: There are a few things. One concept is stewardship. Being a steward of a project means more than just code; it’s about maintaining the ecosystem around it. (37:12)

Vincent: Even if your algorithm is twice as fast, if the documentation isn’t clear, no one will care. Docs, GitHub organization, and a clear onboarding path are crucial. (37:40)

Vincent: I like projects that make a real effort. If a project only has API references but no getting started guide, that’s frustrating. The path from knowing nothing to solving a problem must be defined. (38:00)

Vincent: For larger projects like Clumper, I like a homepage overview, a README with installation, problem description, and contribution info. Then a guides page for getting started and advanced topics. I also include an API reference with pictures for clarity. (38:36)

Vincent: I include example pages too. For WhatLies, there’s a page on research and bias in word embeddings, and one on working with Arabic embeddings. Examples give context and make it easier to reproduce results. (39:29)

Vincent: Documentation isn’t complete unless there’s both an API and examples. That’s what I aim for in more complex projects: a clear README, tutorials, advanced guides, and example usage. (39:58)

Alexey: What about a contribution guide? Should that be part of the checklist? (40:25)

Vincent: Yes, I usually include it. For example, in WhatLies, the README mentions contribution guidelines on GitHub. I try to make it clear that all issues are considered, but I avoid suggesting every issue will be implemented immediately. (40:38)

Vincent: One thing about stewardship I want to pay attention to is that I’ve noticed on some of my repos, some people are really unfriendly. You’re just wondering, are you sitting on a bag of nails? What is wrong with you? Why are you so angry that one particular feature isn’t in there? I’m doing volunteer work maintaining this stuff. (41:04)

Vincent: Regarding contribution guides and etiquette, I really try to emphasize small things. Don’t write sentences in all caps. It reads as if you’re shouting at me, and I’m doing volunteer work. I try to steer conversations on my issue list in that direction. I remember one guy on WhatLies saying, “Hey, you’re using Altair, but you should be using Bokeh. It’s way better.” That would lead to a really heated debate, and he wasn’t even willing to contribute. It’s a really weird situation, and it’s something I try to pay attention to. (41:28)

Vincent: If you’re starting out, you should probably not worry too much about that, but the basic rule is: don’t use all caps when adding an issue. That’s rude. Most of us don’t get paid to work on open source; most of us are doing it voluntarily. (42:02)

Vincent: Sometimes we work on internal libraries at our company that aren’t necessarily specific to the company. You might think, “I want to release this as an open source project because other people will find it useful.” How do you convince your employer? I consider myself lucky because I’ve been organizing conferences and stuff, so it makes it easier for me to make this claim. (42:22)

Vincent: I can approach my employer and say, “Hiring people is marketing. If you want to hire talent, you want to show that your employees work on cool, interesting tools.” Even a small, silly project reflects well. For example, there’s a library that lets you draw Matplotlib-style charts as ASCII art on the command line, following the Matplotlib API. It’s silly, but people notice. If a company produces fun, useful tools, it attracts attention. (42:58)

Vincent: Open source projects get attention, and presenting them at conferences is good for morale and hiring. Some companies gain visibility because their employees present open source work. I’ve seen a case where Basecamp stopped spending on marketing and instead built a “garbage fire”: literally a conveyor belt burning things people emailed in. It was silly but drove traffic and engagement. (43:45)

Vincent: Convincing your employer depends on your company. Financial companies may have stricter rules due to legal concerns. But most tech companies are hiring, and showing open source contributions helps with employer branding. Some companies even allow one day per month dedicated to open source. (45:00)

Vincent: You can also view this as training. Early in the year, you may not be comfortable contributing; by the end of the year, you are. Investing time in your team this way often results in more original, meaningful contributions. (46:02)

Alexey: We have a question from Mikhail: any advice for someone starting their data science career and wanting to give their first conference talk? (46:37)

Vincent: Two things: first, imagine the reviewer’s perspective. Every conference has a committee deciding which talks are interesting. Second, write your proposal so it’s engaging. Would you attend this talk yourself? That’s the first filter. (46:50)

Vincent: Conference style varies. At PyData Amsterdam, the vibe was playful and community-focused. Silly but educational talks were encouraged. One favorite example: someone analyzed metal lyrics to find the “most metal” English words. They never thought it would be a Python talk, but it was amazing. I even offered to help them prepare for the stage. (47:28)

Vincent: Many people don’t realize their ideas are interesting. Even quirky observations about APIs or datasets can make good talks. The key is communicating why it’s engaging if read out of context. (48:33)

Vincent: Contributing to open source isn’t enough to consider submitting a conference proposal. If your work becomes open source, it’s great for employer branding. Talks showcase your skills and your company. (49:24)

Vincent: Even simple projects can make interesting talks. For example, scraping Meetup data, exploring outliers, and visualizing results with Neo4j can be educational and entertaining. Half the audience may already know the tools, the other half learns something new. (49:54)

Vincent: The best talks are entertaining while teaching. If you achieve both, that’s ideal. Showing how tools can be combined to solve problems is especially valuable for newcomers. (50:55)

Alexey: We have another related question: how can open source help you land a job or establish a company? (51:02)

Vincent: I’ll try to give advice, but I’ve been ridiculously privileged in the way I’ve been hired. It’s hard for me to know what it’s like to start out, because in the last ten years, at every company that hired me, I didn’t have to talk to a recruiter; it was always the CTO reaching out. That’s how the ball got rolling, so it’s hard for me to give proper advice. (51:16)

Vincent: In the end, it’s about showing the other party that you might be a team member who will make them better at something. I wouldn’t worry too much about joining an established company. There are lots of interesting ways to learn. (51:45)

Vincent: For example, I know someone who wanted to work somewhere to learn from others. They found a local municipality job with zero talent in that area. They realized that meant they would have control over what they learned and be paid to figure things out. So you don’t necessarily have to join an established company; you can join something smaller and local and learn that way. (52:06)

Alexey: How does contributing to open source help you get a job? (52:47)

Vincent: It’s helpful. Back when I was consulting, if someone had contributed meaningful things to open source projects we used with clients, that showed they understood the tools. That alone isn’t enough to guarantee a hire, but it’s a plus. Don’t over-stress it. If you have family and don’t do open source in the evenings, that’s fine. Companies often value a capable, enjoyable colleague more than open source contributions. (52:52)

Vincent: Open source can help convince the other side of your capability. Giving talks or presenting interesting projects is another way to demonstrate skill. My most popular talk was on simple linear models, mostly linear regression. It got the most views ever. You don’t need fancy, state-of-the-art work; if you explain a problem well, it’s enough. Entertain and teach. (53:16)

Vincent: When reviewing talks, both teaching and entertainment matter. You don’t need to do both perfectly; even teaching alone is valuable. Sharing links to your talks is useful. (55:14)

Alexey: You mentioned that early in your career, CTOs reached out to you. How did that happen? (55:26)

Vincent: Luck played a large role. It usually started with meetups. I began giving free R trainings to app developers, teaching them how to analyze user data. People remember someone who helps them, and if the training was good, they would refer you. It’s about being visible and meeting people. (56:00)

Vincent: It’s not just open source; teaching, giving trainings, and being present in the community all help. When I started, I was the first person I knew to call themselves a data scientist. Skills back then were enough to get started. I had experience building recommenders for companies that didn’t have them yet. Timing and location mattered too: being in Amsterdam helped attract speakers and attention. (57:14)

Vincent: Today, you might focus on free training for cloud tools instead of R. YouTube and online resources weren’t as abundant back then. I would negotiate with companies for rooms to give trainings on weekends, which taught life skills. (58:42)

Vincent: When starting talks, choose topics that energize you and are feasible. My first conference talk used World of Warcraft auction house data to find a hack for the game. Planning a topic in advance that’s fun and personal helps. The room was full, mostly active players discussing the analysis. It was entertaining and unexpected. (59:44)

Alexey: We have a question about your role as a research advocate. What is it called elsewhere? (1:01:35)

Vincent: It’s similar to developer evangelism. When I joined, I didn’t realize “developer advocate” was a career path. Some companies have clear developer advocate roles. My marketing team has a senior developer advocate, Rachel, who was at Kaggle before. Certain companies expect developer advocates to attend big conferences, but expectations vary. (1:01:52)

Vincent: As a research advocate, I try teaching people how to make virtual assistants. I also experiment with automating tasks in Slack, turning markdown files into Slackbots. Sometimes it works, sometimes it’s crazy, but it’s two-way communication with researchers. They ask me to test ideas in the community, and I provide feedback. (1:03:44)

Alexey: Thanks for sharing your experience, even though we sidetracked from open source. (1:05:38)

Vincent: I learned a lot too, especially about naming open source libraries. (1:05:51)

Alexey: Thanks to everyone who attended the event. (1:06:04)

Vincent: Yes, thanks everyone. (1:06:13)

