The Kaggle Book

Ashish Lalchandani

Hi Luca Massaron and Konrad Banachewicz, thanks for being here! My question is: how do I get started with Kaggle? Is it something like LeetCode, where there’s a problem statement and we start coding? As a beginner, do we just dive into ongoing competitions or practice a bit with older ones? I want to test my knowledge of basic concepts and learn new things by reading others’ code, but I don’t know how to get started.

Luca Massaron

Hi Ashish Lalchandani! We are really glad to be here presenting The Kaggle Book as book of the week!
Breaking the ice is one of the difficult things with Kaggle. I suggest you first try introductory competitions such as the Titanic one (https://www.kaggle.com/competitions?hostSegmentIdFilter=5) or, since they are livelier and a bit more challenging, the Tabular Playground competitions (https://www.kaggle.com/competitions?hostSegmentIdFilter=8). That way you quickly get a working knowledge of how Kaggle works and find your own way to enjoy it. That is not necessarily by competing: you can participate in many ways, by writing your own code for EDA or modelling, by studying others’ notebooks, or by discussing possible solutions to the challenge on the forums. It is up to you to decide what you find most useful for learning.

Ashish Lalchandani

Thank you very much Luca! Thanks for providing the links too, much appreciated! I usually practice on Jupyter Notebook/Google Colab, and am really excited to compete on Kaggle, but never knew how. Thanks for providing your valuable insights! And all the very best for your book!

Alexey Grigorev

What do you think is the biggest disadvantage of Kaggle?

Luca Massaron

Too much gamification sometimes leads participants to look for shortcuts: voting rings, leaks, reverse engineering, and various tricks that are far from real-world practice.

Alexey Grigorev

And what’s your favorite part about it? =)

Luca Massaron

Surely the learning part and the opportunity to try anything that comes to my mind in order to solve the proposed problem :-)

Alexey Grigorev

What do the winning solutions use these days? Is it still stacks of xgboost models or something else?

Luca Massaron

When it comes to tabular data problems, ensembles are still the best solution in most cases. In the majority of competitions, which are deep learning ones, the SOTA or most recent architectures tend to win if used properly and, again, ensembled.
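To make “ensembled” concrete, here is a minimal sketch of the simplest form of ensembling, a weighted average (blend) of the predictions of several models. It is only an illustration, not a winning recipe: the function name `weighted_blend`, the three model names and all the numbers are hypothetical.

```python
def weighted_blend(model_preds, weights=None):
    """Average per-row predictions from several models, optionally weighted."""
    n_models = len(model_preds)
    if weights is None:
        # Default to a plain (equal-weight) average.
        weights = [1.0 / n_models] * n_models
    total = sum(weights)
    n_rows = len(model_preds[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, model_preds)) / total
        for i in range(n_rows)
    ]


# Hypothetical probabilities from three different models on the same four rows.
gbm_preds = [0.90, 0.20, 0.60, 0.10]
nn_preds = [0.80, 0.30, 0.40, 0.20]
linear_preds = [0.70, 0.10, 0.50, 0.30]

# Equal weights, then a blend that trusts the first model twice as much.
equal = weighted_blend([gbm_preds, nn_preds, linear_preds])
tilted = weighted_blend([gbm_preds, nn_preds, linear_preds], weights=[2, 1, 1])

print([round(p, 3) for p in equal])
print([round(p, 3) for p in tilted])
```

In practice the weights would be chosen on out-of-fold validation predictions rather than by hand, so that the blend is not tuned on the same data it is scored on.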

Leo

Hi Luca Massaron Konrad Banachewicz. Thanks for bringing the book.
Kaggle competitions are sometimes quite different from real-world data science and machine learning projects. Do you have suggestions on how to gain domain knowledge faster and transfer the experience and learning from competitions to real-world projects?
How long, or how many competitions, does it take for a newcomer to reach Kaggle Expert or even Master level? What does the roadmap look like?

Luca Massaron

Actually, Kaggle is not all that far from real-world experience. The problems and the models used are similar, if not the same. By tackling some competitions, you will learn a lot of practical details about the domains the competitions cover, and you will get a sense of what could and couldn’t work in a specific situation. You will see the best-performing techniques and pick up a lot of ideas to apply to real-world problems just by participating in a single competition!

Luca Massaron

As for progression in the Kaggle rankings, the progression system has some requirements (https://www.kaggle.com/progression), so for instance it takes at least a few competitions with the right medals to achieve the level you want. That said, in my experience it is just a matter of the time and effort you put into it. As a general rule, the more time and effort you invest, the sooner you get there, but the exact time also depends on your starting skill level, how fast you learn, and whether you have good machines at your disposal for fast experimentation and prototyping. Since you learn by doing, doing things faster (because your machine can train or process the data more quickly) will also accelerate your learning.

Leo

Luca Massaron Thank you.

Michal

Hi Konrad, Luca,
Great book, I really enjoyed reading it. I found it highly practical, and learned a lot of new techniques! (Keep me out of book giveaway 🙂)

Do you think people can compete successfully (top solutions) in Kaggle with limited resources (standard computers)?
What is your opinion on AutoML in Kaggle competitions? Can winning solutions be based only on AutoML algorithms? (assuming feature engineering is done by people)

Luca Massaron

Hi Michal, I am super glad to hear that you enjoyed our book and you found it useful for you! Great!
Coming to your questions:

Some competitions can be cracked using just Kaggle Notebooks or Google Colab. And sometimes even in more complex competitions, such as the M5 (https://www.kaggle.com/competitions/m5-forecasting-accuracy), which seem to demand huge resources, you can actually find top-10 solutions that need no more than the resources Kaggle offers for free. Don’t let the lack of a powerful computer at home, or not wanting to rent a machine in the cloud, stop you!
As in real-world problems, AutoML can be useful in certain Kaggle competitions, for instance for blending or stacking. The real problem is that, if AutoML works well in a competition, you cannot expect to be the only one with access to it. When a resource is freely available, it cannot provide much competitive advantage!

Tânia Ferreira

Hello Luca Massaron and Konrad Banachewicz, congratulations on the book and thanks for the Q&A! 🙂

What were the main lessons you learned from this journey to become Kaggle Grandmasters?
In your opinion, can a candidate for a Data Science job stand out by including Kaggle challenges in their resume/portfolio?

Luca Massaron

Hi Tânia Ferreira!
Actually, I learned so much from Kaggle that it is difficult to write everything down! Generally speaking, I can say that, above all, Kaggle helped me take a more practical approach to problems, experiment faster and more effectively, and figure out solutions or methods that I wouldn’t have considered or known about just from reading books and from my own experience.

Luca Massaron

As for the portfolio question, yes, definitely, especially if the Kaggle competition is on a problem related to the company you are applying to and if you can demonstrate that you did some original and careful work in those challenges. There is no need to have done many challenges or to have scored very high on the leaderboard. All you need to demonstrate with such a portfolio is your passion for data science, your skill in coding a solution, and your attention and effort on a specific problem.

Tânia Ferreira

Luca Massaron thank you very much :)

Akmal Azzam

Hi, I wonder if someone could kindly share the book’s table of contents, so we have a bit of the global picture.

Luca Massaron

Hi Akmal Azzam, you can find the detailed list of contents, together with a sample of the first chapter, here:
https://www.packtpub.com/product/the-kaggle-book/9781801817479

Akmal Azzam

Thank you for sharing!

Carlos Orjuela

Hi Konrad, Luca,
thanks heaps for taking the time to answer. I’d like to ask you how to tackle imposter syndrome within Kaggle itself 🙂… which might sound counterintuitive. I’ve tried a couple of times to enter competitions that were supposedly aimed at newbies, but already found outstanding solutions there, using ensemble methods for example, which are a bit daunting to compete with.

Luca Massaron

Hi Carlos Orjuela! I can understand how you feel. My start at Kaggle was a real disaster and it took time before I could climb the rankings. Naturally (I didn’t know that at the start) you first need to dive deep before climbing up, because you have to learn how to compete, something we had in mind when we wrote The Kaggle Book! Of course, seeing that your rank is not so high can be disappointing and may discourage you from investing more time and effort in improving it. But wait a moment: that is only a problem if you focus just on rankings. In reality, rankings are relative and ephemeral, and they are often not given much weight outside of Kaggle. Instead, focus on the fact that learning is the key. Try participating in order to learn, just take part in discussions and notebook building, and don’t care too much about rankings for a while. Very soon you will have gathered enough competitive skills and knowledge that you will rank higher in competitions without much effort. Just refocus on learning for now!

Carlos Orjuela

Awesome, much appreciated 🙂 Luca Massaron

Bharat

Hi Luca Massaron and Konrad Banachewicz congratulations on the book, and thanks a ton for being here for the AMA! (Ask us anything maybe?) 😄

How much of the real-life data science workflow does participating in a Kaggle competition really simulate? (This question is probably oft repeated, but your unique perspectives as Kaggle Grandmasters would help us.) I wouldn’t be surprised if it has its own chapter/discussion in the book 😛
How much are Kaggle Kernels/notebooks of the winners and other grandmasters, as well as starter and EDA notebooks, representative of the domain knowledge required for a Kaggle competition? Rather, how would you go about acquiring domain knowledge for a Kaggle competition, as that is emphasized for success in Data Science?

Luca Massaron

Hi Bharat and thank you a lot for your words!
As for your questions, I have to say that:

Kaggle resembles a part of real-life data science: the problems you tackle are real ones, the models you use in competitions are what you would use also in real settings and all the problem solving skills you get from participating in Kaggle are solid and true in every situation!

Luca Massaron

Even if you don’t have much domain knowledge, you can always catch up by reading the forums and researching blogs and papers on the topic. That should be enough even for very specialized domains! Ah, and don’t forget to look for similar Kaggle competitions: you can learn a lot by reading them!

Evren Unal

Hi Luca Massaron,
I am used to debugging code to better understand it.
Does the Kaggle platform provide any debugging option for Python code?
If so, do you teach it in your book?

Luca Massaron

Hi Evren Unal! They provide a notebook environment, where you can use the debugging strategies that are commonly used with notebooks and Colab.
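As an illustration of one such strategy, here is a minimal sketch of a non-interactive “post-mortem”: after an exception, walk to the frame where it was raised and inspect its local variables. It assumes only that the notebook behaves like a standard Jupyter/IPython session; the helpers `mean_price` and `failing_frame_locals` and the sample rows are hypothetical. In an interactive session, the IPython `%debug` magic (run in the cell after the failure) or Python’s built-in `breakpoint()` give you the same state with a `pdb` prompt.

```python
def mean_price(rows):
    # Hypothetical helper with a bug: a missing price is None and cannot be summed.
    total = 0.0
    for row in rows:
        total += row["price"]
    return total / len(rows)


def failing_frame_locals(exc):
    """Return the local variables of the innermost frame where exc was raised."""
    tb = exc.__traceback__
    while tb.tb_next is not None:
        tb = tb.tb_next
    return dict(tb.tb_frame.f_locals)


rows = [{"price": 10.0}, {"price": None}, {"price": 30.0}]

try:
    mean_price(rows)
except TypeError as exc:
    # Same information an interactive debugger would show at the failing line.
    state = failing_frame_locals(exc)
    print("failed on row:", state["row"], "running total:", state["total"])
```

The advantage of the interactive tools over this sketch is that you can step through the code and evaluate expressions on the spot; the sketch is handy when a notebook runs unattended and you only have the logs.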

insop

Hi Luca Massaron Konrad Banachewicz
Thank you for the awesome book. I like your approach of describing Kaggle problems together with real-world problems, such as comparing evaluations and metrics.
My question is: are there any competitions you would recommend reviewing for learning purposes?
Thank you so much,

Luca Massaron

Hi insop and thank you for your compliments! Since I work a lot with tabular data, I would suggest the Tabular Playground Series (https://www.kaggle.com/search?q=tabular+playground+series), but I am a bit biased for that! What is your work or research interest at the moment?

A Joseph

Hi Luca Massaron,
Thanks again for the awesome book. One thing which always confused me is the selection of a particular dataset when there are a number of similar datasets available for a particular problem/topic. Do you have any suggestion or any filtering criteria you personally use when it comes to selection of a dataset such as usability score etc?
Thanks again!

Luca Massaron

Hi A Joseph, thank you for your kind words! When I had the chance to use a Kaggle dataset for a problem (or as an example when writing a book chapter), I tended to evaluate its quality directly, weighing many considerations against the purpose I needed it for. So I often did some EDA on the data and tried to build a basic model to check whether the dataset really met my needs! Another aspect to care about when using Kaggle datasets is fidelity and whether the data is maintained. Fidelity is an issue if the data has an original source that we expect it to represent (in many cases the dataset is a mirror of another source). Maintenance, instead, is necessary if the data represents some ongoing phenomenon (such as a time series, for instance).
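A quick first pass of that kind can be sketched as follows. This is only an assumed minimal version of “some EDA plus a basic model”, not the exact procedure described above: it reports the share of missing values per column and the accuracy of a majority-class baseline, which any real model should beat. The inline CSV, the column names and the function `quick_profile` are hypothetical stand-ins for a downloaded dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a downloaded dataset file.
RAW = """age,income,churned
34,52000,0
,48000,0
45,,1
29,61000,0
51,75000,1
"""


def quick_profile(csv_text, target):
    """Summarize missing values and a majority-class baseline for one target column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Share of empty cells per column.
    missing = {
        col: sum(1 for r in rows if r[col] == "") / len(rows)
        for col in rows[0]
    }
    target_counts = Counter(r[target] for r in rows)
    # Accuracy obtained by always predicting the most frequent class.
    baseline_accuracy = max(target_counts.values()) / len(rows)
    return {
        "n_rows": len(rows),
        "missing": missing,
        "target_counts": dict(target_counts),
        "baseline_accuracy": baseline_accuracy,
    }


profile = quick_profile(RAW, target="churned")
for key, value in profile.items():
    print(key, value)
```

If the missing-value shares are high or a simple model barely beats the baseline, that is an early hint that the dataset may not serve the purpose you had in mind.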

A Joseph

Cool. Thanks a lot 👍🏼

Prashant Choudhary

Hi Luca Massaron
Pretty interesting book. My question is: what are some efficient ensemble models or approaches that combine multiple models that are different in nature from each other? Feel free to refer me to any book or article on the topic.

Luca Massaron

Hi Prashant Choudhary! In The Kaggle Book we have an entire chapter dealing with that (blending and stacking). I cannot but refer you to that specific chapter to figure out why, when and how to ensemble models!

Luca Massaron

P.S. We did a lot of research on that; you can find all the state of the art written down there 🙂

Prashant Choudhary

Awesome!🙂

Muhammad Awon

Hello Luca Massaron, congrats on the book. Looking at the content, I see you explain every aspect of how to ace Kaggle competitions and highlight the key elements of machine learning. I’m just curious, which one topic would you like to add to the book if you have the opportunity again?
Secondly, since the book is about the practical side of Data Science and Machine Learning, would you recommend it as part of the high school curriculum for aspiring young students?
Thanks for writing a comprehensive and invaluable book.

Luca Massaron

Hi Muhammad Awon! At a certain point the book was growing so thick that we had to stop! I regret not being able to include all the interviews I wanted and not having space for chapters showing how to practically reproduce the Kaggle solutions to the most renowned competitions. By replicating top solutions you can really learn a lot, and it can accelerate your path to becoming a great Kaggler!

Luca Massaron

The Kaggle Book has many facets. Yes, it can be used as a text for applied machine learning, too 🙂 You won’t find any other book around with such extensive chapters on practical topics such as ensembling or doing proper cross-validation.

Muhammad Awon

Very insightful. Thank you.

Paul Priest

Luca Massaron Any fun or niche areas where you think Machine Learning has yet to be applied but could be useful? I’m a big fan of food and cooking and I think there are some interesting possibilities for creating new recipes…though the results might be subjective 🤣

Muhammad Awon

As an Ex-chef myself, we share the same interest 😃

Luca Massaron

Any field can benefit from machine learning, actually; you just need the imagination to fit it into an existing process or to figure out new processes that leverage it!

Tim Becker

Luca Massaron, congratulations on the book. I was wondering how much time you usually spend on a Kaggle competition. To me the time commitment is a bit scary, because it seems to be a major task to create anything remotely acceptable. Or is it less time-consuming than it seems?

Luca Massaron

It depends on the purpose I have for that competition. If I aim for a top ranking, that could mean spending all my free time on it for a long while, a month to a month and a half. If instead I just plan to learn from the competition, by trying to refactor some notebooks or implementing some ideas of my own, I could spend as little as 4-8 hours on it.

Tim Becker

thank you

DataTalks.Club

by Luca Massaron, Konrad Banachewicz

The book of the week from 19 Sep 2022 to 23 Sep 2022

Questions and Answers