Season 24, Episode 1

Competitions: Beyond the Kaggle Leaderboard | Tatiana Gabruseva

Transcript

The transcripts are edited for clarity, sometimes with AI. If you notice any incorrect information, let us know.

Tatiana’s journey from academia to staff software engineer

Alexey: Hi everyone, welcome to our event. This event is brought to you by DataTalks.Club, which is a community of people who love data. We have weekly events and today is one of such events. You can find out more about the events we have in the description. There is a link there. Then do not forget to subscribe to our YouTube channel. Pretty important. Join our community where you can hang out with other data enthusiasts. During today's interview, you can ask any question you want. There will be a pinned link in the live chat. I just realized I did not do this, so I will do it right now. I will put the link there. (0:00)

Alexey: Today it is very early in the morning, or maybe not that early. Usually by this time I am up, but I went to bed pretty late yesterday because I was preparing for a workshop today. Multitasking. Oh no. Okay, I am getting there. (0:37)

Alexey: Amazing. Now there is a pinned link in the live chat. In the live chat, you can use this link for asking questions. We have a guest that is coming to us over and over again. I am really happy about her appearance on this podcast interview, Tatiana, Tati. This week we will be talking about how ML competitions can help grow your career, how to turn technical work into real opportunities, and why evaluation and benchmarks have become such a big challenge in AI. (1:11)

Alexey: Our guest today is Tatiana. Tatiana is a staff software engineer in machine learning at LinkedIn where she works on recommender systems, multimodal alignment, and AI system benchmarking. She is a Kaggle competition master, the winner of the sound demixing challenge 2023, and the author of more than 27 peer reviewed publications. Welcome back to the podcast. (1:43)

Tatiana: Thank you for having me again and again. It is great to be invited. When was the last time I was here? I think we talked about getting a job as an AI engineer, was it? And what the role of a staff engineer is at big tech companies? (2:08)

Alexey: Good that you remember it. We also had an interview about getting into data science and machine learning, correct? (2:27)

Tatiana: Yes, about how to transition careers, learning resources, and stuff. The last time was around 2023 actually. (2:40)

Alexey: Now I was reading your introduction and I realized that this sound demixing challenge 2023 sounds like something I do not know about. This is probably something we did not cover before. Maybe there are some people, I doubt that, but maybe there are some people who do not know you. You should fix this as soon as possible. After we finish this interview, you go to our website, you go to events, and then you type in Tatiana in search to see the episodes we had before. You will need to fix that. Right now, maybe you can introduce yourself and tell us about your career journey so far. You do not have to go into details, so maybe we focus on the last couple of years between our previous interview and now. For those who do not know about you, please tell us more. (2:47)

Tatiana: I am Tatiana. I work as a tech lead at LinkedIn. When we last spoke, I was an AI staff engineer at LinkedIn. Since then, I moved countries: I moved to Silicon Valley thanks to my wonderful company, which relocated me. My title changed slightly to staff software engineer, machine learning, but nothing changed in reality, just the title. (3:47)

Tatiana: Some exciting new projects came into my life. One of those exciting projects is named Nikita. It is a wonderful new neural net of about 100 billion neurons, we do not know exactly. He is now 14 months, so I have been on maternity leave for a year now. I just came back to work. That is pretty much a wrap up of the news since we last spoke. (4:18)

Alexey: Before that, maybe you can take a few steps back and also tell us how you ended up doing what you do. (4:52)

Tatiana: Good point. I originally started with a PhD in physics, which I did in Ireland, and then for a while I stayed in a career in physics, working on signal processing mostly for lasers. I was also developing some computer science methods, including machine learning methods, for processing those signals. Back then, the machine learning term was not really coined and it wasn't that popular, but physicists used it everywhere. You basically had to fit your experimental data to some theoretical curves to find out some parameters, which is the essence of machine learning as such. Speaking of machine learning and computer science methods, it means that I have been doing it from 2008 really, but it was applied to optical signals. (4:59)
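
Editor's note (illustrative sketch, not from the conversation): the idea of fitting experimental data to a theoretical curve to recover its parameters can be shown with a tiny least-squares example. The exponential-decay model, the synthetic signal, and the function names below are assumptions chosen for illustration; they are not Tatiana's actual signal-processing code.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_exponential_decay(ts, signal):
    """Fit signal = amplitude * exp(-rate * t) via a line fit on log(signal).

    Requires strictly positive signal values so the logarithm is defined.
    """
    slope, intercept = fit_line(ts, [math.log(s) for s in signal])
    return math.exp(intercept), -slope


# Synthetic, noise-free "measurement" with hypothetical parameters:
# amplitude 2.0 and decay rate 0.5.
ts = [0.0, 1.0, 2.0, 3.0, 4.0]
signal = [2.0 * math.exp(-0.5 * t) for t in ts]

amplitude, rate = fit_exponential_decay(ts, signal)
```

With real, noisy measurements the recovered parameters would only approximate the true ones, and a nonlinear fit is often preferable to the log-linear shortcut used here.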

Machine learning applications in physics and signal processing

Tatiana: Then I decided, after I had two maternity leaves in a row, to look around and make a change. That is when the AI field popped up. It became popular, and everybody was talking about it. I got interested, especially as I was on maternity leave for quite a while, five years in total for two maternity leaves. (6:01)

Tatiana: The first months gave me a bit of time to look around and try things, and that is how I got into courses on Coursera. I learned about AI and realized that what I had been doing all those years is actually very related to it. I started taking part in Kaggle competitions, was able to sell my results quite well, and landed my first consulting job. Then I got my first job in a hospital thanks to those experiences. Then I moved to LinkedIn in Ireland, and with LinkedIn in Ireland, I relocated in 2024 to Silicon Valley where everything is happening. So many people want to come here for career opportunities. (6:16)

Alexey: If anyone wants to learn more about Tatiana and her story again, if you go to our podcast and you look for transition from academia to industry as a staff AI engineer, that is one of them. Another one is switch to computer vision and deep learning roadmap, Kaggle projects, and mentoring. We changed names to reflect what we actually discussed, but the point is you just go to our podcasts and type in Tatiana. I do not think we had any other Tatianas on our interviews, maybe we should address that and find another Tatiana if you can introduce us. We talked in depth about how to do the transitioning and the role of the staff engineer, so these are really cool episodes. I remember the first one I did from the office of the previous company where I worked, which was really cool. (7:01)

Alexey: You mentioned Kaggle competitions, and in general competitions, and I know that for you personally, competitions were the way to get into machine learning. For me, I was already into machine learning, but what helped me was to get actually hands-on experience. Before that, I did my masters and I thought I knew everything about machine learning until I tried competing in Kaggle competitions. I realized that all those two years of my masters were useless because I could not do anything on Kaggle. I felt so embarrassed, but then I started doing more and more, and then I felt very comfortable. For me, all the machine learning problems I had at work were kind of easy. Why do you think they are useful for advancing your career? Today we actually want to talk about competitions too. Why is it a good idea to take part in competitions? (8:06)

Skill development and domain diversification on Kaggle

Tatiana: There are two components in it. One is learning. Kaggle especially is the best for learning because it has a whole community around the discussions. People show their best solutions as well. They have post-mortems, where people, usually those at the top, write very detailed reports about what actually worked and what did not. You have a constant feedback loop through the leaderboard, which is gamification but also helps you see how you are doing in the competition. You learn a lot and iterate through a lot of approaches quite quickly because people also create starter notebooks. And not only starter notebooks: there are people who really help by creating starter code that you can go through. (9:13)

Tatiana: This learning aspect is one thing that can help you to grow your career obviously, because you are just learning more about different approaches. The other aspect is how you leverage your participation in competitions, if it wasn't too bad, and it does not have to be winning by the way, in a way that can progress your career. Those are two separate things basically. For learning, I would say it is important to keep changing your domains. I have seen some people who just focus on one thing, let's say classification, and they create a lot of ensembles just classifying images. They win a silver medal, a silver medal, and a gold medal. They are a Kaggle master, but all they know is the classification of images. (10:05)

Tatiana: This may not be so helpful if you go to a wide variety of companies for interviews and they ask you different questions. As opposed to that, you can do things differently. You can focus not on winning the competition or getting a gold medal, but on not repeating your domain. At the start, you do something for time series to get a sense of different domains. You do something for computer vision, for segmentation, and for detection. You do something for 3D if you want to focus on computer vision. If you want to be more of a generalist, also do one for time series, one for natural language processing, and keep it varied. (10:53)

Tatiana: Yes, you will not do that well compared to the person who is just focused on one topic, because that person will have all the pipelines ready for that topic. The amount of knowledge you are going to get will be pretty varied across different domains, and that helped me on different interviews myself. Because I encountered different interviews on different topics, I was able to discuss different topics, had some intuitions, and knew about up to date approaches because I kept varying domains. (11:45)

Alexey: I did the same. For me, it was basically looking at what is now active on Kaggle, let me take part there, and it was different domains. Then I think I stopped doing Kaggle because at some point it was just image competitions and I had no interest in that. You actually did have interest, and you took part in many things and mentioned that there are so many different domains. I am going off the script a little bit, but I am really interested in your opinion on that. You mentioned that there are competitions on AI. What are those competitions like, and what can you actually do on generative AI? It seems like now with LLMs, you can just ask an LLM to take part in a competition and it will, correct? (12:15)

Agentic AI benchmarks and automated competition entries

Tatiana: Yes, it will. There are papers where agents are already getting some percentage of silver and gold medals. I have seen a paper that shows how an AI agent does on Kaggle competitions. Pretty soon Kaggle could actually be used as an agentic AI benchmark environment where different agents are used. Agentic AI competitions are a bit of a complicated topic because you are evaluating a model in a harness within an environment. Kaggle itself can be used as a platform for such agents, and there was a paper recently where they showed quite a good percentage. I do not remember the numbers, but we can put that paper in the links later. Quite a decent number of silver medals has already been achieved by an AI agent alone, so it means that in six months we will probably have competitions between different AI agents on the Kaggle leaderboard. (13:35)

Alexey: Have you seen this idea from Andrej Karpathy about automated research? (14:15)

Tatiana: Can you remind me? (14:22)

Alexey: A few weeks ago, if you opened Twitter, this was all that people were talking about. Basically, the idea from Andrej was using LLMs to optimize a metric. You have a metric and then you do different experiments, and if the experiment succeeds, then you keep this. It is basically usual optimization. You do an experiment, you have a metric, you see if this metric value is better than the previous metric value, and you keep the result. Plus, he did a few smart things, like not allowing the experiment to run for more than two minutes and things like this. There are some heuristics on top of this. He called it automated research, and a lot of people picked this up. I was thinking, haven't people been doing this for ages on Kaggle? Optimizing to a single metric. (14:27)
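
Editor's note (illustrative sketch, not from the conversation): the loop Alexey describes (run an experiment, compare its metric with the best so far, keep the change only if it improves, and cap how long an experiment may run) is a greedy search. The code below is a minimal version under stated assumptions: the function names, the toy metric, and the random proposal step are hypothetical stand-ins, with no LLM involved, and an over-budget experiment is discarded after it finishes rather than interrupted.

```python
import random
import time


def run_experiments(evaluate, propose, initial, n_trials, budget_seconds):
    """Greedy search: keep a candidate only if its metric beats the best so far."""
    best = initial
    best_score = evaluate(initial)
    for _ in range(n_trials):
        candidate = propose(best)
        start = time.monotonic()
        score = evaluate(candidate)
        elapsed = time.monotonic() - start
        if elapsed > budget_seconds:
            # Over the time budget: discard the result. This sketch does not
            # abort a running experiment; it only ignores slow ones.
            continue
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins: the "experiment" is a single number, and the metric
# peaks at 3.0. In the setup discussed above, an LLM would propose the change.
rng = random.Random(0)


def toy_metric(x):
    return -(x - 3.0) ** 2


def toy_propose(x):
    return x + rng.uniform(-1.0, 1.0)


best, best_score = run_experiments(
    toy_metric, toy_propose, initial=0.0, n_trials=200, budget_seconds=120.0
)
```

Because the loop only ever accepts improvements on one number, it will happily overfit to that number, which is the Kaggle-versus-production concern raised in the reply that follows.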

Tatiana: Yes, exactly. It is the essence of AutoML. For me, it was like, don't people do this on Kaggle all the time, using an LLM for optimizing a metric? Overfitting to a single metric is one of the problems of Kaggle versus production. As you said, there have been times of AutoML when it was doing exactly the same. I think H2O did AutoML, and Anthony, who was back then CEO of Kaggle, did some submissions with AutoML. The thing is, it wasn't doing that well. It wasn't winning gold medals, as I remember. (15:29)

Tatiana: Now the results are becoming much better, and if you consider the current trend of how quickly things are improving with AI agents, it means that we have maybe six months before AI agents will be winning all the competitions on Kaggle. That is quite possible. (16:12)

Alexey: I was just thinking, what if I let Claude use the same approach, something like this automated research? Just say, okay, you have only five attempts per day, use them wisely, prepare your cross validation, make sure it correlates with the leaderboard, and now go experiment. Then I would just set it free to do whatever it wants. I would give it a proper environment, like a powerful machine, and say, okay go, now here is my API key for Kaggle, do not submit more than five attempts or you will be penalized. Come up with some wording like, you will be penalized if you do this often, so you have to establish correlation between the public leaderboard and your cross validation, so now go experiment. I am pretty sure with a few competitions, maybe four or five, you can master this framework to the extent that it will achieve decent performance, maybe not a gold medal or top 10, but still decent performance. (16:29)
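
Editor's note (illustrative sketch, not from the conversation): two pieces of the setup described here can be made concrete, namely checking that local cross-validation scores track the public leaderboard and enforcing a daily submission cap. The scores below are made up, the class and function names are hypothetical, and nothing here calls the Kaggle API.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


class SubmissionBudget:
    """Tracks a daily submission cap (five per day in the scenario above)."""

    def __init__(self, per_day=5):
        self.per_day = per_day
        self.used = 0

    def try_submit(self):
        """Return True and count the submission if the cap allows it."""
        if self.used >= self.per_day:
            return False
        self.used += 1
        return True


# Hypothetical scores for four past submissions: local cross-validation
# score and the public leaderboard score each one received.
cv_scores = [0.71, 0.74, 0.78, 0.80]
lb_scores = [0.69, 0.73, 0.75, 0.79]

cv_lb_correlation = pearson(cv_scores, lb_scores)
```

A high correlation suggests that local validation can stand in for the leaderboard, so the few daily submissions can be spent on confirming results rather than on exploring.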

Deep technical mastery versus leaderboard gamification

Tatiana: Decent performance, yes. I will link that paper where they got quite a lot of silver medals. Speaking of that, we started with using Kaggle as a learning platform teaching us humans about different domains. Obviously, if you do what you said, what will you learn? You will learn how to use Claude for taking part in competitions, but that is it. (17:43)

Alexey: I think maybe you will also analyze the results, like what Claude actually did. (18:08)

Tatiana: It is also learning, but I think if you want to touch different domains and understand in depth the data and all those steps, you can automate them nowadays, but I believe that some steps you should do yourself if you want to really learn something deeply. For example, I can ask Claude to set up Kubernetes and all that stuff, but it will not make me a Kubernetes expert. If you want to know something in depth, it is the same as Andrej Karpathy suggested. He implemented a neural net from scratch on NumPy. Did he have to do it? No, Claude can do that now, and PyTorch was available back then when he did it. He chose to do that from scratch for a good and deep understanding of things. (18:19)

Tatiana: So should you, or anybody who wants to learn. Even if Claude can do that for you, that will not bring you that level of learning. Follow Andrej Karpathy, who implemented neural nets from scratch starting from basic principles. I see a lot of similarity between those approaches and doing Kaggle yourself still. (19:08)

Alexey: Right now I am in Amsterdam. There is a conference in Amsterdam. I do not go to this city often, so this is my second time. I met people who lived there and we had a very interesting discussion. A lot of people at that dinner where we got together were into education, and we were thinking, okay, now everyone can use ChatGPT to learn about things, but people still take courses. In DataTalks.Club, we run a course about machine learning, and last year in September, LLMs were already quite good. Technically you can just use ChatGPT to learn a topic, but people were still coming to us to take courses. Why do you think it is the case that despite all this AI advancement, people still want to learn things themselves and also maybe take part in competitions? Why aren't we completely outsourcing it to an LLM? Why do we still want to learn it ourselves? (19:33)

Tatiana: Your courses are hands-on, that is why I like your courses. They are so popular because you come from this developer background into education. You have done a lot of things hands-on. You have been a principal engineer before you started to teach other people, so you know the value of hands-on work with different technologies, and your courses are purely hands-on. If you learn something with an LLM, you are missing that part. You probably had it as well in university. You read some theorems and you read some basics in mathematical analysis, but it does not settle in your head until you open Demidovich and you do those problems yourself. (20:45)

Alexey: Demidovich is who? (21:28)

Tatiana: It was a famous book. I know who that is, but probably not everyone has the same background as us, living in the post-Soviet space. It contains hardcore problems in higher mathematics written by him, and you had to go through the problems yourself. That is how we learned calculus. (21:34)

Tatiana: The same applies here. If you just read about how things are done, it gives you an illusion of learning. Reading gives you an illusion of learning, while the real learning comes from implementation when you just sit down and go through it hands-on. Do that with your hands-on work, not Claude's hands-on work, because Claude setting up Kubernetes for me did not make me a Kubernetes expert. It saved my time, so if I do not want to be a Kubernetes expert, okay, I will take it. If I want to know something in depth and in detail, I need to be doing it myself to learn. Your courses are hands-on, which is a huge advantage, and nothing can replace that in terms of learning. You can read something, but as I remember from university times, it does not settle in your head and it does not become knowledge unless you go through the problems and solve them. Even in university or in school, it is the same as in courses, just learning by doing. (21:58)

Hands-on implementation and the illusion of learning

Alexey: It brings back what I mentioned at the beginning. I spent two years doing my masters, and I thought I was a machine learning expert until I tried competing on Kaggle. Then I realized that I knew nothing. I remember my first submission could barely get me into the top 66 percent. My first submissions failed and I felt so miserable, so embarrassed, and this was a very good motivation to actually do something. Because I was motivated to do something, I learned a lot. These problems on Kaggle or other competition platforms, and I also took part on other platforms which we want to talk about too, give you a problem but do not tell you how to solve it. This is typically not a textbook problem, because in a textbook you solve it by hand. You derive Naive Bayes, and now you know how Bayes' rule works, cool, but can you actually apply it and do something useful with this? Usually the answer is no. Kaggle and other competition platforms give you that, so this is really cool. There is a comment that I see: "Kaggle competitions are getting unrealistically impossible." What is your interpretation, Tatiana, of this comment? (23:04)

Tatiana: People are still winning. I know people who just won some competition recently for 3D images. Maybe it means they are becoming more and more popular, so for other people it is difficult to compete. Kaggle has never been an easy platform to win, and I want to emphasize it because it was so popular. The way it is designed, unlike Topcoder for example and some other platforms, allows a certain degree of manipulation and even cheating. I would not expect completely fair competitions on Kaggle in some domains, as we have seen before, and I would not focus on winning Kaggle competitions because of that as well. (24:25)

Tatiana: Certain manipulations can be applied before the test set is opened. People could help themselves by making tailored rules for a particular image in that test set, for example, and that gives you enough advantage to win. The difference in the metric between first place and 10th place is often tiny. People scrape additional data for training when it is not supposed to happen; we remember a lot of stories about that. Maybe let's not go too much into it. (25:22)

Specialized platforms and fair competition environments

Tatiana: Kaggle is the best platform for learning. If you want to win in a fairer environment and earn money, go to Topcoder. You will submit a Docker container, and they will run your model on the training data themselves. It might be not reproducible or it will depend on how you fix random seeds and manage that all, which is up to you as well. Then they will run inference, and all that will be under constraints of GPU time, and let the winner win. That kind of setting Topcoder had for SpaceNet competitions, for example, produces a much fairer environment in my opinion. (26:01)

Tatiana: People who really want to win and believe that they have good models and good knowledge of the domain do not always go to Kaggle. They go to those specialized platforms and specialized competitions basically to earn money because they know they are good. SpaceNet 9 was a 20,000 dollar first prize. Fair competition, and I think it is Docker based. When I was taking part in SpaceNet, it was Docker based where they would train on their data, so you cannot introduce additional training data and you cannot manipulate the test set. Lots of things are not possible, and it is not that popular to do all those private sharing channels because it is just Topcoder, which nobody cares about. (26:47)

Tatiana: Because Kaggle became so popular, there was even a scandal when in some countries they were selling places in teams for a certain price to join a team and get a silver medal on Kaggle. You pay for things like that to happen. (27:33)

Alexey: How much do you pay to join a team to get a silver medal? (27:51)

Tatiana: I do not remember now because it has been years, but it wasn't that much. Gold was expensive, but yes. (27:59)

Alexey: Gold was expensive, makes sense. For a gold medal, if you get a gold, you become a grandmaster or a master, correct? (28:06)

Tatiana: You need two silvers and a gold to become a master, but the gold is the most complicated obviously. (28:18)

Alexey: They have even been trading things like that. Does being a Kaggle master still hold any value? (28:23)

Tatiana: I do not think so. I can tell you when I got Kaggle master, it was around 2020 or 2021, it did not give me any career opportunities. (28:36)

Alexey: For you personally, did you feel like you made an achievement? (28:48)

Tatiana: No, I am quite indifferent to virtual medals. I am not a true Kaggler. I learned and as a result I got Kaggle master, but it was not really something that I craved a lot as some people do. I hoped it would give more career opportunities, but no. It is not the Kaggle master title that gave me career opportunities, but how I leveraged the competition results, and not necessarily the best competition results by the way. (28:54)

Alexey: I remember that if you want to get a gold medal, you have to build ensembles and you have to have a three-level approach. This is what most people are doing, and it is not creative work. That was my impression, and at the end, I kind of got burned out by this constant race. If you focus on learning, then it is different. You just say, okay, I am fine with silver or whatever medal I get, and I do not really care about medals. I really just want to understand the domain and see how people do things in this domain. If you completely disregard the competitive aspect, then it is a different story. (29:26)

Tatiana: That is the best thing, I think. Do not focus on the leaderboard at all, especially in the beginning. If you are there for learning, forget about the leaderboard and just try to make the best model you can, even a single model. It will not win maybe, but it is okay. You are not there for winning on the leaderboard, you are there for learning. Read solutions of other people, those who got to the top or those who did well in similar previous competitions. A lot of knowledge comes from discussion threads. (30:20)

Tatiana: People share valuable ideas in those discussion threads and they are quite well separated by voting. If you read highly voted comments in discussions, they are usually the ones that have some valuable insights. The same goes for notebooks; the highly rated ones will usually be useful enough to help you learn. Do not get into this gamification business because it will not necessarily be good. (30:55)

Tatiana: About ensembles, we actually entered one competition, my first competition, which I chose on physics because it was about the Large Hadron Collider. I got a bronze medal with no problem, though I could not understand the domain completely. The next one was on astronomy, about astronomical signals, and we got a silver medal, just out of gold on the private leaderboard. We were very close, just one position down below the gold zone unfortunately, with an ensemble of just two models, which were decision trees. I realized that provided valuable insights because when I read the solutions of the top teams, they had huge ensembles. (31:20)

Tatiana: They had huge ensembles and we had just two models. I summarized our findings about the process, features, and so on, and just wrote it as an arXiv paper, a simple arXiv write-up as a technical report. I also did a technical report on Kaggle of course, but a bit more detailed for arXiv. Within a week, I got an invitation from a journal to submit a paper. I checked the journal and it was Q1. Q1 is the first quartile of all Scopus-indexed journals, basically the top 25 percent. That is considered to be a good journal. (32:08)

Tatiana: We had to work a bit more to make it publishable by that journal, but we ended up having a proper journal publication from a solution that was number 13 on the leaderboard. (32:45)

Alexey: Who else submitted a paper? (32:58)

Tatiana: The number one on the leaderboard. He was actually an astronomer. He was a novice on Kaggle, completely a novice, but he won because he was doing a PhD in astronomy. He knew what he was doing, unlike us who were learning astronomy on the fly. As a result, he learned how to apply machine learning to astronomy. I think he had domain knowledge better than everybody on the leaderboard, but maybe not that much knowledge on machine learning tools. He also wrote a paper later on that, and we wrote a paper. You do not have to be winning to produce some artifacts that can later help you. (33:01)

Tatiana: There was a continuation of the story a few months later. Based on my paper, I was invited to present in Dublin for an astronomical conference, but at that time I passed. It also brought some other invitations. (33:47)

Alexey: From what we can hear, and I can confirm, it brings a lot of opportunities. How can we use these competition platforms to create a portfolio? Many people are interested in progressing their careers or switching careers from one thing to another, like right now it could be AI engineering or ML engineering. How can we use competition platforms like that to create a portfolio that I can present to my potential employer? The things we talked about show that if you compete on Kaggle seriously, it is not really production ready, but if you focus on learning, then you can actually turn it into a proper portfolio project, correct? How do we do this? (34:10)

Tatiana: That is a perfect question. Your place on the leaderboard does not have to be first or even in the top to create a lot of very useful artifacts that can help your portfolio. For example, in one competition, the Lyft competition, I got a silver medal. Silver is not that challenging, it was around number 34 or something like that. I created a GitHub repository with a proper readme and description, well organized. I did not write a paper on that because I did not think that I used specific, publishable, innovative hacks like I did for astronomy where we did an innovative method. For Lyft, it was just a silver medal, but I created a clean GitHub repository. (34:59)

GitHub repositories and engineering portfolio building

Tatiana: One of my interviews ended with an offer from that company. They were looking through my resume and I wrote: "Top 5% in Kaggle Lyft competition. Here is the GitHub repo." They actually went to that repository, looked at it, and liked the approach. We were discussing in the interview how I approached that competition. They saw my code and that convinced them that I can do the job. I got an offer for a radar and lidar based startup basically. That is one offer brought by a GitHub repository that I just created out of a competition. (35:24)

Tatiana: In another competition, I did not place at the top, but my co-author did. He had second place and he had some innovative ideas as well. I pinged him and said, "Why don't we write a research paper together?" One way is to do open-source GitHub, and you can do that for any place on the leaderboard. If you want to write a research paper, you want to assess your methods and see what is innovative, to see if you have something to offer. Maybe you used some augmentation that other teams did not use, maybe you found a better way to ensemble different models, or maybe you put in some auxiliary losses that other teams did not put in. It does not have to be solo work; it can be a collaboration. You can actually ping everybody who wrote a write-up for the first, second, third, or fourth place and ask them: "Do you want to write a proper publication out of that? We can submit it to a conference." (36:23)

Tatiana: I wrote a publication for that pneumonia detection competition based on those kinds of hacks, tips, and approaches, and I got into a CVPR workshop and presented at CVPR. What I did not do, but should be doing more, is marketing. Once you get into something like CVPR, don't stop there. Make a blog post, do marketing, and say, "Hey, I am presenting at the CVPR conference on that day. Please join. Here is my presentation." Do it through your LinkedIn and put it everywhere. Make a blog post in simple terms, because what you publish with CVPR may be a bit too technical for everybody to understand. Write an easier blog post in layman's language, put it on Medium, put it on LinkedIn, put it everywhere. Spread the word and separate your results from artifacts. Those artifacts, the GitHub repository, publication, and presentation, if coupled with marketing and blog posts, will give you more opportunities than the winner of the competition will get. (37:28)

Technical marketing via blog posts and LinkedIn

Alexey: I am listening to this and I think, okay Tatiana, all of that is cool, but I would never be able to write a paper for CVPR. It sounds impossible, especially if, let's say, I am considering switching careers and I have not started taking part in competitions, and you are now telling me that I can submit work that I haven't done yet as a paper to an A* conference. How is it possible? (39:02)

Tatiana: I had quite a research career before that, so I had a lot of papers done, including in A* journals like Physical Review Letters. I had this experience of writing in my career. I think here we need to have a buddy who has experience in academic settings, and I think you mentioned that in a previous competition, was it the Lyft one, you teamed up with somebody else and that person was a developer. Probably for that person, having you as a research partner was helpful. I was that research partner, and he hadn't written papers before. That is quite a common practice if you think that you've got some interesting methods, something novel, but you don't know how to sell it well. (40:12)

Alexey: What if I don't have any novel methods? What if I just go and take part in a competition and get my silver medal? Realistically, what are my chances of submitting a paper? (41:06)

Tatiana: That is a good question. A silver medal may look like an average result, but it is not a bad one if you got it with a single model. If you got a silver medal with a CPU and a single model, as opposed to all those ensembles and GPU-heavy computations, you're a star. It all depends, but you just need to assess what you learned and whether you used some innovative approaches. (41:17)

Alexey: How do I know if my approach was innovative? (41:51)

Tatiana: You search for what was done before. Did you create your own data augmentations? For example, in our paper, we had new data augmentations that nobody had done before for that task. We did some auxiliary losses. It is a commonly used technique, but back then it wasn't applied that much to those images. We did checkpointing for ensembling. When I searched for prior work on those approaches, I did not find much, and for those augmentations I found hardly anything at all. (41:56)

Tatiana: We basically took 2D augmentations and created 3D versions of them from scratch. They exist now, but back then they did not, so that is also innovation. You can put it all in code, and you should put it in code because papers with code have high acceptance rates. Reviewers appreciate you open-sourcing the code, and that gives you bonus points, and the code itself helps you with career opportunities. If you don't know how to write a paper for an A* conference, start by making a clean open-source GitHub project out of your competition, something that you would not be too shy to show to a potential employer. With the help of AI, you can now write a decent write-up like a blog. Just try not to generate too much AI slop. Please edit it so it sounds human. (42:34)
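
The "checkpointing for ensembling" idea mentioned above can be illustrated with a minimal sketch. It assumes the technique means averaging the predictions of several checkpoints saved during one training run, which may differ from what the paper actually did. The function name `average_checkpoints` and the probability values are hypothetical and only serve the illustration.

```python
from typing import List, Sequence


def average_checkpoints(
    preds_per_checkpoint: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Average class probabilities across checkpoints.

    preds_per_checkpoint[c][i][k] is the probability that checkpoint c
    assigns to class k for sample i.
    """
    if not preds_per_checkpoint:
        raise ValueError("need at least one checkpoint")
    n_ckpt = len(preds_per_checkpoint)
    n_samples = len(preds_per_checkpoint[0])
    averaged = []
    for i in range(n_samples):
        # Collect the prediction row for sample i from every checkpoint
        rows = [ckpt[i] for ckpt in preds_per_checkpoint]
        # Average each class column over the checkpoints
        averaged.append([sum(col) / n_ckpt for col in zip(*rows)])
    return averaged


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)


# Hypothetical predictions (2 samples, 2 classes) from three checkpoints
ckpt_a = [[0.9, 0.1], [0.4, 0.6]]
ckpt_b = [[0.7, 0.3], [0.6, 0.4]]
ckpt_c = [[0.8, 0.2], [0.2, 0.8]]

ensemble = average_checkpoints([ckpt_a, ckpt_b, ckpt_c])
labels = [argmax(row) for row in ensemble]
```

In this toy input the second checkpoint disagrees with the other two on the second sample, and averaging lets the majority view win. A small, self-contained function like this is also the kind of thing that is easy to show in an open-source repository.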

Innovative approaches for academic conference submissions

Tatiana: Make a nice write-up post about that. If you want to go one step further and consider it as a research paper, maybe assess it. You can even now ask an LLM as a judge. Go to flagship models, give your write-up, give your GitHub, and say, "Hey, do you think some of those approaches are innovative enough or good enough to publish a research paper?" See what it tells you, although it is actually prone to pleasing you. (43:25)

Alexey: I think you have to prompt it well to be critical and not to please you, because I heard stories on Reddit that an LLM convinced a person to publish something because the LLM said it was super innovative, but at the end it was not. (44:03)

Tatiana: You ask it to act more like it is criticizing you. (44:28)

Alexey: The point I was getting at is that I wanted to manage expectations a little. I do not think it is reasonable to expect that you will publish a paper from your first competition. In your first competition, you will fail miserably. In your second competition, maybe less miserably. Every time you take part in a competition, or in anything like it for that matter, the competition brings you problems. I am diverging a little bit, but since I am teaching and I see what happens in our courses, we require people to do projects, and for many course participants, it is very difficult to find a project. Here you go to a competition and you do not need to think about what to do; you just see that there are these five active competitions. You ask yourself, what do I like from these five, and you just choose, or what am I scared of, and you go for the one that looks scarier, and then you learn. It is not realistic to expect that after this first competition you will publish something. It probably takes four, five, or six competitions before this can happen. One competition is two to three months, so realistically it is a year of work at least. (44:32)

Tatiana: It depends. My first competition took me a week and was a bronze medal. My second competition took me two weeks and was my first silver medal. I never spent two months on a competition; I simply didn't get into that. Many people spend two months of almost full-time work on that, but not everybody can. One thing I wanted to point out: if you want an A-rated publication from a competition, and you are experienced enough to score well but you are not a good writer, the other way to go is challenges that offer co-authorship. Those are often hosted on AIcrowd. If you go to AIcrowd, you will see there are NIPS competitions, and some of them give co-authorship. If you do well enough, not necessarily winning, maybe in the top three or something (you need to read the rules carefully), they will include you as a co-author of the paper about the competition that will be submitted to an A* conference like NIPS. (45:53)

Tatiana: There are also competitions hosted by other A-rated conferences like CVPR, which I believe are hosted on Kaggle sometimes. They do not often offer monetary prizes, but they give you co-authorship. That is the way to get into an A-rated conference, which is more or less realistic after you get some experience in competitions. That is how I got published at NIPS too. (47:03)

Research challenges at NIPS and CVPR workshops

Alexey: I do not know if it was AIcrowd or something else, but it was a competition about click prediction and I was working at an adtech company. I was actually doing that at work and I just took a simple approach from another competition. It was funny because in this competition it was like 10 or 15 people competing. I just took an approach from my Kaggle competition that I did half a year or a year before that. I just took that approach, reimplemented it maybe in C++ or something, I do not remember, but I just took the approach that I had, tweaked it a little to be more performant, and I took first place. (47:21)

Tatiana: Congratulations. You got into the paper for NIPS, correct? (48:23)

Alexey: It was a causal inference and machine learning workshop or something. I wasn't the first author on a conference paper, but I submitted a report and then this report got cited, because what the people who organize these competitions want to do at the end, since they are researchers, is publish a paper. That is their goal, and I just happened to be a co-author of that paper. I cannot say this actually brought any opportunities to me, but it is kind of cool. (48:29)

Tatiana: Maybe you didn't need those opportunities, but now if you put in your resume that you had that NIPS publication, it will make your resume stand out compared to other candidates who do not have that line, especially if you want to apply for a research scientist position in industry. That NIPS participation can be quite a differentiator, and if you do it a couple of times, then it will be a differentiator. Those research organized competitions from NIPS, or for example the sound demixing challenge that we did not have when we spoke last time, it was organized by Sony, Mitsubishi, and Moises. They are actually companies. There was also somebody from Meta on the organizing committee. They were company organized, but still within Sony, it was the research department on music. (49:05)

Tatiana: When we got the first prize, we got quite high on the leaderboard. The best submission on the leaderboard for that track, I think, was ByteDance, which is TikTok, but they decided not to disclose their solution code. We were ready to disclose and show it. The organizers were writing two publications about the competition in quite highly rated Q1 journals, and they asked the top teams (us, of course, because we were the first prize winner, but not only us; I think it was the top five teams) to write a technical report. Then they adapted the technical reports of those five teams into their paper. You had to be in the top five to be in the paper, and it is actually two publications in good journals where you become a co-author automatically. That is how you can end up with publications even if you do not have good authoring skills, because what you write is basically a technical report that you can write without being an experienced academic. For CVPR, I was the first author, so I wrote it properly as I knew how to do that, but for that other competition, I did not even have to know how to write good papers to have those two publications. (50:08)

Alexey: It was the same for me for this NIPS competition. To be fair, as you were speaking, I was checking my GitHub and it took me two to three years in order to do this. I mean, from when I started looking at Kaggle to actually submitting this paper, it took a few years. It wasn't like my first competition, it was a journey. I already had some pipelines and some code that I could just reuse, so it was not the first time I saw these problems. That is why it was relatively simple for me, because I already participated in many competitions before. (51:32)

Medical imaging platforms and specialized recommendations

Tatiana: You were not doing that full time, correct? I know people, masters, who were competing full time and they would build almost a grandmaster portfolio, which is also possible if people work full time. Most of us are just mere mortals, we do not have this luxury of doing this full time because we have families to feed so we need to work. I see we have a few questions. One of the questions is: "What are some examples of other specialized competitions that you would recommend?" This question goes to our discussion of things outside of Kaggle. This Sony challenge was not on Kaggle, it was on AIcrowd. (52:51)

Tatiana: Check AIcrowd, that is a good source for research competitions. Also, there is one nice resource called Grand Challenges. If you go to Grand Challenges, academics put their competitions there, usually on medical imaging and other medical-related things, and that is how you can find less popular competitions. Let me see Grand Challenges, I am checking it right now. I have not gone to that website for a while, so now I think I opened something else, because there used to be challenges there on medical imaging mostly. There is grand-challenge.org, which is a platform for end-to-end competitions. This one is good because participation is quite small compared to Kaggle, and all the medical-related things are research based. (53:11)

Tatiana: Often after the competition they will host a workshop, and to participate in that workshop you do not have to be a winner. You just need to have a more or less decent score, and after that workshop they will publish proceedings. Those proceedings may not be from an A* conference, but they also have A* conferences there, and you can publish quite easily even if you do not win the competition. The options are AIcrowd, Grand Challenges, and there were MICCAI challenges. (54:30)

Alexey: What is MICCAI? (55:06)

Tatiana: It stands for Medical Image Computing and Computer Assisted Intervention, which is the name of a conference. They have challenges as well. Often, if you go to CVPR or other conferences, you may see challenges for that particular conference, and you can participate in those challenges. If they are not on Kaggle or a famous platform, you will not usually get 1,000 participants. It is easier to score well because there is less competition, and you do not have to be the winner to go to the final workshop, present, and have your publication. That can be the way into it. (55:12)

Alexey: Even if you do not publish anything at the end, you still learn something and this is what matters at the end. (56:03)

Tatiana: Yes. One thing I would say is that on AIcrowd, for example, you do not have that many fruitful discussions. For learning, I still would prefer Kaggle because people share more. There are notebooks and there are discussions. On the AIcrowd platform, there are discussions but they are mostly dead, you do not get that much knowledge from them. If you go into specialized, especially research based challenges, it gives you better chances to present your work at a conference to build your portfolio, but you are more isolated from the community. Do look out for posts because Kaggle sometimes hosts research competitions, not for money and not always featured, they do not even have medals as such, but they are hosted by researchers who then have a workshop, so you can consider those as well. Non-featured competitions on Kaggle do not have that fierce competition, and it is quite possible to score well and then present your work. (56:16)

Alexey: We have a very important question. Imagine somebody is listening to our conversation and thinks, okay, I understand now that competitions are cool, but I still feel lost when it comes to starting. What do I actually do now? What is the one thing you would recommend people who listen to this podcast do right now if they want to start taking advantage of these things? (57:16)

First submission strategies for beginners

Tatiana: If you are a complete beginner, start with Kaggle. That is the best for newbies because they do have all the resources and notebooks. Choose an active competition that is not going to end in a week, but is going to end in two months. Look at what domains you are more interested in and choose one of those. Go to notebooks and try to search for a starter notebook. Almost all competitions are guaranteed to have some starter notebooks. Look at the most highly rated, for example, or just search for "starter." Go into it, try to understand it, tweak a couple of things, read discussions, and see what ideas you can get from the discussions, or make an agent that would parse discussions daily and produce a summary of the most important insights. (57:46)

Tatiana: If you want to be a pro, ask Claude to implement those ideas from those discussions into that notebook. First, try to take that sample notebook all the way through to a submission. Submit, because I have seen so many people who study a lot of things but never end up submitting, and it is during the submission process that you need to understand how to run inference on Kaggle and so on. So just go to the end and make the first submission. It will be at the bottom of the leaderboard, but it is okay. You already submitted to a competition, congratulations, that is your first submission. That is the way to start. Then you can iterate, and as I said, use Claude and leverage AI if you want to find the most useful insights from discussions. Leverage the same AI if you want to implement them in your notebook, or if you want to learn more, do it the proper way. (58:34)

Tatiana: Build a pipeline from scratch or with Claude. Open your private GitHub repository and make a proper pipeline that you can use to iterate on your solutions. Do not use notebooks, although Kaggle may force you to use notebooks for submissions, so use them for submissions, not for iterations, and try to get into the silver zone. If you are in the silver zone and you feel a bit lonely, start pinging people about 10 points above you. Say, "Hey, why don't we team up? I am very committed to the competition." Look for people who are more experienced than you, they could be a master or even an expert. People are more open than you might be thinking. (59:24)
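
One building block of a "proper pipeline" for iterating on solutions can be sketched as follows. The sketch assumes that iterating means scoring each idea on local validation folds before spending a Kaggle submission on it. That assumption, the helper name `kfold_indices`, and the sizes used are illustrative and do not come from the conversation.

```python
import random
from typing import List, Tuple


def kfold_indices(
    n_samples: int, n_folds: int, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Split sample indices into (train, valid) pairs for k-fold validation."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    # A fixed seed keeps the folds identical between experiments,
    # so scores from different ideas stay comparable
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal groups
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        valid = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        splits.append((train, valid))
    return splits


# Hypothetical dataset of 10 samples split into 5 folds
splits = kfold_indices(n_samples=10, n_folds=5)
for fold_id, (train_idx, valid_idx) in enumerate(splits):
    # In a real pipeline: train on train_idx, score on valid_idx, log the score
    print(fold_id, len(train_idx), len(valid_idx))
```

Keeping a function like this in a repository rather than in a notebook means every experiment reuses the same folds, and the notebook is only needed for the final submission step.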

Asynchronous collaboration and competition team dynamics

Tatiana: In my second competition, I did not know anybody. I teamed up with one guy from Australia, and we never met in person. I just pinged him in private and said, "Hey, I see you submit a lot of analysis." I could see he was doing a lot of notebooks, so he was very invested and interested. I said, "Look, I am also already in the silver zone on the leaderboard. Why don't we team up? It might be more interesting." He agreed, and I organized a Slack channel. I called our team "Day Meets Night" because when it was night in Australia, it was day in Ireland. Crazy. Day Meets Night, and we worked asynchronously in that Slack. Although he was actually a night person, I think he was a 24 hour person, so he would be writing a lot of stuff and we learned from each other. You can do that. (1:00:56)

Alexey: I remember when I was competing, people out of nowhere would reach out to me saying, "Hey, let's team up," but they didn't do any submissions, so I had no information to check how good they are. When such a request comes from a person who is near me on the leaderboard, then they have proof that they are willing to put in some work. If it is a nobody, I would decline because you should first put in some work and then reach out. This is the approach you just recommended: do not expect people to team up with you when you didn't do anything and they already are in the silver zone, because why would they want to have a free rider? Exactly. There were cases when people teamed up and then the second person did not do anything. I never had it, but I have heard stories like that, heartbreaking disappointment and all that stuff. You make your first submission and bring it to silver at the start of the competition. That is easy. If the competition is ending in two months, I can guarantee that by looking through discussions and asking Claude to improve it, you will get to silver. (1:01:08)

Tatiana: It requires time and dedication, but not necessarily experience, so it is fine to be a novice here, but you just need to be willing to put effort into this, to put work into it. My first submission took me like a week because I did not know how to operate the Google Cloud Platform to run my calculations, how to create a submission, and how to submit it there. It took me a lot of time and I was asking some dummy questions in the discussion forum, but people answered. Ask dummy questions on discussions, do not hesitate. (1:02:28)

Alexey: That is another action point from our discussion. First, just go to Kaggle and pick up a competition. If it is a fresh competition, even better, so you do not have this pressure, because in a fresh competition, most people who seriously compete join in the last weeks. Now you have some room to actually breathe there, explore, and not worry about your leaderboard score. Second, get a starter notebook or look at popular notebooks that get you to a submission, and make your first submission. Leverage your AI tools, or do not leverage your AI tools, to tweak it and change it to get higher in the competition. Then when you are in the silver zone, try to team up with somebody who is more experienced than you, unless you prefer to play solo, which is fine as well. (1:03:05)

Alexey: Hey Tatiana, we took a bit more time, but it is always a pleasure to talk to you. Congrats on having another neural network, the third one. It was a really lovely discussion, so thanks a lot for joining us today. I had to sleep a bit less, but it was totally worth it to have this discussion. I see that even though it is an unusual time, quite a few people joined, so I want to thank you all for doing that. It is a really good way to start the day. (1:04:00)

Tatiana: Thank you so much for having me, and let's share resources later. (1:04:36)

Alexey: Yes. See you around, Tatiana, and everyone, have a great day. (1:04:42)

