

Trends in AI Infrastructure

Season 20, episode 1 of the DataTalks.Club podcast with Andrey Cheptsov

Did you like this episode? Check other episodes of the podcast, and register for new events.

Transcript

The transcripts are edited for clarity, sometimes with AI. If you notice any incorrect information, let us know.

Andrey's Career Journey: From JetBrains to dstack

Alexey: This week, we'll talk about AI infrastructure and everything related to it. We might touch on trends in AI infrastructure, but we'll see where the conversation goes. (0.0)

Alexey: We have a special guest today: Andrey, the founder and CEO of dstack, which is an open-source alternative to Kubernetes and SLURM. I'm not sure what SLURM is, but we'll probably talk about that. The idea behind dstack is to simplify the setup of AI infrastructure. Before dstack, Andrey worked at JetBrains for 10 years, helping different teams develop the best developer tools. Welcome, Andrey! (0.0)

Andrey: Thank you, Alexey, for the introduction and for inviting me. I'm excited to talk about infrastructure and everything related to it. (2:06)

Alexey: We've known each other for quite some time, so it was long overdue to invite you. Thanks for accepting the invitation. (2:18)

Alexey: As always, the questions for today's interview were prepared by Johanna Bayer. Thanks, Johanna, for your help. Let's start with the main topic: AI infrastructure. But before we dive into that, could you tell us about your career journey so far? (2:18)

Andrey: Sure. I started my professional career as a software engineer, though back then, I didn't even consider it a professional career—I just enjoyed coding. I even skipped school sometimes to work on coding problems. (2:46)

Alexey: You started coding in high school? (3:07)

Andrey: Yes, exactly. I miss those days when I coded just for fun. (3:09)

Alexey: Don’t we all miss those days? (3:17)

Andrey: After that, I switched to professional software development and worked in different companies. One of the companies I worked for was Devexperts, based in Saint Petersburg. They made professional tools for traders, like for options and stocks, and I really enjoyed it. (3:20)

Andrey: Then, I joined JetBrains, which was basically a dream job for me as a programmer. I know many others feel the same way about JetBrains tools. When I was invited to join, it was super exciting. (3:20)

Andrey: I spent 10 years there, working with different teams. I started with IntelliJ, then helped launch DataGrip, which is an IDE for working with databases. I also worked with other IDEs, including helping launch GoLand as a Go IDE. Eventually, I joined the PyCharm team as a product manager. (3:20)

Andrey: I was working on DataSpell, which is a dedicated IDE for data analysis and data science. It’s also part of PyCharm, and this is where I was first introduced to machine learning. That eventually led me to leave JetBrains and focus fully on this area. (3:20)

The Motivation Behind dstack

Alexey: How did you decide to focus on this technology? What made you choose this problem to solve full-time? (5:09)

Andrey: That's a great question. There are a lot of factors, and it's certainly a topic we could discuss at length. But if I were to summarize a few specific reasons: (5:27)

Andrey: I remember, during my interviews with various ML teams, I often asked, "What’s standing in the way of your team?" There were many challenges, but one thing that stood out to me was that there are still two main ways to do machine learning: on-premises and in the cloud. (5:27)

Andrey: Some teams couldn't use either because of cost. On-prem infrastructure requires a large upfront investment, and you need to ensure you're fully utilizing the hardware, which can be difficult to predict. This makes it a risky decision for many companies. (5:27)

Andrey: On the other hand, cloud infrastructure is expensive, especially for cutting-edge AI development. Many companies are concerned about the cost. (5:27)

Andrey: The more I worked with different teams, the more I realized that there were ways to work around these issues. Tools like Terraform, Kubernetes, and Docker have made cloud development much more predictable and less complex. But the problem with costs still remains. (5:27)

Andrey: That's why I thought there should be a solution that could greatly reduce the cost of ownership and simplify the process for everyone, which led me to start working on dstack. (5:27)

Challenges in Machine Learning Infrastructure

Alexey: Yes, there are existing tools for machine learning, like SageMaker, but as you mentioned, cost becomes a major issue. (8:25)

Andrey: SageMaker is a great counterexample. It’s one of the most mature MLOps platforms for working with AWS. However, it doesn’t address all the issues, which is why many people are hesitant to use cloud services in the first place. (8:35)

Alexey: I remember when I joined my previous company as a senior data scientist about six years ago. The first thing I wanted was a machine with GPUs, preferably with a couple of them. I wrote a proposal for this, and everyone approved it, but it never happened. The cost of ownership was too high—the machine had to be housed somewhere, and someone had to maintain it. (8:57)

Alexey: With the cloud, it’s often easier to just click a button, but then you get the bill a month later, and it’s much more expensive than expected. One of the first tasks I had as a senior data scientist was porting some code from SageMaker to Kubernetes. I guess that's how people used to manage it. (8:57)

Transitioning from Cloud to On-Prem Solutions

Andrey: Yes, and while many of these challenges are still relevant today, there are even bigger challenges ahead. The "ChatGPT moment" has introduced new issues, which makes AI infrastructure an even more important topic today. (10:00)

Alexey: How long have you been doing this? (10:27)

Andrey: It’s been about 2 years, maybe a bit more. (10:31)

Alexey: I think we've known each other for around 2 years, maybe a little longer. So, I’ve basically seen this start. (10:41)

Andrey: Yes, I was doing some experiments even while I was still at JetBrains. It became official when we did the first project with JetBrains. (10:50)

Alexey: So, the next question is: how did you begin working on AI infrastructure? But I think you've partly answered that already. You started at JetBrains, right? You saw things related to machine learning, realized there was a problem, and started working on it. (11:09)

Andrey: Yes, in a general sense. JetBrains is a very flat company where everyone has a dedicated role, but every role stays close to programming. Many marketing people at JetBrains were originally programmers, which is how I got into marketing and product management. Everything at JetBrains revolves around developer tools, so your full-time job is to think about the developer experience—gathering feedback from the community, passing it to the development team, and spreading the word. For me, transitioning from JetBrains to AI infrastructure wasn't a huge change. I was still working on developer tools, just focusing more specifically on AI infrastructure. My title might have changed, but not much else. (11:32)

Alexey: Well, now you have to figure out how to get funding yourself, right? Previously, it was different. (12:47)

Andrey: Especially working on open-source, yes. (12:53)

Alexey: Speaking of open-source, why did you decide to work in the open? I see many companies starting as closed-source but eventually moving to open-source. Why did you choose to follow this model and make all your code open from the beginning? (12:58)

Reflections on OpenAI's Evolution

Andrey: I think it’s a clear pattern. Many developer tools are open-source, and while we don’t always know the exact reasons, there’s a pattern to it. Of course, companies have commercial interests, and not many are fully nonprofit. OpenAI, for example, started as a nonprofit, but that changed. In the end, the model reflects the best way to achieve the company’s goals. Sometimes the goal changes, or the way to achieve it changes, and that’s fine. I wouldn’t argue about whether being commercial or nonprofit is better. We see that most companies eventually go commercial, and many leverage open-source to move forward. (13:29)

Andrey: When we started, we weren’t sure if open-source was the right decision. But the more we talked to teams using different tools, the more we learned. One of the main benefits of open-source, especially in infrastructure, is that it allows early adopters to give feedback, which helps improve the tool. Open-source is one of the best frameworks for this process. There are other ways, but for me, as someone who’s always worked with developer tools, it’s easier to communicate directly with developers and understand their problems. Open-source fits well because I can relate to the development team. (13:29)

Open Source vs Proprietary Models: A Balanced Perspective

Alexey: I don’t know the full story behind OpenAI either, but I think they initially released many things as open-source. GPT-2 was open-source, and they also released Whisper and CLIP. But when they released GPT-3, they realized it was a gold mine. They thought, maybe this is something we should keep closed, but then others started reproducing GPT-3 and matching its performance. Now, OpenAI releases something, and the open-source community tries to catch up. What’s your opinion on that? With closed-source solutions like OpenAI and GPT-3, which give great performance, versus open-source solutions, where you have many different models with various characteristics and use cases? (17:33)

Andrey: This is a big topic, and we could talk for hours about it. I’ve given talks on the comparison between closed-source and open-source models. It seems like many people are living in a bubble when they discuss which is better, proprietary or open-source. The world is much larger, and framing the question like that isn’t helpful. It’s not about what’s better; there are many factors to consider. It’s not even an important question to debate. (19:08)

Alexey: Okay, I’m just wondering where the industry is heading. (20:39)

Andrey: If you think about proprietary and open-source models, they represent two different types of businesses. Proprietary AI is a service, centralized with one company offering it. There are many engineers working on different aspects of that service, and AI, in this case, is not just a model—it’s a service that impacts all aspects of our lives. (20:43)

Monolithic vs. Decentralized AI businesses

Alexey: Right, so GPT is much more than just a model. (21:34)

Andrey: Yes, it’s more than just the model. It’s like a new version of Google. Google changed our lives because we can search for anything. Now, GPT is changing how we work, not just how we search. It’s a disruptive change. (21:37)

Andrey: Open-source AI is a different business. It’s decentralized, where various companies and stakeholders contribute to different aspects. For example, in banking, privacy and control are essential, and a monolithic AI model doesn’t fit. Open-source allows companies to protect their privacy and control. It’s also about competition. If businesses don’t protect their margins, they won’t be able to compete. Open-source AI is important because it allows companies to maintain their competitive edge. Whether open-source AI is better than GPT-3 doesn’t matter as much. It’s about the approach that fits the business model. (21:37)

Alexey: So, it depends on the use case. (23:45)

Andrey: Yes, the whole point here is decentralization. Open source is about decentralization, and it’s a mega trend that isn’t influenced by quality. The quality is a result of this mega trend. We see that open source models are better. It's not that people use open source models because they are better than GPT models; they are better because people need them. (23:51)

Alexey: But it's not that they are better in terms of performance; they are better in other aspects, like being able to control the data flow or hosting them yourself. (24:20)

Andrey: Yes, and that's when we can compare some aspects of models. But to me, it doesn’t make sense to compare them side by side. What makes sense for open source models is whether they are customizable, and that’s why we use them—they are easy to customize. What's even more important is that it’s becoming less of a "rocket science" process. The process of retraining and post-training is getting easier and simpler because of decentralization. Whether we like it or not, this trend is here, and we can’t influence it. (24:34)

Alexey: Do you know if big companies, like Meta, contribute a lot to the open source community, especially in AI, with models like LLaMA? Do they publicly share information on how exactly they train their models and what their AI infrastructure is? (25:36)

Andrey: I can't say for sure, but I'd say yes. Typically, this is shared in a technical report. Even OpenAI, while not open-sourcing most of what they do, still shares some information on how they contribute to the community. This started even earlier, for example, with the "Attention Is All You Need" paper from 2017. Even though the model itself wasn't open source, the paper describing the transformer architecture was public. And then Meta went further, for example, with the LLaMA model. They made the weights available with LLaMA 3.1. Some people argue whether "open source" applies to models, and some prefer to say "open weights." To be honest, I'm not that picky. (26:03)

Andrey: Even OpenAI shared a lot of details about their post-training and pre-training processes. The LLaMA team also released a technical report on how they trained their models, providing a lot of detail on the training process. This is also how the community learns. (26:03)

Andrey: This is a good example of decentralization—it’s not just about the morals, it’s about sharing technical details on how training and post-training were done, how many GPUs were used, and the architecture of the model. A lot of details are publicly available, and this is one of the best ways to learn. (26:03)

Andrey: Some might think technical reports are boring, but I personally encourage everyone to read them. They are actually very interesting. I’m a big fan of books, and while I used to read a lot of fiction, I now find reading fiction less exciting. I actually find technical reports much more entertaining. (26:03)

Alexey: Since you find these reports entertaining, I’m curious—what challenges do these companies face, and do these challenges also apply to smaller companies? Larger companies like Meta, Google, and OpenAI have different challenges from smaller ones, right? What are these challenges in general, and how do they affect trends in AI infrastructure? (29:06)

Andrey: I consider myself a generalist, so I’m interested not only in the technical side of things but also in other aspects. When you ask about challenges, there are technical challenges in every team, whether it's an infrastructure team, an AI team, or a data team. (29:44)

Challenges in training large AI models: GPUs and distributed systems

Alexey: Since we’re talking about AI infrastructure, let’s focus on that. To train a model, we need thousands of GPUs. How do we get them in the first place? How do we coordinate this? These are all questions we need to consider when starting such a project. (30:16)

Andrey: Yes, we need to think not only about technical problems but also financial ones. We know for sure that without GPUs, we can’t train models. For example, LLaMA 3.1 was trained using 16,000 GPUs. To put that into perspective, Meta likely has an order of magnitude more GPUs in total. If they used only a small fraction of what they have, they could train LLaMA 3.1. (30:35)

Andrey: The first challenge is infrastructure. However, I want to mention that while GPUs and money are important, they are not everything. There are other factors to consider. For instance, DeepSeek recently released their V3 model in December. They used only a small fraction of what Meta used to train LLaMA 3.1, yet they trained a model that is significantly better in terms of benchmarks. (30:35)

Alexey: I’ve seen posts about this in my social media feeds just a few days ago. (31:57)

Andrey: Yes, DeepSeek's release happened at the end of December. They trained a model with a fraction of Meta's resources, and the model performed significantly better in benchmarks compared to LLaMA 3.1. What I'm trying to say is that while GPUs and money are important, they are not the only factors. (32:06)

Andrey: But back to your question—pre-training is a large-scale task involving many GPUs, and it's distributed. Distributed training is challenging. If you speak to people working in this area, they will tell you how complex the process is. When you have tons of GPUs, you need to coordinate everything. If something goes wrong on one of the nodes, you have to deal with it. (32:26)

Alexey: The more GPUs you have, the higher the chances something will go wrong, right? (33:59)

Andrey: Yes, and you need to manage that at scale. There are many other issues to address, but handling this one is a major challenge. (34:05)

Alexey: Do you know what this actually looks like? There’s probably a computer with 4 or 8 GPUs, another one with 4 or 8 GPUs, and all these computers are part of a network. You need to distribute the training process across all these machines, where each computer has several GPUs, and each GPU needs to compute something and then send the weights or gradients back to a central location. Is this roughly how it works? (34:13)

Andrey: Yes, but like any complex problem, it can be split into smaller tasks and solved at different levels of abstraction. Generally speaking, there’s PyTorch, a framework mostly developed by Meta, designed for training. Distributed training is one of its main use cases. (34:46)
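To make the picture above a bit more concrete, here is a minimal data-parallel sketch with PyTorch's DistributedDataParallel (not from the episode; the tiny linear model and random batches are placeholders). Each process drives one GPU, and PyTorch averages the gradients across all processes during the backward pass.

```python
# Minimal data-parallel training sketch with PyTorch DistributedDataParallel (DDP).
# Launch one process per GPU, for example:
#   torchrun --nnodes=2 --nproc-per-node=8 ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # NCCL moves gradients between GPUs
    local_rank = int(os.environ["LOCAL_RANK"])   # which GPU on this node this process owns
    torch.cuda.set_device(local_rank)

    # Toy model standing in for a real network; every process holds an identical copy.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):
        # Each process trains on its own shard of the data (random tensors stand in here).
        x = torch.randn(8, 1024, device=f"cuda:{local_rank}")
        loss = model(x).pow(2).mean()
        loss.backward()                          # DDP all-reduces (averages) gradients here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```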

Alexey: So, this is what is used for models like LLaMA, right? It’s based on PyTorch? (35:25)

Andrey: Yes, but I'm basing my answer on the latest LLaMA training report. Earlier versions may have been trained with something else; I don't know whether PyTorch was used for those. (35:33)

Alexey: When we download models from Hugging Face Hub, we use the Transformers package, right? That’s based on PyTorch? (35:54)

Andrey: Yes, but the important point here isn't that it's PyTorch specifically. It's just that many people, including Meta, are using it. Google, on the other hand, isn't using PyTorch, though I'm not sure about all the specifics of their model training process. PyTorch is only the top of the stack—beneath it, there's a backend responsible for communication between the GPUs. One popular backend is called NCCL, which handles communication for distributed training. When training frontier models, this backend can be a major source of frustration, leading teams to sometimes reimplement it from scratch to optimize the process. (36:03)
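As a side note (not from the episode), this is roughly where NCCL shows up in practice: PyTorch selects it as the communication backend, and a handful of environment variables that NCCL reads at startup control how it behaves and how verbosely it logs, which is usually the first knob teams turn when multi-node communication misbehaves. The network interface name below is just an example.

```python
# Where NCCL appears in a PyTorch job: chosen as the backend of the process group,
# and tuned or debugged through environment variables that NCCL reads at startup.
import os
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")          # log ring/transport setup and errors
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # network interface to use (example name)

# torchrun provides RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for this call.
dist.init_process_group(backend="nccl")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} is up")
dist.destroy_process_group()
```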

DeepSeek's efficient training approach vs. brute force methods

Alexey: To summarize, there's a trend in training large language models: earlier, it was mostly about using raw computing power, like Meta with its massive number of GPUs. They would throw a problem at a fraction of these GPUs, and they'd process it. But not every company has the resources that Meta or Google do. Smaller companies like DeepSeek are now focusing on being smarter with their resources, rather than relying on brute force. We see the trend shifting toward optimization—how can companies train models without access to massive GPU clusters? (37:35)

Alexey: And then there’s the case where many companies aren’t training models but just need to use them. If I need a model and don’t have a specific use case, I could take an existing model and fine-tune it—or maybe I don’t need to fine-tune it at all. For many companies, especially those not AI-first, the challenges are different. They’re more focused on fine-tuning and serving the model. What do you think the challenges are for these companies that are just using models rather than training them? (37:35)

Challenges for small and medium businesses: hosting and fine-tuning models

Andrey: Correct, although I’d be cautious about labeling companies as small or medium. I think it’s more about whether a company is AI-first or not. Once you figure that out, everything becomes much clearer. If a company is AI-first, they’re likely to customize models to optimize performance, and the choice between pre-training or fine-tuning depends on their resources. If a company isn’t AI-first, like a large bank focused on privacy, they might not need to dive deeply into AI but still want to use it in their services. Some banks may even decide to become AI-first, just like how many banks once went software-first or mobile-first. (39:30)

Alexey: Becoming AI-first requires a complete shift. To do that, you’d need to rethink the entire structure of the company. (40:43)

Andrey: Yes, some companies, like certain banks, may decide to become AI-first. They’d need to define what that means for them. For example, you might have banks that are not AI-first, but then new AI-focused banks emerge. But to get back to your main point, the challenge for most companies is that they don’t need to train models from scratch. Instead, they focus on customization and fine-tuning, especially in industries like finance. They might not have the resources for extensive training, but they’ll want to leverage AI in their operations, and fine-tuning helps with that. Open-source solutions are making this process easier, enabling companies to use existing models and tools rather than building everything from the ground up. Even AI-first companies will use these open-source tools to optimize development and efficiency. (40:50)

Alexey: So, for these companies, the majority are non-AI-first companies, looking for existing solutions rather than building everything themselves? (43:52)

Andrey: Interestingly, those companies that initially built their own models have started switching to existing solutions. They realized that maintaining their custom-built solutions wasn’t sustainable, and the open-source alternatives are more efficient. (44:12)

Alexey: Exactly, they try to implement something themselves, but then find that other solutions can solve the problem more effectively. Once they realize that, they prefer to adopt a pre-built solution instead of maintaining their own. (44:22)

Andrey: Yes, absolutely. Everyone is looking for ways to improve what they already do, but with less effort. But when it comes to AI tools, it's not just about being AI-first or not. For example, take Kubernetes. We see it as a foundational platform, not just for AI-first or cloud-first companies, but for everyone. It’s universal and adaptable to various use cases, whether expert-level or beginner-level. It’s about reducing the cost of ownership by investing in one universal tool. This approach is challenging to implement, but it's what we strive for—making it flexible and accessible for all types of use cases. (44:33)

Alexey: Okay, but if we already have Kubernetes, why do we need another universal tool? I remember back when Kubernetes was mentioned, I was intimidated. I didn’t want to go near it, but once I understood how it worked, it turned out to be much easier. The main challenge is that not every company has the team to manage Kubernetes. (46:28)

Managing Kubernetes challenges for AI teams

Andrey: Yes, exactly. This is one of the topics that requires more discussion. But at a high level, we’re focused on teams that are constrained by Kubernetes or other orchestration tools. Many teams experience specific pain points when working with Kubernetes, especially for AI use cases. Kubernetes wasn’t designed with AI in mind—it’s optimized for containers, but AI workflows often involve more complex processes like training. For instance, AI engineers don’t typically think in terms of containers; they think in terms of nodes and GPUs. This is why specialized tools like SLURM exist. It simplifies the process, and engineers prefer it because it’s tailored for AI workflows. This is one of the challenges we’re trying to solve: rethinking container orchestration for AI. (47:16)

Alexey: So, the problem with Kubernetes is that it’s not optimized for AI use cases? (47:44)

Andrey: Exactly. Kubernetes is great for general container orchestration, but it doesn’t handle the specific needs of AI workflows. That's why we believe it’s time to rethink how container orchestration works for AI. We’re trying to create a tool that addresses these specific needs while remaining flexible and universal. (47:48)

Alexey: As a software engineer, should I stay away from Kubernetes, or is it still a good tool to have in my toolkit? (50:59)

Andrey: It's the only tool when it comes to deployment. (51:10)

Andrey: It is the only one. Regardless of whether you use AWS, Azure, or even Alibaba Cloud, you end up using Kubernetes. (51:16)

Alexey: I personally don't use it, but the projects I work on are smaller. I don't want to pay for a large Kubernetes cluster, but for companies with more than one person, it probably makes sense. (51:31)

Andrey: Yes, there are edge cases, but I’m speaking in more general terms. (51:46)

Alexey: Here's a question: Do you think the future will be a hybrid of bare metal and cloud, or will it be cloud-only? (51:56)

Hybrid vs. cloud-only infrastructure

Andrey: Predicting the future is not easy, and some people enjoy making these predictions. (52:06)

Alexey: If we extrapolate current trends, though... (52:17)

Andrey: Cloud is the trend, and it’s the only trend. (52:21)

Alexey: Is it because people don't want to have a GPU machine under their desk? (52:32)

Andrey: I don't know, 16,000 of them... It's more predictable for enterprises to use cloud. On the other hand, AI is a wildcard here. Nobody really knows what will happen. Many companies are currently investing in on-prem solutions due to AI. We're seeing this trend, especially in AI-related fields. But I'm not an expert in that area. Personally, I prefer not to use the term "on-prem." It's a confusing term. For example, you could have your own rack in your building and call it on-prem, or you could call it a data center. But when you refer to it as a data center, it could also be called a cloud. There are many difficulties with these terms. That's why, when I talk about it myself, I avoid saying "on-prem" or "cloud" and just refer to them as different versions of cloud. (52:38)

Alexey: So, what I think... (54:28)

Andrey: Yeah. (54:28)

Alexey: When I think about on-prem, particularly for data teams, data science teams, and ML teams, I recall my first company in Germany. We had a machine with GPUs, and everyone had access to it. We would SSH into the machine, but then we had to coordinate GPU usage. If someone was using a GPU, others had to wait. In the end, it became a nightmare to coordinate. That’s what comes to mind when I think of on-prem GPU machines and the challenges involved. (54:31)

Andrey: Yes, you're right. On-prem means dealing with a lot of challenges yourself—maintaining the servers, managing updates, and orchestrating everything. With the cloud, much of this is handled as a service, but with on-prem, it's your responsibility. (55:22)

Alexey: It’s kind of like on-prem when you rent a machine from a remote provider but still have SSH access. For example, there's a provider in Germany called Hetzner where you can rent a machine with GPUs or powerful CPUs. You get SSH access to that machine, but it’s still remote. You’re still dealing with the same challenges we mentioned before. If you have a team of ten people and just a few machines, coordinating GPU usage for training models becomes a hassle. That’s another form of on-prem, right? (55:52)
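For what it's worth (this isn't from the episode), the manual coordination described above often boils down to small scripts like the following: before launching a job on a shared box, check which GPUs look free and pin your process to them. This sketch uses the nvidia-ml-py (pynvml) bindings; treating "no running compute processes" as "free" is a simplifying assumption.

```python
# Ad-hoc GPU coordination on a shared SSH machine: find GPUs nobody is using
# and restrict this process to them via CUDA_VISIBLE_DEVICES.
import os
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
free_gpus = []
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    if not procs:  # no compute processes running -> assume the GPU is free
        free_gpus.append(i)
pynvml.nvmlShutdown()

if not free_gpus:
    raise SystemExit("All GPUs are busy - time to coordinate with the team.")

# Make only the free GPUs visible to this process before any CUDA work starts.
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in free_gpus)
print(f"Using GPUs: {os.environ['CUDA_VISIBLE_DEVICES']}")
```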

On-premise vs. bare-metal solutions

Andrey: Yes, that’s one way to look at it. (56:46)

Alexey: Bare metal, right? (56:51)

Andrey: Yes, bare metal as a service is another option. Some companies offer bare metal as a service, where they handle the provisioning and firmware updates for you. But if you want to run a service yourself across multiple bare metal providers, you'll need to automate the process and ensure everything stays up to date. It's a shared responsibility between the provider and yourself. (56:53)

Alexey: So, the best example of on-prem is a GPU machine under my desk? (57:48)

Andrey: By the way, we didn’t talk about edge computing. We can discuss that briefly if we have time. (57:53)

Exploring edge computing and its challenges

Alexey: We have time for another 5-10 minutes. (58:04)

Andrey: One last topic: edge computing and how it differs from cloud computing. (58:07)

Alexey: What exactly is edge computing? Is it similar to devices like this one? (58:18)

Andrey: Well, to be completely fair, there might be experts who will correct me, but based on what I know, there’s no universal agreement on what exactly constitutes edge computing. I’d even say there’s a lot of confusion around it. (58:23)

Alexey: Most people might associate it with something like a Raspberry Pi, right? Or perhaps something else? (58:47)

Andrey: Yes, edge computing can refer to any customer-facing or site-facing device. It’s essentially any device that’s not in the cloud, but some people even consider certain cloud services as part of edge computing. For instance, edge AI is often mentioned by cloud companies, but it's essentially just normal cloud computing with localized compute capabilities. Using AWS, for example, could be considered edge computing by that logic, because you have AI in your region. On the other hand, edge can also refer to devices like mobile phones, laptops, smart home devices, video cameras, or drones flying around. (58:52)

Alexey: Hopefully, nothing is flying around right now! (1:00:00)

Andrey: Maybe not, but the point is that edge refers to remote devices. These devices often require smaller models because it’s difficult to run large-scale models on them. (1:00:04)

Alexey: I think there are companies doing federated learning, right? For example, if there's a customer-facing device like a drone or a probe, you can't send all the data somewhere for training. Instead, the training happens on the device, and then the results are centralized later. Is this something that applies to LLMs or AI in general? I know it’s used in manufacturing setups. (1:00:30)

Andrey: Well, many people might disagree with me on this, but I would say that federated learning is a very niche use case. It's often debated, kind of like the discussion around 5G versus cloud computing. It's basically a distributed computing topic. Today, we call it distributed compute rather than federated learning. While it does share similarities with blockchain and decentralization, it can sometimes feel like a religious belief in the tech world. There are evangelists who promote the idea that everything should be decentralized, but even with blockchain, we’re not there yet. We see some things moving in that direction, but not everything. (1:01:04)

Alexey: Yeah, I’m not really following that space, but it’s interesting. Well, maybe just one more question for you. (1:02:35)

Andrey: Sure, sure. Closing this topic down, distributed computing is a big area, and there are a lot of experts who really believe in it. (1:02:42)

Alexey: So, last question for you. You mentioned you like science fiction. What’s your favorite book? (1:02:51)

Andrey: That’s one of the toughest questions. It’s much easier to talk about challenges in distributed training than to pick just one book! (1:02:59)

Alexey: Well, pick one if you can! (1:03:10)

Andrey: Alright, if we’re talking science fiction, I’d definitely say The Three-Body Problem. (1:03:11)

Alexey: Yeah, I’ve heard of it. (1:03:28)

Andrey: It’s by Liu Cixin, a Chinese author. I’m probably pronouncing it wrong, but he’s well-known in the sci-fi world. Even if you're not into science fiction, this book is widely recognized. (1:03:34)

Alexey: I haven’t heard of him. I’m not really into science fiction, but I did read Ringworld a year ago, which was interesting. I’m looking to expand my reading, though. (1:04:02)

Andrey: I totally recommend The Three-Body Problem. It’s actually a trilogy, not just one book. The first book is named after the three-body problem in physics, which refers to a mathematical challenge of predicting the motion of three celestial bodies, like three suns. (1:04:16)

Alexey: I’m looking up the article on Euler’s Three-Body Problem now. It’s a difficult problem in physics and astronomy. (1:05:05)

Andrey: Yes, the book goes beyond math, though, and explores philosophy, politics, and existential problems. It’s a great read for anyone looking to kill some time. (1:05:20)

Alexey: Sounds interesting! Thanks a lot, Andrey. We only touched on a fraction of the topics we wanted to discuss today, which is no surprise, given how much we wanted to cover. But it was great talking with you. Thanks for accepting the invite, and I really enjoyed our conversation. I’m looking forward to working more with you in the future. (1:05:38)

Andrey: Thank you! It was a pleasure, and I look forward to it as well. (1:06:04)


