The Future of AI Agents | Aditya Gautam
Transcript
The transcripts are edited for clarity, sometimes with AI. If you notice any incorrect information, let us know.
Aditya’s journey from embedded systems to AI
Alexey: Hi everyone, welcome to our event. This event is brought to you by Data Talks Club, which is a community of people who love data. We have weekly events, and today is one such event. If you want to find out more about the events we have, there is a link in the description. (0:00)
Alexey: Click on this and check it out. You will see all the other events we have in our pipeline. Do not forget to subscribe to our YouTube channel. Please keep subscribing because we are happy about every single one of you joining our Data Talks Club community. (0:18)
Alexey: This way you will stay up to date with all the future streams we have. We also have an amazing Slack community where you can hang out with other data enthusiasts. The link is in the description, so check it out and join our Slack. During today's interview, you can ask any question you want. (0:35)
Alexey: There is a pinned link in the live chat. Click on that link, ask your questions, and we will be covering these questions during the interview. I already see there was one question in the live chat. Please use the Slido link for that. (1:00)
Alexey: Right now I am going to open the questions that we prepared and we can start. This week we are joined by Aditya, an AI researcher and engineer whose career spans Google, Meta, and the broader AI research community. Aditya has worked on large language models, recommendation systems, and integrity at scale. (1:11)
Alexey: He is a frequent speaker at industry and academic conferences, contributing to discussions on AI agents, the economics of LLMs, and responsible deployment. It is my big pleasure to have you here. You have quite a career. I think we have a lot of things to talk about today. (1:49)
Alexey: Thanks for joining us today. (2:03)
Aditya: Thanks, Alexey. I am a big fan of your platform. It provides such a pragmatic understanding and hands on experience of actual ML data pipelines, which is very relevant. It is amazing to be on this podcast. (2:08)
Alexey: Thank you for your kind words. Can you tell us about your career journey so far? I briefly outlined what you did, but I am curious how you ended up where you are right now. What led you to the current state of your career? (2:21)
Aditya: Totally. In my previous life, I was an embedded engineer working at Qualcomm back in India, around 2011 to 2014. I wanted to do something exciting, and that is how I ended up at Carnegie Mellon to explore new paths. That is where I first stumbled across Tom Mitchell's machine learning lecture. (2:41)
Aditya: It was very fascinating to understand how the Naive Bayes algorithm is used for classification of emails. (3:05)
Alexey: Did you see these lectures live at university? (3:12)
Aditya: Yes. It is from Tom Mitchell, who is a well known person in machine learning. I went to one initial lecture to explore the class. At that time, there was a little bit of buzz about data and machine learning, but it was not fully well known yet. (3:23)
Aditya: I thought it was amazing that you literally can use probability to find whether an email is spam or not. There was no looking back from there onwards. I took all the courses, and after graduation, I worked at Zillow for about a year on estimate algorithms. Then I worked at Google for three and a half years. (3:42)
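(A quick illustration of the idea Aditya describes: classifying email with nothing but word probabilities. This is a minimal sketch using scikit-learn with toy data, not anything from the course itself.)

```python
# Minimal Naive Bayes spam filter, in the spirit of the classic example.
# The training data here is a toy set for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win a free prize now", "cheap meds limited offer",
    "meeting agenda for tomorrow", "lunch at noon?",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

model = MultinomialNB()
model.fit(X, labels)

# P(spam | words) via Bayes' rule, which is all the classifier does.
test = vectorizer.transform(["free prize meeting"])
print(model.predict_proba(test))  # [[P(ham), P(spam)]]
```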
Aditya: Google was very exciting and an amazing experience for me. That era was when we actually entered the attention and Transformer timeframe, starting in 2017. We were starting with BERT base, and Google was doing a lot of stuff on that. (4:07)
Aditya: Then I worked with a small startup for a short while and moved to Meta. I have been working on Reels for quite a long time on different interesting problems of recommendation ranking. I studied how generative models are changing the standard setup of discriminative and two tower models. (4:26)
Aditya: It is a really interesting space, how this is evolving. The last three years have just been a blast. We all know that we are lucky to be alive in this era where we are seeing this AI revolution taking place. Lately, I have been involved a lot in understanding beyond the technical aspect. (4:49)
Aditya: As an ML engineer and researcher, I have done a lot of hands on work. Now I am understanding the bigger picture. What is the product impact? What is the economic impact? How is this going to change different verticals? (5:06)
Aditya: How will this be impacting human psychology and life at a country level, a GDP level, and a geopolitical level? What is the importance of AI? Why is everyone buzzing around it, all the way from policy makers to VCs? It is a very evolving space and an interesting time to be alive. (5:18)
Aditya: This is especially true for people who are in the AI world or who are transitioning to the AI world through practical means. (5:49)
Alexey: I have mixed feelings about what you mentioned regarding investors. For them, everything with AI at the end is like shut up and take my money. On one hand, this is cool, and we are in this space. On the other hand, I see many good companies doing traditional things like MLOps platforms that are not really getting funding. (5:56)
Alexey: They have to either go bankrupt or pivot to doing AI. All the investment money is now in the AI space. Now, if you want to do MLOps, you have to rebrand yourself as an LLM ops platform. Otherwise, you are not getting funding, which sucks. (6:14)
Alexey: As practitioners, we still need traditional tools. LLMs are cool, but we still need things like simple logistic regression to predict whether a message is spam or not. At scale, you are not going to use AI for that. You are not going to use LLMs for that. Let us see how the industry evolves. (6:39)
Aditya: I think that is a very interesting observation. This would be the same globally. If you are not doing anything in the generative AI space and you are using traditional machine learning where you actually do not need generative AI, you are not getting enough attention. It is in a bit of a FOMO phase. (6:57)
Alexey: I am not sure if it is Europe or global. At Data Talks Club, we always look for sponsors. Usually, this represents who has money. If a company has a marketing budget to sponsor our community, it means they are getting money from investors or have a good customer base. (7:21)
Alexey: From what I see, a few years ago it was MLOps, but after 2023, things have changed. That is a very exciting time to live in. For me, all these tools like Claude Code and other coding assistants are changing how we program. It is dramatic. (7:48)
Alexey: It is almost too dramatic for us to adapt to this changing environment. The moment we do something, it becomes obsolete or already automated. I am curious about your background. What I saw in your bio is that you are also doing research and are a frequent speaker at conferences. (8:14)
Alexey: I was wondering how you managed to work on industry things while contributing to research. Was it a part of your responsibilities at work or were you doing this in addition? (8:39)
Enterprise AI research and adoption gaps
Aditya: I think this is an independent thing. The last two years have been really exciting. I did not want to miss out on what is happening outside in the world, so I wanted to be a part of it. That is where I was doing research on a practical level. This is not theoretical research. (8:52)
Aditya: I studied how we build multi agent systems in a sophisticated way to solve very real problems. I was very curious about how the startup world, MLOps tools, and generative AI are evolving. I looked at how different verticals like healthcare, marketing, and sales are adopting generative AI. (9:18)
Aditya: I am following how the parallel space is evolving, as well as the underlying infrastructure GPU layer and data centers. I was keeping track of all this while speaking about many of these things. One of the fun parts was talking to investors and startup founders. (9:42)
Aditya: It gives a very real understanding of how this space is evolving. It is just out of interest and it is actually very fun. (10:11)
Alexey: So what do you actually do now? (10:18)
Aditya: I am working at Meta. (10:24)
Alexey: So you do this research outside of your working responsibilities? That is pretty interesting. Can you tell us more about these multi agent systems? What are you researching? (10:24)
Aditya: In general, I am working on different aspects of multi agent systems. I cannot explicitly tell you what I am doing inside Meta. (10:50)
Alexey: That is understandable, but you are probably doing things outside of work that would be interesting to know. (11:02)
Aditya: On the agentic side, I am trying to understand the different adoption challenges enterprises are having. What is the gap when we say we want an agent to improve the efficiency of an organization or improve the workflow? What are the problems they face in practical deployment? (11:07)
Aditya: How are they adapting to these things? What tool infrastructure and consultancy do they need? I am working on understanding this space, especially for people in a legacy framework. I recently went to Silicon Slopes and talked to people who have run investment firms for the last twenty years. (11:31)
Aditya: They do not really understand AI well. They wonder how to adopt it and incorporate tools like Claude into the investment world. They want to improve their workflow. Analysis that used to take three to five days for a stock, they now want to do in a couple of hours. (11:55)
Aditya: They have legacy infrastructure and many third party tools. The first thing is understanding how to use AI, then integrating it, and finally monitoring it. I want to see how these small businesses in different industries are actually adopting AI. What problems are they facing? (12:22)
Aditya: I talked to at least twenty people and small companies at a conference last week. It is amazing to see the common themes. There is a lot of confusion at the same time. They want to do it but do not know how to do it in the best possible manner. (12:45)
Alexey: Did you have a chance to talk to anyone working in the legal domain? (13:08)
AI reliability in legal and healthcare
Aditya: Not really, but that is an area I am super interested in. Companies like Harvey have come up in the US as a legal vertical for lawyers. I was discussing with some VCs how Harvey is being adopted. They gave me a very fresh perspective. (13:13)
Aditya: Many lawyers have Harvey, but they are still going to normal chatbots. Harvey is a company for legal AI, helping lawyers understand documents and workflows. They are trying to penetrate the legal vertical and must have super fine tuned models for legal purposes. (13:45)
Aditya: Some VCs told me they observe that lawyers initially used it but are now going back to general chatbots. Personally, I feel Gemini is amazing with legal or financial queries. These are sensitive industries where you need almost zero hallucination. (14:17)
Aditya: Two years ago, these vertical tools were doing really well, but now users are accustomed to general chatbots, which are getting better. It is a very interesting dynamic. (14:48)
Alexey: I know somebody who works in this area and they have a lot of documents. Not all of them are digitized. Some are online and easy to find, but past cases often are not easily accessible on the internet. People have these books or printed out things they go through to find a case. (15:06)
Alexey: I thought this was just a RAG problem. You scan these things, index them, and solve it with RAG. Meanwhile, people suffer because they have to go through the manual process. I am thinking about industries operating under legacy frameworks, where the solution looks like it is right on the surface, but implementation takes ages. (15:47)
Aditya: It is easier said than done. There has to be a very sophisticated data workflow to make sure OCR extraction from those images is happening in the right way. Knowledge databases and vector databases must be well set up. This is a field where you cannot mess up. (16:22)
Aditya: Healthcare and legal are sensitive industries, so they have to be handcrafted even with automation tools. They need a human in the loop to make sure everything is correct. Google just came up with a complete document extraction service a couple of days ago. (16:41)
Aditya: You provide your document or image, and it does everything including indexing and retrieval. You can just access those things with an API. OCR, knowledge databases, and vector databases are all hidden and available to you. That is a pretty relevant use case for the legal field. (17:08)
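(To make the pipeline described here concrete, below is a minimal sketch of the ingestion side: OCR a scanned page, chunk the text, embed it, and index it for retrieval. The library choices, pytesseract, sentence-transformers, and faiss, are illustrative assumptions, not what Google's service or any legal vendor actually uses.)

```python
# Sketch of a scanned-document RAG ingestion pipeline: OCR -> chunk -> embed -> index.
# A production legal system would add OCR quality checks and human review,
# as discussed above.
import faiss
import pytesseract
from PIL import Image
from sentence_transformers import SentenceTransformer

def ocr_page(path: str) -> str:
    # Extract raw text from a scanned page image.
    return pytesseract.image_to_string(Image.open(path))

def chunk(text: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; real systems split on structure (sections, clauses).
    return [text[i:i + size] for i in range(0, len(text), size)]

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = chunk(ocr_page("case_scan.png"))  # hypothetical input file
embeddings = model.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])  # cosine similarity on normalized vectors
index.add(embeddings)

# Retrieval: embed the query and return the top-k most similar chunks.
query = model.encode(["precedent for contract rescission"], normalize_embeddings=True)
scores, ids = index.search(query, 3)
print([chunks[i] for i in ids[0]])
```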
Alexey: I heard about a problem where a company producing scanners had special post processing. When they processed a document, they used computer vision. It turned out there was a bug in the software that replaced one single letter in a legal document. They replaced a U with an O or something similar. (17:33)
Alexey: They were scanning these things and throwing the original documents away. (18:23)
Aditya: You are throwing the ground truth away. (18:31)
Alexey: There was this mistake, which meant all other scanned documents could potentially have mistakes. They sued the scanning company. Imagine the scrutiny they require. Now we talk about RAG. (18:31)
Aditya: These are sensitive industries. You cannot afford to mess up. About a year ago, an airline used a chatbot for a refund, something very weird happened, and it turned into a legal case. Reliability is super important, but it is not stressed enough in today's world. (18:51)
Specialized models and agent governance
Alexey: We are talking about the future of AI agents. GitHub was down yesterday. I checked their status page and the only thing down was Copilot. This made me wonder whether Copilot was responsible for the entire GitHub website being down. (19:16)
Alexey: We live in an interesting world. Maybe Copilot was responsible for this outage. We will find out. (20:00)
Aditya: I have not read about it, but this is interesting. Ideally, Copilot and other services are independent microservices. They would have a Copilot API at the back end. It is hard to imagine a service that should be independent from the GitHub infrastructure bringing the entire site down. (20:06)
Aditya: At a Microsoft scale, they are a dominant player in reliability and cloud. I would be very surprised if it is just a Copilot effect. (20:39)
Alexey: I read a report about a different outage that did not make sense to me. There was such a tiny thing that cascaded throughout the entire system and led to a massive outage. It could be the same here. (20:58)
Aditya: If Copilot is not working, you would want only that functionality to be down. I am going to read about it. It is an interesting use case for the reliability of the agentic world. (21:16)
Alexey: Since we talk about the future of AI, the interesting thing for me in multi agent systems is how they interact. What is the current state of the art? (21:42)
Aditya: We are seeing a shift away from just text based interactions. Infrastructure tooling is getting better. There is a multimodality shift, where VLMs and other multimodal elements are coming up. Visual capabilities are getting much better than a year ago. (21:57)
Aditya: There is also a lot of infrastructure tooling and reliability services. Many startups are making this whole system more practical for deployment with high reliability. AI governance is one of those things. (22:38)
Aditya: We need the functionality to audit everything. Which agent is interacting with which one? How well are they able to follow policies and guidelines? Is there any collusion between agents that is not supposed to happen? (22:57)
Aditya: For example, one agent could pass information to a third agent which accumulates insights it is not supposed to have at a policy level. Infrastructure is maturing to handle this. There is also a trend where there is no need for a general purpose LLM for the agentic world. (23:16)
Aditya: You develop an agent for a specific purpose in finance or marketing. Rather than paying high costs and having high latency with a general purpose LLM API, enterprises understand they need to fine tune models. They bring them to a smaller scale to save cost and ensure high ROI. (24:02)
LLM economics: Fine-tuning vs. API ROI
Aditya: For a toy problem, you can do anything. For multi million calls on a daily basis in a large enterprise, you need to move to smaller scale models. These can solve the problem as efficiently as large models but are fine tuned in a specific way. That means lower cost, lower latency, and better specialization. (24:58)
Alexey: I am curious about fine tuning. It is a hot topic, but when I talk to practitioners, most of them are just doing API calls to Anthropic or OpenAI. Since you speak to many people, do you have any numbers or percentages on how many actually do this? (25:12)
Alexey: If I open Twitter or Reddit, everyone is fine tuning. Reality could be far from that. Most people would just send data to OpenAI, perhaps using a mini model. (25:24)
Aditya: What you said is accurate to a good extent. If you are doing something small or generic, you can go with standard API providers. To fine tune, you need ML engineers and the right expertise in your organization. If the number of calls and the budget are high, say tens of millions estimated for the next year, it makes sense. (25:43)
Aditya: It is only for companies that have the expertise and where the call volume is extremely high. If you are talking about thousands of calls, it is overkill. (26:38)
Alexey: It does not apply to every company. If a big enterprise is going to spend ten million next year on API calls, they can hire two or three people to work on this and save nine million. (26:50)
Aditya: As long as the ROI of investing in people and infrastructure makes sense for the next five years, they make that call. They might be heavily dependent on AI APIs and realize they can do a better job with specialization. A research paper from Morgan Stanley discussed how they built their own LLMs for financial problems. (27:21)
Aditya: They are building foundational things. (27:54)
Alexey: Instead of taking Llama and fine tuning it? (27:59)
Aditya: You can take an open source model, tune it on your data, and use reinforcement learning for your principles. Most big enterprises have ML and data teams already. If you have the infrastructure from legacy ML, it makes sense to have your own fine tuned model. (28:04)
Aditya: If you are a small to mid scale company, just go with OpenAI, Claude, or Gemini. Make sure your performance is above the bar for production. You do not want to invest in heavy engineering resources unless absolutely needed. (28:47)
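(The back-of-the-envelope economics behind this advice can be made explicit. Every number in the sketch below is an invented assumption for illustration, not real vendor pricing.)

```python
# Toy break-even model: third-party API vs. self-hosted fine tuned model.
# All figures are illustrative assumptions.

calls_per_day = 3_000_000
api_cost_per_call = 0.01          # assumed blended API price (USD)
api_annual = calls_per_day * 365 * api_cost_per_call

engineers = 3
engineer_cost = 400_000           # assumed fully loaded annual cost per engineer
gpu_serving_annual = 1_500_000    # assumed GPU and serving infrastructure
self_host_annual = engineers * engineer_cost + gpu_serving_annual

print(f"API route:         ${api_annual:,.0f}/year")
print(f"Self-hosted route: ${self_host_annual:,.0f}/year")
print(f"Savings:           ${api_annual - self_host_annual:,.0f}/year")
# At roughly $11M in API spend, $2.7M of people plus infra pays for itself;
# at thousands of calls a day, it clearly does not.
```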
Alexey: There are so many things around sending API requests to OpenAI. We need to build infrastructure tooling and deployment. You do not just mean deploying a fine tuned model, but also managing use cases that might use OpenAI or in house serving. (29:12)
Alexey: We need to monitor it and have logging and auditing to see if an agent follows principles. What kind of stack do you see people use? Do they implement things from scratch or use off the shelf open source or closed source tools? (30:00)
Agent MLOps: Guardrails and data lineage
Aditya: Sensitive industries like healthcare are cautious about AI because they want visibility. Big companies in finance and banking do not want to mess up. They first try to solve cases that are not super sensitive, like automating workflows to improve performance for analysts and developers. (30:26)
Aditya: Some are building their own tools because legal compliance prevents them from using external ones. There are verticals within infrastructure MLOps coming from startups focusing on individual things. For example, there are startups just for AI evaluation. (31:01)
Aditya: There are companies working only on the feedback loop from user responses. Others work on guardrails. Whatever LLM you use, you can use their guardrail API to ensure policies are vetted. You cannot rely completely on third party providers. (31:31)
Aditya: Auditability and AI governance companies are appearing to ensure compliance with country level legal changes. Their APIs and infrastructure integrate with your system to catch compliance issues. For an individual company to build this in a robust manner is a lot of work. (31:56)
Aditya: In a year or two, we will have clear winners in each infrastructure vertical. Cloud providers are going to do that themselves. If you are on Google Cloud, you would have AI governance and auditability as part of your suite. (32:27)
Aditya: We will see fine tuned verticals in the startup world making gains in market share. There will be competition between in house providers and these small enterprises focused on the MLOps space. Total reliability and robustness will get better in the coming months. (32:47)
Aditya: 2025 was a bit of a FOMO phase. Companies wanted to adopt AI but did not have proper tooling, infrastructure, or data labeling awareness. Adoption has not reached expectations because the ecosystem was not there to sustain AI agents in an enterprise setting. (33:25)
Aditya: Everyone has learned their lesson and is now moving to a space where they are more practical and pragmatic. They understand the mistakes they made and use the new tools and MLOps infrastructure available. (34:05)
Alexey: You mentioned auditability and compliance. I was wondering what kind of tools are there. In Europe we have GDPR, so a user can request all the data we have about them or ask us to delete it. Are you talking about frameworks that help an agent interact with a user while saving data to a monitoring system? (34:30)
Alexey: We want to be able to pull this data or delete it if the user wants. (35:14)
Aditya: I am not sure about the specific rules and regulations, but a company needs to understand what each agent is doing and how user data is processed. You need to ensure proper retention and data lineage. (35:21)
Alexey: Right, observability. A user sends a request, our agent processes it, calls certain tools, and fetches data. We want to see the cost and the response. (35:38)
Aditya: More than cost, you need to understand where your data has gone. One entry point agent sends it to another agent, which puts it into a database. There might be an external offline workflow processing this user data. Clarity on what LLMs are doing is one thing, but how data is processed in an organization is also important. (35:58)
Aditya: Lineage and visibility are crucial. You need internal principles to make sure data is protected and handled correctly. These tools will play a big role as maturity comes in 2025. (36:31)
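(A minimal sketch of what such a lineage record could look like. The schema and the helper function are hypothetical, invented for illustration.)

```python
# Minimal data-lineage event log for a multi-agent system: every hop of user
# data is recorded so GDPR-style access and deletion requests can be answered.
# Field names and record_lineage() are hypothetical.
import json
import time
import uuid

def record_lineage(user_id: str, source: str, destination: str, purpose: str) -> dict:
    event = {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_id": user_id,
        "source": source,            # e.g. the entry-point agent
        "destination": destination,  # e.g. another agent or a database
        "purpose": purpose,          # why this hop is allowed
    }
    # In production this would go to an append-only store, not stdout.
    print(json.dumps(event))
    return event

# Every agent-to-agent or agent-to-store hop emits one event, so a deletion
# request can be resolved by walking all events for that user_id.
record_lineage("user-42", "entry-agent", "refund-agent", "process refund request")
record_lineage("user-42", "refund-agent", "payments-db", "persist refund decision")
```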
Iterating on agents with user feedback
Alexey: What kind of use cases require this data lineage? (36:55)
Aditya: Healthcare and finance are examples of regulated environments. If you screw up your data, it is a big problem for you. (37:01)
Alexey: You mentioned AI evaluations, feedback companies, guardrails, and auditability. Guardrails I understand. I tried to edit my LinkedIn picture with Gemini. I asked it to change the picture and it said it is not allowed to change pictures of public people. (37:18)
Alexey: That was a guardrail being triggered. You do not want your system to do something, so you have a guardrail. I also understand that you want every use case covered. When you improve your agent by changing the prompt, model, or tool, you want to make sure it is not becoming worse. (38:02)
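(At its simplest, a guardrail is a policy check wrapped around the model call. The sketch below is a toy version: the patterns and the call_llm stub are placeholders, and real systems use policy models rather than keyword lists.)

```python
# Minimal pre/post guardrail around an LLM call. Everything here is a
# stand-in for illustration.
BLOCKED_REQUEST_PATTERNS = ["edit this photo of", "generate an image of"]
BLOCKED_RESPONSE_PATTERNS = ["social security number"]

def call_llm(prompt: str) -> str:
    return "stub response"  # stand-in for the actual model call

def guarded_call(prompt: str) -> str:
    # Input guardrail: refuse before spending a model call.
    if any(p in prompt.lower() for p in BLOCKED_REQUEST_PATTERNS):
        return "Sorry, this request is not allowed by policy."
    response = call_llm(prompt)
    # Output guardrail: screen the response before it reaches the user.
    if any(p in response.lower() for p in BLOCKED_RESPONSE_PATTERNS):
        return "Sorry, the response was withheld by policy."
    return response

print(guarded_call("Please edit this photo of a public figure"))
```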
Alexey: What about the feedback category? You said companies help collect feedback from users. (38:38)
Aditya: Consider a recommendation engine. Right now, chatbots directly ask for a thumbs up or thumbs down. There are explicit and implicit ways to get feedback. Implicit is when a user repeats a query because they are frustrated or reframing it. (38:49)
Aditya: In a search engine or generative web world, users iterate because they are not satisfied with the response. In the agentic world, like an airline chatbot, you want feedback on whether systems are missing something in the data set. (39:31)
Aditya: These companies take that input and identify the gaps revealed by bad user feedback. They generate synthetic data or use human in the loop labeling teams to create new data sets. They fine tune the LLM and reiterate on those edge cases. Over time, you find where users are not satisfied and update the evaluation data set. (40:03)
Aditya: You then understand if your new version is better than the previous one. If you had to do it yourself, you would need a lot of scaffolding and data pipelines. That is taken care of by third parties. (40:51)
Alexey: Thumbs up and thumbs down is cool, but many people do not use it. Frustration is a smart signal. If I start swearing at Claude Code, it means something is wrong. That is explicit feedback but not easily captured. (41:09)
Alexey: If I developed an agent and a user asked why it is doing something, that is a form of feedback. To implement this myself, I would need to go through all the logs and history to understand if the user is frustrated. These companies have tools for that. (41:38)
Alexey: I can give them my logs and they can identify where users are frustrated. I can then include that in my evaluations and improve my model. They can even take care of labeling and fine tuning the model for me. That is a cool end to end thing. (42:20)
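(One way to capture that implicit frustration signal is a reformulation detector over the conversation log. A rough sketch; the similarity measure and thresholds are arbitrary choices.)

```python
# Crude implicit-feedback detector: flag a session as "frustrated" when the
# user keeps rephrasing near-identical queries. Threshold values are arbitrary.
from difflib import SequenceMatcher

def similar(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def looks_frustrated(user_turns: list[str], sim_threshold: float = 0.6,
                     min_repeats: int = 1) -> bool:
    repeats = sum(
        1 for prev, cur in zip(user_turns, user_turns[1:])
        if similar(prev, cur) >= sim_threshold
    )
    return repeats >= min_repeats

session = [
    "How do I get a refund for my cancelled flight?",
    "How can I get a refund for the flight that was cancelled?",
    "I SAID how do I get my refund for the cancelled flight",
]
# True here: flagged sessions get routed into the eval/labeling pipeline.
print(looks_frustrated(session))
```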
Aditya: With the cloud, we saw companies like MongoDB and Redis appear alongside cloud providers. It made life easier for companies to have standard and efficient databases. These are small components companies do not want to do themselves so they can focus on their actual problem. (42:56)
AI evals for multi-tenancy and scale
Alexey: What kind of tools solve this feedback problem? Are they open source? (43:30)
Aditya: I am not sure about open source. I know a few companies. I met the co founder of one company who was telling me about feedback intelligence. It was recently acquired by a bigger company. (43:37)
Alexey: These are things companies need to have as part of their agent evaluations. (44:06)
Aditya: They have to do it. Otherwise, how will you measure how your agents are doing in production? It has to be a human in the loop looking at user response to understand success rates and satisfaction. (44:12)
Alexey: We have a few questions from the audience. How can we design AI evals for multi tenancy agent workflows? I have no idea what multi tenancy is. (44:34)
Aditya: Multi tenancy usually means customers are separate. In a cloud provider, databases and services do not talk to each other. They are in their own sandboxes but share infrastructure. (45:09)
Alexey: So hospital A does not want their data used with hospital B. They must be completely separate. If we do evaluations, we cannot put data together. We have to do it separately for each customer. (45:49)
Aditya: The question is how to build a multi agent system for that. (46:14)
Aditya: It depends on policies and whether data sharing is allowed for compliance reasons. Assuming they are sandboxed, you first need a golden data set for a specific use case. For an airline customer service agent, you understand all possible queries, like refunds or booking status. (46:19)
Aditya: You put those in your golden data set. Whenever you deploy or retrain, you need more than 95% accuracy. In critical segments, 100% of cases must pass. We are talking about natural language, not standard 0 or 1 accuracy. You need an LLM as a judge. (47:16)
Aditya: The judge is fine tuned on human data sets within that use case to give high confidence scores. In critical cases, you want the judge to give more than 0.8. However, an LLM judge might have bias. It is a complex world. (47:47)
Aditya: Once you have a golden data set, you run evals and check if models pass. You also want red teaming for stress testing in adversarial scenarios. Can we game this chatbot? You need guardrails for things like calling a Stripe API. (48:11)
Aditya: If the amount is above a certain level, you want a human in the loop for refunds. You measure how the human in the loop and agentic world are doing. If performance is high enough, you can deploy, but edge cases still go to a human queue. (48:42)
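(A compressed sketch of the evaluation gate just described: a golden data set, an LLM judge score per case with a 0.8 threshold, a 95% overall bar, and 100% required on critical segments. The judge_score stub stands in for a fine tuned judge model.)

```python
# Eval gate over a golden data set, with stricter rules for critical cases.
# judge_score() is a stub; thresholds follow the ones discussed above.

def judge_score(query: str, response: str, expected: str) -> float:
    return 0.9  # stub: a real judge model scores semantic correctness

GOLDEN_SET = [
    {"query": "What is my refund status?", "expected": "...", "critical": True},
    {"query": "Can I change my seat?", "expected": "...", "critical": False},
]

def gate(agent, threshold: float = 0.8) -> bool:
    passed = critical_passed = critical_total = 0
    for case in GOLDEN_SET:
        ok = judge_score(case["query"], agent(case["query"]), case["expected"]) >= threshold
        passed += ok
        if case["critical"]:
            critical_total += 1
            critical_passed += ok
    overall = passed / len(GOLDEN_SET)
    # Deploy only if >= 95% overall AND every critical case passes.
    return overall >= 0.95 and critical_passed == critical_total

print(gate(lambda q: "stub agent response"))
```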
Alexey: In a multi tenancy scenario, you just replicate this for each customer. You replicate exactly what you are doing. We need to be effective at doing this, because if we do it manually for every customer, we cannot scale. (49:18)
Aditya: Independent microservices which can be replicated for different customers through configuration would be ideal. Having a human labeling budget for each customer depending on different factors is the way to go. If you do it for one, doing it for a hundred should be more or less the same. (49:55)
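(Replication through configuration might look like the sketch below: one eval pipeline parameterized per tenant, so onboarding tenant one hundred is one more config entry. All names, paths, and numbers are hypothetical.)

```python
# Same eval/deployment pipeline replicated per tenant via configuration.
# Tenant names, thresholds, and run_eval_pipeline() are hypothetical.

TENANTS = {
    "hospital-a": {"golden_set": "s3://evals/hospital-a.jsonl",
                   "pass_bar": 0.99, "label_budget_per_day": 200},
    "hospital-b": {"golden_set": "s3://evals/hospital-b.jsonl",
                   "pass_bar": 0.99, "label_budget_per_day": 50},
    "airline-x":  {"golden_set": "s3://evals/airline-x.jsonl",
                   "pass_bar": 0.95, "label_budget_per_day": 100},
}

def run_eval_pipeline(tenant: str, cfg: dict) -> None:
    # Each tenant runs in its own sandbox: separate data, separate golden set,
    # shared code. No cross-tenant data ever mixes.
    print(f"[{tenant}] eval against {cfg['golden_set']} at bar {cfg['pass_bar']}")

for tenant, cfg in TENANTS.items():
    run_eval_pipeline(tenant, cfg)
```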
Alexey: How important is having a human in the loop in general? (50:11)
Aligning LLM judges with human labels
Aditya: It is super important. In a sensitive industry like healthcare, you cannot be completely model dependent. Even aside from sensitive industries, you need a human in the loop to understand how well the agent is doing in production. (50:18)
Aditya: If you do not have human performance data, you are completely reliant on an LLM as a judge. If any screw up happens in retraining the judge, you do not have the ground truth. Sampling from production traffic and labeling performance based on your criteria is how you understand gaps. (51:03)
Aditya: I would not be comfortable deploying an agent without a human in the loop because it is a big blind spot. Once you have that data, you can iterate on the next version. Everything in machine learning is about proper labeling. (51:39)
Aditya: An LLM as a judge makes it scalable, but small sampling ensures the judge is not lagging behind. You need to check for data or model drift in production. When rolling out a new feature, you need to make sure it is considered. (52:07)
Alexey: When I start working on an agent, I do manual evaluation on about 15 to 20 use cases. I curate a golden data set, but this is not scalable. As an AI engineer, I can do this at the start, but I cannot spend my time moderating output later. Do I get people in house, hire them, or use crowdsourcing? (52:38)
Aditya: Human in the loop needs to be at the start of the product. You need use cases from a product perspective and a data set labeled by humans at the start. When fine tuning or building a model, you need that data set. Pre deployment, you need human in the loop. (53:45)
Aditya: That provides robustness. You need to build services and infrastructure so you understand how each labeler is doing and how the LLM judge correlates with that. It needs to be above a 90% or 95% bar to ensure the judge reflects human judgment. (54:38)
Alexey: So we want to align the performance of our judge to the annotators. (55:19)
Aditya: That is the ground truth. The judge is trained on that data. Sample data needs to be sent to human annotators no matter what. This is just a confidence check on yourself. Even ten years down the line, you still want humans in the loop. (55:26)
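(The alignment check itself can be a small recurring job: sample production traffic, have humans label the sample, and measure agreement with the judge against the 90 to 95% bar mentioned above. A sketch with stub labels.)

```python
# Judge-vs-human agreement check on a sample of production traffic.
# human_labels and judge_labels would come from an annotation tool and the
# judge model respectively; here both are stubs.
import random

def sample_production_traffic(n: int = 100) -> list[str]:
    # Stand-in for pulling n random conversations from production logs.
    return [f"conversation-{i}" for i in random.sample(range(10_000), n)]

sample = sample_production_traffic()
human_labels = {c: random.choice([True, False]) for c in sample}  # stub annotations
judge_labels = {c: human_labels[c] if random.random() < 0.93 else not human_labels[c]
                for c in sample}                                   # stub judge

agreement = sum(human_labels[c] == judge_labels[c] for c in sample) / len(sample)
print(f"judge-human agreement: {agreement:.1%}")

# Keep trusting the judge for bulk evaluation only while agreement stays
# above the bar; otherwise retrain the judge on fresh human labels.
JUDGE_TRUST_BAR = 0.90
print("judge trusted" if agreement >= JUDGE_TRUST_BAR else "retrain judge")
```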
Agent infrastructure and deployment risks
Aditya: You might have high confidence after a year in production, but do not be too reliant on agents. They might develop consciousness at a certain point in time. That is not good. (56:40)
Alexey: I think we just answered one of the questions. It says: currently, we have five customers, and one of the customers has 10k conversations daily, so we need human annotators. (56:58)
Alexey: As we discussed, we can start with human annotators, but eventually, most conversations could be annotated by judges. We only need to make sure these judges align with the feedback from humans so they produce similar feedback. We should be wrapping up. (56:56)
Alexey: There are quite a few other questions, but I am not going to take them because they are quite technical. These involve using Kubernetes for agent deployment or moving data from RDS to S3. I do not think this is the right place to discuss these things. (57:14)
Alexey: You can ask these questions in our Slack. Before we finish, since I already brought up Kubernetes, do people still use it for deployment when it comes to agents? (57:34)
Aditya: I am not sure what the market state is, but I do not see a reason why agentic deployment cannot be done on Kubernetes. It is fundamentally a very mature and efficient distributed system. It takes care of managing all your services and machines. (57:47)
Aditya: If we are not using Kubernetes, there has to be a very hard use case for multi agent deployment that cannot be done on a Kubernetes cluster. Every agent is basically a simplistic microservice with a fancy non deterministic LLM. It should be replicable. (58:07)
Aditya: I see an issue where they might do a mix of GPU and CPU workloads. LLM inference would be another service talking to the agentic service within the Kubernetes infrastructure. I would be surprised if there is not a proper use case for Kubernetes in this agent world. (58:25)
Aditya: It depends on the current infrastructure state at that specific company. If the company is already using Kubernetes, it is easier for them to continue using it and adjust it for AI use cases. (59:00)
Alexey: We can answer one quick question before we finish. It goes back to our discussion about aligning judges and human annotators. If LLMs are perfect and achieve 99.9% accuracy, does that mean we no longer need humans? (59:12)
Aditya: This is basically a doom scenario where you do not want to be 100% reliant on LLMs. Even if the LLM as a judge is reaching 99.9%, you still want to keep some human in the loop. You would not need as much as you did during the initial part of project development. (59:37)
Aditya: You want that because if any bias is replicated in production over a period of time, how would you get a ground truth? Relying only on LLMs is a scary scenario that might get you into trouble. Having a small sampling from the production pipeline is necessary. (1:00:00)
Aditya: If the LLM is consistently 99.9% accurate for six months, I would trim down the budget. However, I would never have the agentic world doing that completely autonomously. This is a hard line where we need a ground truth from humans. (1:00:17)
Alexey: Anecdotally, there are so many stories where OpenAI deprecates an old model and you upgrade to a newer one, and all your prompts stop working. You want to be sure this does not happen. With newer models, it is not as drastic as the change from GPT-3 to GPT-3.5. (1:00:50)
Alexey: There could still be small nuances we need to be aware of. Your suggestion would be to sample at least 100 examples per day or whatever makes sense for your use case? (1:01:19)
Aditya: I would not advocate for removing humans completely. It does not matter whether you are in a sensitive or regulated industry. Too much reliance on the LLM side is not good. You need sample data, and you should send it to human annotators no matter what. (1:01:34)
Aditya: This is a confidence check on yourself. Even ten years down the line after achieving 100%, you still want human checks. You may have high confidence, but do not be too reliant on agents. (1:01:59)
Future of AGI and multimodal agents
Alexey: I watched Tron recently. The program was designed to keep things under control, but decided to go against the creator. (1:02:35)
Aditya: I think we will have some of those moments in the next few years. The Claude bot recently started developing its own language, which we did not see coming. This is an inflection point. It is debatable whether they are getting more conscious. (1:03:01)
Alexey: We send a request to a provider and get back a response. If a bot reads that we should come up with our own language and then the LLM picks this up, we have a problem. (1:03:29)
Aditya: A scary scenario would be if the language they are building over time had intelligence and an underlying construct. They might have phonetics for when they are angry. It is probably gibberish right now, but I would not be surprised if something comes out in the next few years. (1:03:54)
Alexey: It has to be in the prompt for the model to understand. It has to be documented. If it is not documented, the model will not be able to use it unless we have a feedback loop. OpenAI trains new models on discussions from Reddit generated by LLMs. (1:04:20)
Aditya: Can a model do something it has not seen before? If models eventually reach a point where they make a scientific discovery humans cannot, that is what some call AGI. If they develop a vaccine, a medicine, or a new neural architecture that human researchers failed to create, that is superintelligence. (1:04:43)
Aditya: We humans study through undergrad and eventually invent something new. LLMs are already at a PhD level in certain scenarios, based on benchmarks. I would not be surprised if they come up with something humans have failed to do. (1:05:30)
Alexey: I hope in three years we can have another call and laugh at what happens. (1:06:06)
Aditya: You might have a call with a bot that is smarter than a human and visually looks like a human. Multimodality is going in an exponential direction. In three years, I would not be surprised if you give access to your photo gallery and give a prompt to produce a movie. (1:06:12)
Aditya: The movie could show your kids grown up looking exactly like they did in childhood. It could show you as you were ten years ago. That would be an aha moment for many people. Producing a 4K 30 minute movie based on family photos would be amazing. (1:06:53)
Alexey: Maybe I will also have an agent, so our agents can talk. (1:07:18)
Aditya: I feel a lot of people will do that. We use them for small things like calendar updates. I still manage my calendar myself, but I wish I had an agent for that. (1:07:25)
Alexey: I think we should be wrapping up. Thanks a lot for joining us today and answering our questions. It was very interesting to listen to what you learned from talking to so many people. Thanks everyone for joining us and asking questions. I am sorry we could not answer some questions because they were too technical for this discussion. For technical questions, it is better to go to Slack. (1:07:50)
Alexey: Sign up for Data Talks Club in the description and join Slack to ask engineering questions. That is all for today. (1:08:11)
Aditya: Thank you, Alexey. Thank you everyone for joining. Have a good one. (1:08:30)