Ranjitha Gurunath Kulkarni is a Staff Machine Learning Engineer at NeuBird.ai. Previously, she built LLM- and agent-powered product capabilities at Dropbox Dash and worked on speech recognition, language modeling, online metrics, and assistant evaluation at Microsoft. Her publications span voice query reformulation and automatic online evaluation of intelligent assistants, and her patents include automated closed captioning using temporal data and hyperarticulation detection. Ranjitha holds a master’s from Carnegie Mellon University (Language Technologies Institute).
The transcripts are edited for clarity, sometimes with AI. If you notice any incorrect information, let us know.
Alexey: Hi everyone, welcome to our event. This event is brought to you by DataTalks.Club, which is a community of people who love data. Every time I say that, I remember that when automatic speech recognition software processes this, it recognizes what I say as Data Docks Club. I need to work on my articulation so these recognizers can actually deal with my accent. (0:00)
Alexey: DataTalks.Club. Anyways, we have quite a few events in our pipeline, although recently we had quite a few already. If you want to stay up to date with all the events we have, there is a link in the description. Go there, click, and you'll see all the events we have. You can also subscribe to our newsletter. We constantly send reminders about future events. (0:33)
Alexey: Do not forget to subscribe to our YouTube channel. This way you'll stay up to date with all the future streams. I'm just curious how many subscribers we have. I think it's around 66,000 right now, though my information is probably outdated. So if you want to be number 66,001, do it now. (1:03)
Alexey: Last but not least, you can join our Slack community if you want to hang out with other data enthusiasts. Check it out. If you have any questions that you want to ask during the interview, there is a pinned link in the live chat. Click on that link, ask your questions, and we will be covering these questions during the interview. This intro that I do is the same every time, so for me, it gets a little boring, and I improvise a little. (1:33)
Alexey: Let me clear my throat one second. Sorry, I think I'm good. Ranjitha, are you ready? (2:18)
Ranjitha: Yes, Alexey. (2:24)
Alexey: Great. Today on the podcast we are joined by Ranjitha, staff ML engineer at NeuBird.ai, previously at Dropbox. You have been building LLM- and agent-powered products, with earlier work in speech recognition and NLP at Microsoft, and research at Carnegie Mellon. You have quite a nice career journey. It's a big pleasure to have you here at our event. (2:30)
Ranjitha: The pleasure is mine. (2:56)
Alexey: Okay. So tell us about your career journey so far. I think I outlined it, but this is pretty interesting. I see you have worked at quite a few companies, so please tell us more. (3:01)
Ranjitha: Yeah, definitely. In a nutshell, my career has been machine learning and NLP from the start. It started during my undergrad, which was the first time I came across this stuff. That was back in 2010 or 2011, I think. We didn't really have many courses on AI or any structured way of learning back then. My friends and I were curious, so we just started building something like an image search engine. (3:12)
Ranjitha: This was back when we were using the Sundance search engine and OpenCV for extracting features from images and so on. It was kind of like a bag-of-visual-words approach, or whatever that approach was called. We were extracting features from the images and doing segmentation. It was hard to do, but it got me hooked on this whole space. I just wanted to learn more. (3:55)
Ranjitha: That’s what brought me to CMU to do my masters. I learned a lot from amazing professors, colleagues, and some startups that were there locally in Pittsburgh. From there, I joined Microsoft, where I worked on speech recognition, language modeling for Xbox, Bing Search, Bing voice search, and Cortana. I met some amazing people who had been doing this for many years and learned a lot from them. (4:25)
Ranjitha: Then I wanted to broaden my scope in ML a little bit, so that brought me to Dropbox. Initially, I started working on recommendation systems. (4:57)
Alexey: What kind of recommendation systems are there? I’ve been a Dropbox user for many years and don’t remember it recommending me a single thing. (5:06)
Ranjitha: Dropbox web provides suggestions on files we think you might open based on your usage. (5:19)
Alexey: I understand. Since I don’t use Dropbox through the web, for me it’s just a folder. But when I open Google Drive, I see recommendations. So Dropbox has something similar? (5:34)
Ranjitha: Yes, it has had it for a while. I worked on that feature and some auxiliary things around it, trying to improve the quality. That's where I dived deeper into neural networks to understand how these things work, because when I started my career, neural networks weren't a thing yet. Later, I moved back into NLP and started working on building a question answering system. This was an incubation team at Dropbox that wanted ML expertise. (5:52)
Ranjitha: Then my manager and I, along with some amazing engineers, started a new team focusing on agents. This was back in December 2022, before the term "agents" was popular. We were trying to innovate beyond just Dropbox, exploring other data sources and ways to improve the system. I was fine-tuning T5, and then GPT-3.5 came out, which disrupted the whole market. (6:34)
Ranjitha: After working on agents at Dropbox, I was drawn to NeuBird, where I am now. I'm fully immersed in the potential of these agents. We are trying to solve the problem of engineering on call, taking that away from users and letting agents handle tasks. This allows engineers to focus on the more fun part of building the system. (7:44)
Alexey: That’s interesting. As a developer, I would be happy not to be on call. My colleagues, who were software engineers, had to be on call, and when they received a notification from PagerDuty, they would open their laptop, check the logs, and do all this stuff. For them, it’s good not to do these things anymore. (8:17)
Ranjitha: You are lucky. Even though I’ve been doing machine learning forever, I’ve had to be on call many times. I totally understand the pain of being woken up at 3 a.m. (8:59)
Alexey: That’s why you decided to join the company? (9:11)
Ranjitha: Exactly. That vision is what brought me here. (9:16)
Alexey: Do you know where the name comes from? (9:20)
Ranjitha: I actually don’t know. I think the founders are just really into birds. (9:22)
Alexey: NeuBird. As I mentioned, "neu" means "new" in German. Maybe "NewBird" was already taken, so they chose this name. (9:30)
Ranjitha: I will have to ask them about that. (9:37)
Alexey: It’s very difficult to start a startup now because most domain names are already taken. You need a name that’s not taken but also doesn’t sound like gibberish. (9:43)
Ranjitha: Yes, there are a lot of creative names these days. (9:55)
Alexey: I remember Kaggle. I interviewed one of its creators, Anthony Goldbloom. They wrote a script to go through all relatively short names, check if they were available, and that's how they came up with Kaggle. (10:01)
Alexey: So, what do you do at NeuBird? (10:29)
Ranjitha: I’m building agents and everything around the agent ecosystem. It’s been just four weeks since I joined, and I can already understand the sheer complexity of the problem. (10:35)
Alexey: You just joined? (10:49)
Ranjitha: Yes, just a month ago. (10:51)
Alexey: Congratulations. (10:54)
Ranjitha: Thank you. I’m working on building the whole story of how LLM agents can work more reliably so that customers who are happy today are still happy tomorrow. (11:00)
Alexey: You’ve been in the industry for quite some time. You also started machine learning in 2010. My interest started around 2012. I remember taking courses on reinforcement learning where the agent was defined differently. So I’m curious, how do you define an agent? (11:14)
Ranjitha: Right. This is something I usually start all my talks with: “What is an agent?” because it’s not a very well-understood concept. (11:48)
Ranjitha: Back in reinforcement learning, agents were basically tasked with completing a goal or objective. Agents are automated pieces of software that go and complete the given task. You tune them to improve performance according to the objective function. The core idea remains the same: autonomously completing a given task. (12:01)
Ranjitha: Today, LLMs are at the core, powering these agents. LLMs are the brain of the agents. What defines a type of agent varies because everyone has their own recipes. Agents orchestrate multiple calls to LLMs, tools, knowledge stores, and memory. At the end of the day, an agent performs a task autonomously with the help of LLMs, tools, memory, and storage to please the user. (12:31)
Alexey: I experimented with agents the other day. I thought, can I make an agent without an LLM? The answer is yes. We can make decisions with other mechanisms. Right now, we delegate decisions to an LLM, which acts as the brain. But it could also be simple heuristics depending on the type of agent. (13:31)
Alexey: In my case, it was an environment with people applying for jobs and employers. They interact based on skills and job positions. The goal was to simulate these interactions. Without any LLM, it still worked. But adding an LLM lets one entity talk to another, like an interviewer asking questions in a simulation. (14:01)
Ranjitha: Yeah, a fun side project. (15:04)
Alexey: Would you agree with the definition that an agent is just an LLM with tools? (15:10)
Ranjitha: It’s not just that. There is a wrapper around it, and the way you plan it can differ. Agents can complete a task in a single step or multiple steps. A step might involve calling a tool, processing information, or other actions. Complexity grows with multi-step processes. (15:23)
Ranjitha: There’s also the concept of iterations and feedback loops. Is it a single pass or multipass system? Multipass involves self-reflection and correcting plans based on outputs. These multipass systems make the most complex agents, which everyone is trying to develop. Many start with single-step or single-pass systems, executing a plan to achieve the goal with some amount of self-reflection. (16:11)
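To make the single-pass versus multipass distinction concrete, here is a minimal sketch, assuming a hypothetical `llm()` completion helper. This is an illustration of the idea only, not any particular product's implementation:

```python
# Illustrative sketch of single-pass vs. multipass (self-reflecting) agents.
# `llm` is a hypothetical completion helper, not a real API.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def single_pass(task: str) -> str:
    # One shot: plan and answer in a single LLM call, no feedback loop.
    return llm(f"Solve this task step by step:\n{task}")

def multipass(task: str, max_iters: int = 3) -> str:
    answer = llm(f"Solve this task step by step:\n{task}")
    for _ in range(max_iters):
        # Self-reflection: ask the model to critique its own output...
        critique = llm(f"Task: {task}\nDraft answer: {answer}\n"
                       "List any mistakes, or reply OK if there are none.")
        if critique.strip() == "OK":
            break
        # ...and correct the plan/answer based on that critique.
        answer = llm(f"Task: {task}\nDraft: {answer}\n"
                     f"Critique: {critique}\nWrite a corrected answer.")
    return answer
```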
Alexey: You mentioned planning is different across agents. You can have an agent with no LLM, but if I ask it to create a Django project, it might not manage that. A simple language model can create something, but not as elaborately as a reasoning model. (17:17)
Ranjitha: Exactly. The plan can be dynamic or predetermined. In your example, the agent had a fixed task, a fixed set of tools, and an order of steps. That can work with some fail statements. Planning doesn’t always need to be dynamic. (17:48)
Alexey: How is this implemented? Frameworks like LangChain or the OpenAI Agents SDK allow you to define system prompts, user prompts, and tools. The agent invokes tools to execute tasks. Where does planning come in? (18:23)
Ranjitha: With predefined tools and SDKs, the logic of planning is embedded. Prompts and tools define how the plan is generated. Most startups or companies define their own agentic platform with custom tools and SDKs. It’s very hard to build this generically. It’s easier to focus on the problem you’re solving and see what type of agent makes sense. (19:06)
Ranjitha: Some agents plan in plain English, others in code, so-called code agents. The choice depends on the task complexity. For natural language problems, natural language-based agents work. For very complex tasks with many steps and conditionals, programmatic code agents are better. (19:58)
Ranjitha: I prefer code agents. They reduce ambiguity and improve predictability. Many use cases fit well with code agents. It just feels like most of the experimentation leads to this type. (20:33)
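For flavor, here is a toy sketch of the code-agent idea: the model emits Python instead of prose, and a harness executes it. The `llm()` helper is hypothetical, and real code agents use proper sandboxes rather than a bare `exec`:

```python
# Toy "code agent": the LLM writes Python, which we execute against a
# whitelist of tools and read back a `result` variable.
# Purely illustrative; production systems use real sandboxes.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def run_code_agent(task: str, tools: dict) -> object:
    code = llm(
        "Write Python that solves the task using only the provided "
        f"functions {list(tools)}. Assign the final answer to `result`.\n"
        f"Task: {task}"
    )
    namespace = dict(tools)                        # only whitelisted tools are visible
    exec(code, {"__builtins__": {}}, namespace)    # NOTE: not a real sandbox
    return namespace.get("result")
```

The appeal Ranjitha mentions, less ambiguity and more predictability, comes from the fact that the plan is now an executable artifact you can inspect, rather than free-form prose.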
Alexey: Can you talk about the agents you use at work right now, or is it not something you can share? (20:53)
Ranjitha: I mean I can't share too much in detail, but it's basically the… (21:00)
Alexey: I'm just curious what you are working on specifically. What kind of agents? Of course, you can't or shouldn't share a lot, but in many cases the devil is in the details, right? (21:08)
Ranjitha: Exactly. I can tell you at a high level: the agents are trying to build the right context to present to the LLM. A lot of it is what people are talking about these days, context engineering. But this is something we had to bake in from the time I started working on agents, because back then we had models with a 4k context window. You don't have the luxury of stuffing everything into the context, so you have to be very deliberate about what you send to the LLM. (21:21)
Ranjitha: And what we learned in that process is that when you do that, you're not losing much. You learn to fit the context into the LLM's window by providing only things like metadata, or the way to plan. So you ask the LLM, "Hey, what are the steps to do something like this?" rather than giving it all the data and saying, "These are all the documents I have, can you dig into them and find me the right answer?" So it's a lot to do with context engineering. Even at NeuBird, we are doing a lot of that, trying to reduce the noise that we feed into LLMs. And it's a complex agentic system that was built even before I joined. (21:59)
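As a sketch of that deliberateness, with hypothetical helpers and document objects: instead of stuffing raw documents into a small window, you send a compact catalog and ask for a plan first:

```python
# Context engineering sketch: send compact metadata, not raw documents.
# `llm` and the document dicts are hypothetical stand-ins.

def naive_prompt(question: str, docs: list[dict]) -> str:
    # Anti-pattern with a small context window: stuff everything in.
    return "\n\n".join(d["text"] for d in docs) + f"\n\nQuestion: {question}"

def engineered_prompt(question: str, docs: list[dict]) -> str:
    # Send only titles and summaries, and ask for a plan. Full text is
    # fetched later, and only for documents the model actually asks for.
    catalog = "\n".join(f"- {d['title']}: {d['summary']}" for d in docs)
    return (
        f"Available documents:\n{catalog}\n\n"
        f"Question: {question}\n"
        "Which documents do you need, and what are the steps to answer?"
    )
```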
Ranjitha: There are so many tools. Imagine being an SRE: what would you do? You have to look at so many different data sources, logs, and metrics. (22:50)
Alexey: So I imagine, let's say I am on call and I receive a message saying that a system is failing. I open whatever tool we use for logs, investigate them, then go to Kubernetes and check the deployments or pods, look at their logs, and try to find the reason. I might go to the source code to see where the issue is coming from, and maybe do a quick fix or roll back to the previous version. That's what I would do. Does your system mimic this behavior? (23:06)
Ranjitha: It does mimic that kind of behavior, in its own special way. That is how you as an SRE would do it, but the agent, at the end of the day, is something like 23 or 24 of us building together, right? All of our personalities go into it, and what wins eventually is whatever can solve a problem repeatedly, again and again. (23:44)
Ranjitha: We get customer emails saying, "Yeah, this really worked well for us," and those are the things we learn from. That is the feedback we get. As a startup, you don't yet have the luxury of lots of data to look at, right? You're waiting for these individual pieces of customer feedback. Our team is amazing in that sense. I'm new to startups, and when I look at it, I'm so amazed by it. (24:19)
Ranjitha: If a user says something is not working, everybody jumps on it: "Okay, it should have done this, it didn't, so how do we fix it?" That's the kind of mentality that propels a startup forward. (24:46)
Alexey: What occurred to me is that every setup, every infrastructure, can vary drastically from company to company. One company uses Datadog, another uses New Relic, a third one uses something built in-house, on-prem; then there's ELK or Elasticsearch; there are so many different tools. Some use Kubernetes, some god knows what deployment. There is such a variety. Do you have to restrict it at some point and say, "Okay, I work only with this tech," or what do you do? (24:59)
Ranjitha: True, true. There is a cap after which you say, "We will keep these as highest priority," and deprioritize the rest. But the challenge and beauty of these agentic systems is that you bake in generality: you can abstract away the details of which API you are talking to and what kind of data you are looking at, so your implementation is generic enough. (25:43)
Ranjitha: Then it's just a matter of building those connections with the agent. If you build a connection whenever a new kind of data source becomes available, you're mostly good. There are peculiarities, of course; a particular data source might come with its own quirks. For example, "No, we will write the log in reverse, with status codes at the end and the error message before them." Things like that happen too, and for those you need domain knowledge. (26:08)
Ranjitha: Unfortunately, there's nothing you can magically do about this, and that's where you consult an SRE. We have an SRE on our team as well. (26:53)
Alexey: So you haven't fully delegated that to agents yet. (27:05)
Ranjitha: No, no, no. It's all agents, but yes, we do have one SRE. (27:13)
Ranjitha: So it's something where you're constantly learning from humans who have been doing this for ages and trying to instill that knowledge into the system. But the difference is that as a human, you might be good at solving one type of problem or one set of tools, right? (27:20)
Ranjitha: Your learning curve is much steeper compared to this agentic system, which, once you build it, has the capability to spread across multiple tools and multiple integrations. That's the advantage of having an agent do this versus a human doing it. (27:33)
Alexey: Yeah, interesting. You mentioned one keyword that I see quite often these days on LinkedIn and Twitter, sorry, X on social media, basically context engineering. What is context engineering and how is it different from prompt engineering? (27:59)
Ranjitha: Okay, I think it's just a mindset shift. It's more of a rephrasing or rewording of the whole thing so that you look at it from a different perspective. Prompt engineering has always been this thing where I give my instructions in a way that makes the model work. If I were to put it succinctly, I would say context engineering is a subfield of prompt engineering. (28:17)
Ranjitha: But the main focus when people said prompt engineering was, "I will put this instruction at the top, I will move the instruction to the bottom, I will write in all caps," things like that. That was the core of prompt engineering most of the time. Context engineering is just being more deliberate about what information you give to the LLM rather than stuffing everything in. You can’t just give a thousand documents and expect it to understand everything fully. (28:52)
Ranjitha: We still need to reduce the amount of noise that we put into an LLM’s context, and that’s what context engineering is. (29:30)
Alexey: Yeah, I remember at the beginning of this year seeing posts that RAG is dead because we have these models that can take your entire knowledge base and answer questions based on that. From what I see, RAG is not dead yet, right? (29:35)
Ranjitha: Yeah, definitely not. (29:51)
Alexey: The main idea is yes, technically, there are models that can take your entire knowledge base, code base, or whatever base, but then how good are they at using all the information you give them and how fast? You probably don’t want to wait too long for the LLM to process a huge prompt with everything. So probably you want to be smart about what kind of information you give to an LLM. (29:58)
Ranjitha: Yes, it is latency, it is cost, and it is also garbage in, garbage out. If you put a lot of noise in, then your model only has so much to work with. Models have become more capable; you can fill up to a 32k context window, but beyond that, many models don't do very well. If you want it to work reliably every time, you want to reduce that to a smaller context window, so you don't burden your LLM with processing everything at runtime. Doing some preprocessing beforehand helps. That's where experience in machine learning comes in handy when dealing with LLMs. RAG is not really dead, but it doesn't solve everything and has its shortcomings. (30:27)
Ranjitha: We are now realizing a world where RAG has shortcomings because LLMs are really smart, but the backend that supplies context to LLMs is an old information retrieval system. Historically, it wasn’t designed for this use case. It gives 10 blue links, and humans go click on things. Now we are trying to morph it into something that fits the problem of feeding context into LLMs. (31:38)
Alexey: Would you agree that RAG is one of the simplest examples of context engineering? Instead of giving the entire database of 5,000 records, you select the top 10 most relevant ones, and by providing that, you engineer the context. The LLM knows what to focus on instead of going through the entire knowledge base. (32:16)
Ranjitha: Yes, at a high level, it is that. Along with that, there is a wrapper that presents information in a way that is more conducive for the LLM to understand. For example, you can chunk documents into pieces of 200 lines each. But that is almost always lossy. If you just break by length or by paragraphs, there is still engineering needed to embed context into the chunk. You need to know which document it is from, what question it tries to answer, and what has been learned so far. That improves the quality of your system in RAG cases. (32:48)
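A small sketch of that "embed context into the chunk" point, with illustrative names: each chunk carries a header saying which document and section it came from, so retrieval results stay interpretable downstream:

```python
# Contextual chunking sketch: prepend provenance to every chunk so the
# retriever and the LLM both know what they are looking at.

def chunk_document(doc_title: str, section: str, text: str,
                   chunk_size: int = 200) -> list[str]:
    lines = text.splitlines()
    chunks = []
    for i in range(0, len(lines), chunk_size):
        body = "\n".join(lines[i:i + chunk_size])
        # The header travels with the chunk into the vector store.
        header = (f"[doc: {doc_title} | section: {section} "
                  f"| part {i // chunk_size + 1}]")
        chunks.append(f"{header}\n{body}")
    return chunks
```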
Alexey: What are other examples of context engineering? (33:56)
Ranjitha: Well, a lot of agents work by saying, "These are the tools I have available," and "This is how I solved this problem before." Giving examples influences how the LLM thinks and produces output in a structured format. All those are examples of influencing the LLM by engineering the context so that the output is meaningful, rather than, "These are all the logs, go look at it." (34:02)
Alexey: In the case of NeuBird, what kind of context engineering do you have? Is it all variations of RAG where you index all the logs, or are you doing something else? (34:50)
Ranjitha: Yeah, there's a bit of this and a bit of that. I can't tell you more than that, but we are influencing the LLM. You may not apply RAG everywhere. For search use cases with billions of documents, you would apply RAG. But for very task-specific things with domain knowledge, you apply the right tools. I view RAG, or search and information retrieval, as a tool itself. You use it when needed. (35:09)
Ranjitha: When people say RAG is dead, it’s because vanilla RAG, doing embeddings and vector search and putting it into LLMs, is a very set workflow. What we are moving toward in agents is getting rid of set workflows and making it dynamic. Agentic RAG is one step, and agents go further, taking RAG as a tool rather than an end. (36:11)
Alexey: So basically, we have an agent. The agent has some tools. One of them is search, which performs search in the database that we have chunked and preprocessed, but we let the LLM decide when it needs to use it. (36:41)
Ranjitha: Exactly, if needed. (37:04)
Ranjitha: There are many ways to query this information. You can do a search query, or you can say, "It’s in a table, get me stuff from the table," or "It’s in MongoDB, get me all documents with this value." These are just tools, and knowing which tool to use when is something we teach our agents. They do the orchestration. (37:04)
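In sketch form, treating retrieval as just one tool among several. All names here are illustrative stubs; the surrounding agent loop would show the model these descriptions and execute whichever tool it picks:

```python
# RAG-as-a-tool sketch: search is one entry in the toolbox, and the
# model decides when to use it. All functions are illustrative stubs.

def vector_search(query: str) -> str: ...          # classic RAG retrieval
def sql_query(table: str, where: str) -> str: ...  # "it's in a table"
def mongo_find(collection: str, filt: dict) -> str: ...  # "documents with this value"

TOOLS = {
    "vector_search": vector_search,
    "sql_query": sql_query,
    "mongo_find": mongo_find,
}
# The agent's orchestration loop presents these tools to the model and
# dispatches its choice; retrieval is a step, not the whole pipeline.
```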
Alexey: Since we started talking about this, the industry is converging to RAG tools. What kind of business problems are good for these solutions? When is classical RAG enough, when is RAG one of the tools, and when do you not need RAG or AI at all? (37:39)
Ranjitha: That’s a good question. I think RAG works well for reducing a really large search space into something small. RAG does well when you have a large search space and the task is simple, like question answering based on a piece of content. Imagine Dropbox with millions of documents; you don’t need to organize data neatly. You want to find a keyword or an answer to a question. (38:13)
Ranjitha: RAG is less good when context matters, like what you were doing until now, the time of day, or what is happening right now. For a needle-in-a-haystack problem with lots of knowledge, RAG is still useful. When problems get complex, with multiple data sources, dynamic planning, or multiple API integrations, you move to agents. (38:59)
Ranjitha: By dynamic planning, I mean whenever an input comes, you want your LLM or agent to plan its trajectory based on the input. (40:00)
Alexey: Like when I get a task, I might subconsciously think about the sequence of actions I need to execute. If a PagerDuty notification comes, I look at it and decide, "I need to go there first, then here, then here." (40:06)
Ranjitha: Exactly. So a very simple, easy-to-understand example would be a calendar assistant. You talk to it and say, "I want to schedule a meeting for half an hour tomorrow with my manager or my skip level." There are many hidden understandings or context that I assume the LLM knows about me, such as the meeting duration, my time zone, who my manager is, and their working hours. The LLM has to figure these out and check the calendar to see when both are free. (40:30)
Ranjitha: It might also consult another source to find out who the manager is. Then it has to check if they have a Google Calendar or an Outlook calendar. The LLM is coming up with this plan as the input changes. For example, if my input changes to "Summarize my meeting notes from yesterday," it won’t go to the calendar. It will go into Zoom transcripts and summarize meetings that happened yesterday. (41:18)
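Here is a hedged sketch of the tool surface such a calendar assistant might plan over. Every function is hypothetical; the point is that the model picks the sequence at runtime, based on the request:

```python
# Hypothetical tools a calendar assistant could plan over dynamically.

def get_manager(user: str) -> str: ...                        # org-chart lookup
def get_timezone(user: str) -> str: ...
def get_free_slots(user: str, day: str) -> list[str]: ...     # calendar lookup
def create_event(attendees: list[str], start: str, minutes: int) -> str: ...
def get_transcripts(user: str, day: str) -> list[str]: ...    # Zoom transcripts

# "Schedule 30 min with my manager tomorrow" might resolve to:
#   get_manager -> get_timezone -> get_free_slots (x2) -> create_event
# while "Summarize my meeting notes from yesterday" skips the calendar
# entirely and calls get_transcripts instead. The plan follows the input.
```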
Ranjitha: Does a tool like that exist? I was building something like this at Dropbox before I left. Lots of people are actually building this kind of assistant. (42:00)
Alexey: I imagine so. I often use voice recognition with ChatGPT to communicate and then check the result. I could just say, "Summarize what we talked about yesterday on the podcast with Ranjitha about agents," and it would fetch the transcript, summarize it, and give it to me. Easily doable. (42:13)
Alexey: Okay, I want to have that. Let’s meet for a hack project. You said Dropbox has it. (42:54)
Alexey: Do I need to put all the things into it? (43:00)
Ranjitha: Right now, the product I was building, though I don't know what has changed since I left, is Dropbox Dash. It's an AI-powered search and assistant. It helps with productivity use cases and connects to all the SaaS tools you use at work. You should check it out. (43:06)
Alexey: I guess I need a paid account, right? (43:26)
Ranjitha: I do not know the answer to that right now. (43:32)
Alexey: I am checking, and there is a "Contact Sales" link. That's a problem for me, but it makes sense: Dropbox is used by enterprises, which are their main target audience. Multiple people use the same folders or Dropbox. (43:38)
Ranjitha: Yes, because the content there is much more expanded and discovering other people’s documents becomes more challenging. It’s a better fit for enterprises. (43:56)
Alexey: At the beginning, we discussed frameworks. I mentioned Pydantic AI and the OpenAI Agents SDK. Do we need to learn frameworks? If yes, which ones make our life easier, or should we implement everything from scratch? (44:08)
Ranjitha: I've tried a few frameworks, but at work I don't really use them. One I liked is called smolagents, a Hugging Face library. It thinks programmatically and helps bootstrap agents. I would still suggest building from scratch, because frameworks can become complex and you may not understand what the agent is doing. (44:36)
Ranjitha: An agent is a bunch of tools and instructions. You can implement your own memory and data queries. That’s my suggestion. (45:39)
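A minimal from-scratch loop in that spirit, using the OpenAI Python SDK's function calling as one possible client. The single tool is a stub, and the model name and prompts are placeholders:

```python
# Minimal from-scratch agent: instructions + tools + a message list as
# memory. Uses the OpenAI Python SDK; the tool itself is a stub.
import json
from openai import OpenAI

client = OpenAI()

def get_weather(city: str) -> str:
    return f"Sunny in {city}"              # stub tool for illustration

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def run_agent(user_msg: str, max_steps: int = 5) -> str:
    messages = [{"role": "system", "content": "You are a helpful agent."},
                {"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content             # no tool requested: final answer
        messages.append(msg)               # keep the tool request in memory
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = get_weather(**args)   # dispatch (single tool here)
            messages.append({"role": "tool",
                             "tool_call_id": call.id,
                             "content": result})
    return "step limit reached"
```

The whole agent is about forty lines, which is Ranjitha's point: when you own the loop, you always know exactly what the agent is doing.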
Alexey: What do you think about LangChain? (45:52)
Ranjitha: LangChain has its uses, but I haven’t used it much for agents. Early on, it couldn’t handle ambiguity in natural language. It has improved and has new agents to experiment with. (46:00)
Alexey: I experienced debugging nightmares with LangChain. OpenAI Agents SDK is lightweight, so I understand what’s happening. With LangChain, it’s often unclear. (46:45)
Ranjitha: Yes, it’s a complex mesh of multiple things talking to each other in natural language, which can get confusing. (47:21)
Alexey: I agree with your suggestion to implement agents from scratch. It helps understand how multiple agents can work, and that one agent can be a tool for another agent. It clarifies how these systems actually work. (47:33)
Ranjitha: Eventually, agentic platforms will be usable as-is. At Dropbox, we dreamed of an agentic marketplace where people can build agents and buy or interact with each other’s agents. (48:00)
Ranjitha: MCP solves the problem of having one protocol to communicate with tools, but that’s it. (48:35)
Alexey: You mean marketplaces where, for example, an AI site reliability engineer can access data from New Relic, AWS CloudWatch, and other systems. MCP makes this easier. (48:49)
Ranjitha: Yes, MCP is useful if you don’t want to provide full API specs for custom tools. It abstracts your tools. The agent marketplace is different; it’s a complete agent solving a task. ChatGPT plugins are similar but not exactly the same. (49:24)
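For a taste of what that abstraction looks like, here is a tiny server sketch using the official MCP Python SDK (FastMCP); the tool itself is a made-up example:

```python
# Tiny MCP server sketch using the official Python SDK (FastMCP).
# The tool is a made-up example; any MCP-speaking agent can discover
# and call it without custom API glue.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("logs-demo")

@mcp.tool()
def tail_logs(service: str, lines: int = 50) -> str:
    """Return the last N log lines for a service."""
    return f"(last {lines} lines of {service} logs would go here)"

if __name__ == "__main__":
    mcp.run()   # exposes the tool over the MCP protocol (stdio by default)
```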
Alexey: How do you evaluate agents? For RAG, evaluation is simpler because the flow is fixed. With agents, we need to evaluate the answer, tool calls, parameters, and more. (51:17)
Ranjitha: Good question. RAG is not easy either; it takes time to build a system. Public benchmarks like SQuAD evaluate model capability, not your system. You need to create your own dataset that represents real users. (51:42)
Ranjitha: Agents multiply the challenge with multiple LLM prompts. You also need to evaluate if the tools are doing what they are supposed to do. We approached it like software engineering tests. The agentic system is like a software system: input gives predictable output. (53:20)
Ranjitha: Integration tests mock inputs and assert outputs. For a calendar agent, 200–300 test cases help ensure reliability. For SRE agents, mocking logs, metrics, and data sources is more complex, but doable. (54:19)
Alexey: You mock tools to avoid external service calls. You check if the agent actually tries to communicate with them. For example, setting up a meeting calls the HR system to find a manager. (55:13)
Ranjitha: I wouldn’t evaluate each path too strictly because LLMs can accomplish the same goal differently. Tool calls must consult the true source. For example, two ways exist to find a skip level: directly or by traversing an org chart. Both are acceptable. Adding more data sources increases variability. (56:02)
Alexey: Evaluation focuses on whether the goal was achieved, not the exact path. Creating a calendar invite with correct parameters is the assertion. Domain knowledge is necessary. (57:23)
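That philosophy translates naturally into a pytest-style sketch: mock the external systems, run the agent, and assert on the final action's parameters rather than on the tool-call sequence. The `run_agent` under test here is hypothetical:

```python
# Goal-based integration test sketch: mock the external systems and
# assert the final action, not the path. `run_agent` is hypothetical.
from unittest.mock import MagicMock

# from my_agent import run_agent   # hypothetical system under test

def test_schedules_meeting_with_manager():
    calendar = MagicMock()
    hr = MagicMock()
    hr.get_manager.return_value = "alice@example.com"

    run_agent("Book 30 minutes with my manager tomorrow",
              tools={"calendar": calendar, "hr": hr})

    # Assert the goal: exactly one invite, right attendee, right length.
    # We deliberately do NOT assert how many lookups happened or in what
    # order, since the LLM may reach the goal via different tool paths.
    (_, kwargs), = calendar.create_event.call_args_list
    assert "alice@example.com" in kwargs["attendees"]
    assert kwargs["minutes"] == 30
```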
Ranjitha: Making a generic solution hasn’t happened yet because these tasks are very specific. (58:11)
Alexey: We prepared many questions, but we only covered a few today. It was super fun talking to you. You are very knowledgeable, and I learned new things. Time flew by. Thanks a lot for joining us today, and thanks everyone for joining as well. (58:17)
Ranjitha: It's a good question, about replacing SREs with agents. Maybe if I were an SRE, I would say, "We'll get back to you. Watch this space at NeuBird.ai." (59:06)
Alexey: Thanks a lot. We’ll see you soon in our next events. (59:23)