Hugo Bowne-Anderson is Head of Developer Relations at Outerbounds, a company committed to building infrastructure that provides a solid foundation for machine learning applications of all shapes and sizes. He is also host of the industry podcast Vanishing Gradients. Hugo is a data scientist, educator, evangelist, content marketer, and data strategy consultant, with extensive experience at Coiled, a company that makes it simple for organizations to scale their data science seamlessly, and DataCamp, the online education platform for all things data. He also has experience teaching basic to advanced data science topics at institutions such as Yale University and Cold Spring Harbor Laboratory, conferences such as SciPy, PyCon, and ODSC and with organizations such as Data Carpentry. He has developed over 30 courses on the DataCamp platform, impacting over 2 million learners worldwide through his own courses. He also created the weekly data industry podcast DataFramed, which he hosted and produced for 2 years. He is committed to spreading data skills, access to data science tooling, and open source software, both for individuals and the enterprise.
The transcripts are edited for clarity, sometimes with AI. If you notice any incorrect information, let us know.
Alexey: This week we will talk about LLMs and AI like everyone else, I guess. (0.0)
Alexey: The difference between everyone else and us is that we have amazing Hugo joining us. You might have figured it out from our discussion that he is a returning guest and also a teacher. I have a bio that I will read, I guess. Hugo is an independent data and AI consultant with extensive expertise in the tech industry. He has advised and taught teams building AI-powered systems, including engineers from Netflix, Meta, and the US Air Force. That is quite far from Australia, isn’t it? (0.0)
Hugo: You can still teach online these days, thank goodness. You do not need to go to the US to teach them. (45.0)
Alexey: That is good. He is also the host of Vanishing Gradients. I think your background is a hint about that. And High Signal. I have no idea what High Signal is. You will probably tell us. It is a podcast exploring developments in data science. Why do you need two podcasts, by the way? (50.0)
Hugo: One, Vanishing Gradients, is really for builders who want to know what they can learn today to ship and maintain products. High Signal is a lot of conversations with people in leadership. We have had Michael Jordan on High Signal, and Fei-Fei Li, and Chris Wiggins, who runs data science at the New York Times. It is really intended for leaders who are also practitioners but thinking at a higher level. (1:12)
Alexey: Which Michael Jordan are you talking about? (1:40)
Hugo: The Michael Jordan of machine learning. He is from Berkeley originally but lives in Italy now. (1:45)
Alexey: He has a book about machine learning, right? (1:53)
Hugo: Yes, exactly. He has done a lot of stuff. (1:55)
Alexey: Is it called Machine Learning or something else? (1:57)
Hugo: I would need to fact check myself. (1:58)
Alexey: I remember when I was studying machine learning in 2012 or 2013. I thought, “Oh, Michael Jordan, interesting.” Later, after some time, I realized which one you meant. Tell us about your career journey so far. We remember that you were doing DevRel. We also talked about your teaching experience, but can you give us the full picture? (2:04)
Hugo: My background is in basic research in biology, mathematics, and physics. I lived in Dresden, close to you, for a couple of years. There is a Max Planck Institute of Molecular Cell Biology and Genetics there, where I did part of my postdoc over a decade ago. (2:41)
Alexey: Over a decade ago. So that is how you know a bit of German, right? When I was eating, you said “gut” and a bit more. (2:52)
Hugo: Exactly. I realized I needed to analyze large-scale data that my colleagues were generating, so I taught myself Python. I jumped into what were then called IPython notebooks, not Jupyter notebooks, and the growing PyData stack. Pandas and Matplotlib were key tools. At that point pandas did not even have a read_csv function. When it was added, it really started to give us superpowers. (3:07)
Hugo: Then I moved to the US and lived in New York while working at Yale University in New Haven. There were so many data science and machine learning meetups and hackathons in 2014 and 2015 in New York. It was so exciting that I decided to join industry. I met the DataCamp team and joined as the fifth employee to build out the Python curriculum. As an early employee I wore many hats including product, data science, curriculum, and marketing. (3:27)
Hugo: I started a podcast there called DataFramed, which was super fun. During the pandemic I worked at several PyData-adjacent open source tooling companies leading developer relations. Then last year the space became too exciting for me to focus on one project. I went freelance doing consulting, advising, teaching, and developer relations across the board. It is an exciting time, and I get to collaborate and work with many wonderful people such as yourself now. (3:57)
Alexey: That is really cool. Your journey from academia to developer relations is exciting. For me, doing developer relations sounds cool because you get to teach people and receive good money for it. I am originally from Russia, and being a teacher there is not always the best paid profession. It is cool that there is an opportunity for people to educate and also receive good pay from the IT sector. (4:38)
Alexey: Since then you switched focus to consulting. How did that happen? It has been a few years, right? About a year and a half? You were at Outerbounds doing developer relations. That is the orchestrator platform, right? (5:27)
Hugo: Metaflow. (5:45)
Alexey: Metaflow, yes. I remember it was something with “flow.” (5:50)
Hugo: There are too many flows. (5:52)
Alexey: Exactly. I was thinking which one it was. It was Metaflow. You did developer relations there. What happened after that? Did you decide you wanted to consult? (5:53)
Hugo: Exactly. I wanted to consult, do a lot of podcasting and education, and help people ship and build products in a variety of ways. In my consulting I also do a lot of internal training, both on the technical side and the executive side. I teach people how to leverage AI to build software and also advise executives. It is a wild time and there is more work than I can do myself, so it is definitely exciting. (6:08)
Alexey: What kind of work do you do? (6:38)
Hugo: Everything from consulting to advising and teaching, as well as developer relations. (6:44)
Alexey: Consulting, teaching, I am just taking notes to ask you later. Consulting, teaching, developer relations. What was the last thing? (6:55)
Hugo: Advisory work as well. (7:03)
Alexey: What is the difference between advisory and consulting? (7:09)
Hugo: In my consulting work I really help people ship products and build. (7:11)
Alexey: So you basically open VS Code and work with the OpenAI API. You actually code. Consulting is hands on, and advisory is more like, “This is how you do it, now you go do it,” right? (7:15)
Hugo: Exactly, and even advising nontechnical people about how to restructure their organizations around AI tools and similar topics. (7:27)
Alexey: Can you give an example? Why would they need to restructure? For example, to have data engineers helping machine learning people? (7:35)
Hugo: For example, take a nontechnical marketing team. You have three individuals using ChatGPT, each writing their own prompts, and there are no synergistic effects. Even something basic like having a Slack channel where they can share prompts or a version controlled system for sharing prompts that work best can make a difference. Thinking about small technological tools that allow teams to work together with AI systems is key. (7:46)
Hugo: There is interesting research on loss aversion. If you frame not using AI as a potential loss rather than using it as a gain, it can incentivize people more. Another thing that is important is that the most successful organizations adopting AI carve out time and space for people to use it. Instead of expecting efficiency gains on top of full time work, they dedicate half a day a week for employees to experiment with AI. It brings short term loss but medium and long term gains. (8:24)
Alexey: Are there people who still have not tried it? (9:20)
Hugo: A handful, not many. (9:28)
Alexey: Everyone is talking about this. Two years ago someone on our Slack channel shared a screenshot of ChatGPT writing a poem about our machine learning engineering course. It was like, “Wow, a poem about the course, how cool is that.” Now everyone uses it. My mom uses it, though she is pretty advanced, but everyone does. (9:28)
Hugo: My mom uses it, my dad has not. He knows about it and has seen screenshots. Many people who use it think it is just for conversations. They do not realize they can use it for transcript summarization, translation, content generation, idea generation, or filtering ideas. You can upload a document and have it summarized, although sometimes it pretends to summarize without really doing it. (10:04)
Alexey: That is how I use it now, not necessarily for summarization, but for simple data processing with Excel or CSV files. I just upload and ask ChatGPT to do this or that for me and plot a graph. I can do it myself, but it saves so much time. I do not even need to write code anymore for these simple things. That is cool. (10:46)
Hugo: Exactly. Many people do not realize you can generate CSV files or PDFs with it. Helping people understand how to prompt is also key. If you give it a role and an objective, like “You are a chief marketing officer writing this campaign,” and add examples and heuristics, it becomes incredibly useful. (11:11)
Alexey: Let us say I want to create timestamps for YouTube. I guess you also do this. What is the most effective prompt for that? I have a transcript, we edit it slightly, and YouTube generates subtitles. I copy them to ChatGPT and say, “Give me YouTube time codes.” I specify the format, time code and chapter name, and that is it. How can I improve this process? (11:45)
Hugo: I use Gemini because I do it programmatically as well. Gemini can get data directly from YouTube, and ChatGPT cannot because Google has blocked it from transcribing. (12:22)
Alexey: That is why I copy and paste things. (12:42)
Hugo: Gemini takes it straight in. Tell it not to hallucinate timestamps and to double check its work. If it is long, ask it to chunk the content. For videos I avoid all that. I use Descript. It generates excellent timestamps for me. If it is not broken, do not fix it. (12:45)
Alexey: I use Loom for recording videos, and the prompt they have for timestamps needs improvement. If you were consulting a company like Loom, where they need to create timestamps, how would you suggest they approach it? (13:14)
Hugo: We would look at the prompt and iterate on it several times. For this kind of prompting you would say, “Have an eye for detail.” You would also want timestamps relevant to your audience. For this podcast, maybe data science people, machine learning engineers, and AI engineers. The timestamps need to attract them. (13:33)
Hugo: You would also want a subject matter expert in the loop to evaluate the results. Whoever used to write timestamps should be involved. Another effective approach is the evaluator optimizer pattern. You can have one model generate timestamps and another evaluate them, scoring the output and sending it back if it fails. (13:56)
Alexey: How does it work? You have two models or agents, one evaluator and one optimizer. The evaluator looks at the output and gives it a score from zero to ten. (14:49)
Hugo: You can just give it zero or one, pass or fail, and feedback. (15:10)
Alexey: So the evaluator should have multiple criteria, right? Say ten criteria, each pass or fail. (15:17)
Hugo: In the end you just want pass or fail, because either it goes to production or not. (15:22)
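As a rough sketch of the evaluator-optimizer loop Hugo describes, here is what the pattern can look like in Python, with a hypothetical call_llm helper standing in for whatever model client you actually use; the prompts and heuristics are illustrative, not taken from any real system.

```python
# Minimal sketch of the evaluator-optimizer pattern: one prompt generates
# timestamps, another judges them pass/fail and sends feedback back.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (OpenAI, Gemini, etc.)."""
    return "00:00 Intro\n05:12 Career journey"  # canned output for illustration

GENERATOR_PROMPT = "Generate YouTube timestamps for this transcript:\n\n{transcript}"
JUDGE_PROMPT = (
    "You are reviewing YouTube timestamps. Reply PASS if they are in "
    "'MM:SS Title' format and at least five minutes apart, otherwise reply "
    "FAIL followed by one sentence of feedback.\n\nTimestamps:\n{candidate}"
)

def generate_timestamps(transcript: str, max_rounds: int = 3) -> str:
    feedback = ""
    for _ in range(max_rounds):
        candidate = call_llm(GENERATOR_PROMPT.format(transcript=transcript) + feedback)
        verdict = call_llm(JUDGE_PROMPT.format(candidate=candidate))
        if verdict.strip().upper().startswith("PASS"):
            return candidate  # evaluator says it is good enough to ship
        feedback = f"\n\nPrevious attempt failed review: {verdict}"  # loop with feedback
    return candidate  # fall back to the last attempt
```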
Alexey: I see. With timestamps it is tricky because there are many correct answers. There are so many ways to split videos into chapters. How do you know which one is better? (15:32)
Hugo: You want it aligned with your judgment, so that when Alexey sees it he says, “I like that.” You can give it examples and heuristics. For example, Alexey may dislike timestamps ten seconds apart. I would hate that. You can give it a heuristic, like if it is an hour long there should be only ten to fifteen timestamps. (15:58)
Hugo: That is why prompt iteration, or prompt engineering, is important. It tries something, you review, then adjust. Maybe it starts using emojis, then you add “Do not use emojis.” After fifteen or twenty prompts you will have something that looks good for most videos. (16:36)
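For reference, one possible shape for such a timestamp prompt after a few rounds of iteration, folding in the heuristics mentioned above (audience, chapter density, no emojis); the exact wording is only illustrative.

```python
# A hypothetical timestamp prompt template; adjust the heuristics to taste.
TIMESTAMP_PROMPT = """You are an editor creating YouTube chapters for a podcast
aimed at data scientists and machine learning engineers.

Rules:
- Output one chapter per line in the format "MM:SS Chapter title".
- For a one-hour episode, produce 10 to 15 chapters, at least five minutes apart.
- Titles should describe the topic discussed, not the speaker.
- Do not use emojis.
- Do not invent timestamps; only use times that appear in the transcript.

Transcript:
{transcript}
"""
```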
Alexey: That is how I do it. I open ChatGPT, copy the transcript, play with the prompt until I like the result, and hopefully save the prompt, though usually I forget and start again next time. I check the output, copy the chapters to YouTube, and my job is done. (17:02)
Alexey: If we have hundreds of transcripts, optimizing for one may worsen others, and I cannot check all one hundred. That is why we need an evaluator. I want correct transcripts and timestamps at least five minutes apart. (17:38)
Hugo: This is amazing because we are hitting on what it is like to build and iterate on AI powered software. First, you look at your data and see what works and what does not, then iterate. You are iterating on a prompt. If you are building retrieval augmented generation, maybe you iterate on chunking or embeddings. (18:10)
Hugo: If you are building agentic systems, maybe you iterate on tool call descriptions. You look at the data to see what happens next. The way you describe it, you iterate and then throw the prompt away and start again. You also mentioned wanting it to work across one hundred things. (18:50)
Alexey: If I build a product, one thing is doing it once and throwing it away. Another is keeping the prompt, reusing it, and giving it to my assistant so she can use it. That way I can be sure the transcript quality stays good because we have tested it. (19:12)
Hugo: Exactly. You do not want to overfit to the transcripts you have already tested. If your new transcript is similar, like another machine learning podcast, it will likely perform well. If it is something different, like carpentry, maybe adjust. But if you iterate enough and it performs well across a diverse set, it will generalize and save time. (19:40)
Hugo: We are building a pipeline that takes in transcripts and uses GitHub Actions to generate timestamps and other outputs, automating what I used to do by hand. (20:42)
Alexey: Yeah, that is what we do too. But that GitHub Actions workflow is only for processing transcripts. Still, it is cool because it is free, right? (21:01)
Hugo: Yeah. Exactly. (21:14)
Alexey: How do you iterate 500 times on a prompt? You said some people iterate a few times, some a few hundred, and some a few thousand. For me, the way I come up with prompts is that I first type them myself. Then I add more and more details, and at some point, I ask ChatGPT, “Hey, this is the prompt I have. It cannot do this or that,” and I describe the cases it cannot handle. (21:15)
Alexey: The GPT-5 version of the model is pretty good at creating very specific prompts that usually solve my problems. Then I have this prompt, and it kind of works. Maybe I do ten or fifteen iterations at most. But how do I go from that to 500? Is it humanly possible? (21:47)
Hugo: Yeah. It can take weeks. But that is for prompts that ship in SaaS software to tens or hundreds of thousands of people. I also use LLMs to write prompts, but prompts usually perform better when written or at least edited by humans. (22:06)
Hugo: As I said, LLMs will insert emojis into prompts all the time. There is no reason for that. (22:31)
Alexey: I hate formatting when it adds random stuff. It just increases the cost of my prompts. (22:44)
Hugo: Exactly. Maybe that is why they are doing it. Who knows? Demand generation. (22:49)
Alexey: If you want to iterate 500 times on a prompt, you need to have a proper evaluation set, right? (23:00)
Hugo: That is a good question. Not necessarily at the beginning, but in the end, yes, especially when shipping reliable and consistent software. A lot of the time, you can just vibe check it. They call it vibe checking when you look at the result and think, the date format is not correct, so I will adjust the prompt or use structured outputs to fix it. (23:05)
Hugo: A lot of things you can eyeball. In the end, though, you definitely want some kind of gold test set. Just as in machine learning, we have a hold-out set and a test set. The premise is the same. The practice is a bit different because of rich natural language and tool calls. (23:35)
Alexey: How large should the evaluation data set or gold test set be? The problem is that unlike traditional machine learning, it might take a lot longer to evaluate. Even with deep learning, maybe it takes half a minute, but with prompt evaluation, it takes much longer. How large should the data set be? (23:56)
Alexey: We want the data set to be large, but if it is too large, it costs money and takes time. So we want it to be as small as possible but not too small. Otherwise, we might overfit to just a few examples. (24:18)
Hugo: It really depends on what you are working on. You want it to be representative of what user interactions will be like. About the cost, I totally agree, but you do not need to use an LLM judge for everything. An LLM judge is just another prompt you use to judge an output. (24:39)
Hugo: You can test whether it produces structured output or use regular expressions or string matching, depending on your use case. You can lower costs by using cheaper models or flash models instead of the most powerful ones. The key is that the test set should be representative. (24:59)
Hugo: When you start looking at your data in spreadsheets, you start to see patterns emerge. You can see failure modes and get a sense of how much you need to collect and how big the test set should be. (25:25)
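A minimal sketch of the cheaper programmatic checks Hugo mentions, using regular expressions and string heuristics instead of an LLM judge; the thresholds here are made up for illustration.

```python
import re

# Cheap, non-LLM checks: format assertions you can run over every output
# before reaching for an LLM judge.
TIMESTAMP_LINE = re.compile(r"^\d{1,2}:\d{2}(:\d{2})? .+")

def check_output(chapters: str, video_minutes: int) -> dict:
    lines = [line for line in chapters.splitlines() if line.strip()]
    return {
        "format_ok": all(TIMESTAMP_LINE.match(line) for line in lines),  # regex check
        "no_emojis": all(line.isascii() for line in lines),              # crude emoji filter
        "count_ok": len(lines) <= max(3, video_minutes // 4),            # density heuristic
    }

print(check_output("00:00 Intro\n05:30 Career journey", video_minutes=60))
# {'format_ok': True, 'no_emojis': True, 'count_ok': True}
```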
Alexey: The reason I am trying to pull as much information from you as possible is because for my course that I am doing now, it is the evaluations week. I am thinking about what else I should include there. I already included the usual testing like format adherence and whether references are included. (25:57)
Alexey: There are also integration tests where we run a set of data and see the output. Then there is the performance test, like out of 100 questions, 90 percent got relevant responses. (26:25)
Hugo: You use this process to guide your development as well. This is something I teach a lot in my work and courses. You should do failure analysis and rank order your failures. Formatting issues might be more obvious but if they are only 10 percent of the issues, and most are retrieval errors, focus on retrieval. (26:43)
Hugo: Doing some failure analysis, categorizing errors in a spreadsheet, then doing a pivot table to rank order them will help. You will see that most errors come from retrieval, and that is where you should focus, not on the generative part. (27:20)
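A small pandas sketch of that failure-analysis step: label each failing trace with a category, then pivot and rank-order. The categories and counts below are invented for illustration.

```python
import pandas as pd

# Label each failing trace with an error category, then rank-order categories.
failures = pd.DataFrame({
    "trace_id": [1, 2, 3, 4, 5, 6],
    "category": ["retrieval", "retrieval", "formatting",
                 "retrieval", "hallucination", "retrieval"],
})

counts = (failures.pivot_table(index="category", values="trace_id", aggfunc="count")
                  .rename(columns={"trace_id": "n_failures"})
                  .sort_values("n_failures", ascending=False))
print(counts)
# Retrieval dominates in this toy data, so that is where to focus first.
```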
Alexey: Makes sense. Do you use any software tools for evaluation and monitoring? (27:30)
Alexey: We have many monitoring and evaluation tools to help with this process. I output things to pandas, copy them to Google Sheets, and analyze there. But there are special tools. Have you tried any of them? Do you like any? (27:38)
Hugo: I like most of them, but my favorite tool besides spreadsheets is vibe coding. Let’s say I have an email assistant I am building. Then I can vibe code what it looks like to interact with the email assistant and look at all the traces and function calls there. (28:02)
Hugo: That can be incredibly useful. Of course, use tools that make your job easier, whether it be Logfire, Braintrust, or Arize Phoenix. These are all wonderful tools. Generative AI is one of the most civilization-changing technologies, bigger than the internet in many ways. (28:39)
Hugo: It will take years for us to see the full effect. It is a horizontal technology with people building application layers on top of it in healthcare, finance, and education. People are also building tools to help others use it, but they cannot satisfy everyone. (29:03)
Hugo: Think about it. React came out more than ten years after the internet. The tools that will really help builders in the future may not even exist yet. The current tools are fantastic, but for the entire application layer we still need more. (29:22)
Alexey: When you vibe code these things, you said you include traces and function calls. So basically you create a React app or Streamlit app that looks like the end product but add debugging tools to see what the function calls are, right? (29:55)
Hugo: That is exactly it. Whenever you build an MVP, build logging into it immediately. Log everything and then you can vibe code. (30:12)
Hugo: When vibe coding, chat with it first before building. Give it your database schema and as much information as possible. AI assistants are like very bright interns who sometimes forget things you want them to do. (30:24)
Hugo: My friend John Berryman, who worked on Copilot at GitHub, always says to have empathy for your AI assistant. Understand its strengths and weaknesses. Have a conversation with it first, develop a short requirements document, and remind it of the database schema. (30:50)
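To make the logging advice above concrete, here is a minimal sketch that wraps a hypothetical call_llm helper so every call lands in a JSONL trace file you can later inspect or load into a spreadsheet.

```python
import json, logging, time

logging.basicConfig(filename="traces.jsonl", level=logging.INFO, format="%(message)s")

def call_llm(prompt: str) -> str:
    """Placeholder for a real model or tool call."""
    return "stub response"

def logged_call(prompt: str, **metadata) -> str:
    # Wrap every model/tool call so each trace is written as one JSON line.
    start = time.time()
    response = call_llm(prompt)
    logging.info(json.dumps({
        "ts": start,
        "latency_s": round(time.time() - start, 3),
        "prompt": prompt,
        "response": response,
        **metadata,
    }))
    return response

logged_call("Summarize this transcript...", step="summarize", model="your-model-here")
```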
Alexey: Yeah. Sometimes it is stubborn or refuses to do things. I ask it to edit a file and it refuses, so I do it myself, and then it works. Sometimes it is amazing. I ask it to edit a file and it turns a Jupyter notebook into a clean markdown file. (31:22)
Hugo: Totally. We talk about hallucinations a lot, but not enough about forgetting. State of the art tool calling by an LLM as an agent is about 90 percent accurate. If you have six or seven tool calls in a row, it will only work about half of the time. You just need to understand what you are working with. (31:56)
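Hugo's reliability point is just compounding probabilities; a quick check:

```python
# If each tool call succeeds ~90% of the time, a chain of six or seven calls
# succeeds only about half the time.
for n in (6, 7):
    print(n, round(0.9 ** n, 2))
# 6 0.53
# 7 0.48
```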
Alexey: What kind of assistants do you use? My current favorite is GitHub Copilot. I have tried many tools. I use it because I have an open source license, so it is free for me. (32:13)
Alexey: I do not need to pay twenty dollars for Cursor. But I am curious what your favorite one is right now. I know it might change in a week. (32:32)
Hugo: I use Cursor and have for some time. I have used Claude Code, Copilot, Windsurf, and played with AMP. Cursor does everything I need and does it well. (32:38)
Hugo: It is mostly inertia that keeps me from switching. I am doing so many things that I have no strong reason to change now. But I should make time to experiment with new tools. (33:03)
Hugo: I am also excited about having these tools in normal interfaces. Some people use Cursor and Devin in Slack. You can be in Slack and say, “This documentation is wrong, update it.” (33:14)
Hugo: Bringing agentic systems into our normal environments will become common. Manus has an email assistant where you can tag it in an email to update documentation or perform tasks. Having these assistants around more often will become normal. (33:35)
Alexey: Slack. (33:59)
Hugo: What is that? (34:00)
Alexey: How can you use Cursor from Slack? (34:01)
Hugo: I think there is a Slack integration. I would need to check, but I am pretty sure. (34:02)
Alexey: It is like a Visual Studio Code clone, right? They have a CLI application. That is what I like about GitHub Copilot. (34:06)
Alexey: They have strong GitHub integration. I can create an issue, assign it to GitHub Copilot, and in half an hour have a working result. With Cursor, I cannot do that or do not know how. (34:18)
Alexey: But now you say there is Slack integration, which means I can write, “Hey Cursor, can you fix issue number 124?” and it will pull the issue and solve it, right? (34:36)
Hugo: Exactly. That is cool. I think they do have it. I have not used it much. It might not be as advanced as Claude Code yet. (34:53)
Hugo: A year or two ago, we were copying and pasting between ChatGPT and our IDE. Then we got code completion. Now we have agents in IDEs and terminals. (35:13)
Alexey: Coding was also not very good then. GPT-3.5 was not great at coding. (35:19)
Hugo: Yeah. And it did not even know its own API. Then we got code completion. Now we have agents in IDEs and proactive AI doing code reviews and continuous integration. (35:27)
Hugo: We are seeing more background agents working quietly. I am excited about proactive AI that tells you when something happens in production or organizes your work schedule. (35:45)
Alexey: So we are at the background agents level now, right? I can create an issue in GitHub, assign it to Copilot, and half an hour later there is a pull request ready. But what if Copilot emailed me saying, “Hey, you could add this feature. Do you want me to work on that?” That would be amazing. (36:24)
Hugo: Proactive agents and multiplayer agents are the next step. When multiple people ping the same one, it gets confused. We might need to fine tune models for multiplayer conversations. (37:13)
Hugo: You could imagine several people talking to an agent in Slack, and it acting as a team member. Once we solve that, having agents as embedded team members will be normal. Proactive and multiplayer agents will be a big part of the future. (37:43)
Hugo: If things changed this much in three years, I am curious to see what will happen in the next three. It might slow down over time, but still be exciting. (38:09)
Hugo: Look at the evolution of the internet. It grew fast, then slowed down. Then companies like Google and YouTube emerged, connecting and indexing the internet. Similar products will emerge for AI. (38:28)
Alexey: There is this browser from Perplexity. You have heard of it, right? (39:03)
Hugo: Yeah. Comet. (39:10)
Alexey: Sometimes I get an email from my tax advisor saying I need to prepare some documents. When it comes to taxes, I am such a procrastinator. You can probably relate. (39:16)
Hugo: I cannot even talk about it right now. (39:32)
Alexey: Exactly. She sent the email a month ago and I know I need to do it, but it takes time. It is manual work: downloading documents, saving them in folders, and sending them back. I would love to wake up one day and have an agent do that for me. (39:32)
Hugo: Absolutely. (40:06)
Alexey: That would be amazing because I think Comet is one step toward that. (40:07)
Hugo: Definitely a provocative statement, but I do not think the future of AI happens in chat. We will still chat and do that kind of stuff, but the amount of value a conversation can generate is limited by human time, and human time is scarce. The real value comes from generating documents, taking actions, sending emails, and other agentic capabilities. That is where over ninety-nine percent of the economic value will be delivered. (40:12)
Alexey: Let’s see. So what else do you consult about? What kind of applications do you build or help your customers build? (40:47)
Hugo: The main focus is retrieval, though some agentic work as well. It often comes down to setting expectations. Let me share an example that combines several clients. Many teams get stuck in what I call “proof of concept purgatory.” That is where great ideas exist but they do not solve any real problems. (40:58)
Hugo: For example, an edtech company wanted to build an all-purpose AI tutor that could teach anyone anything. It sounded great, but we realized that while models can appear to do that, they cannot do it consistently or reliably. (41:29)
Alexey: I have used it to learn many things. (42:02)
Hugo: Yes, and you are right. For topics it has been trained on, it usually performs well. (42:10)
Alexey: Right. But of course if I go beyond common human knowledge or into research-level topics, it struggles. I learned basics of chemistry, electronics, and even German using ChatGPT. But at a PhD or advanced research level, it cannot yet do original work or independent research. (42:15)
Hugo: They can search the internet using tools, but not always reliably. Sometimes they say they searched when they actually did not. Even at a college level, for subjects like algebraic representation theory, they can hallucinate. They also make mistakes with basic geometry like Platonic solids, sometimes inventing new shapes that do not exist. So I do not think ChatGPT is ready to be an all-purpose tutor yet. (42:41)
Alexey: So you mean it struggles with visual or structured reasoning tasks, right? (43:39)
Hugo: Yes, and even arithmetic. There are many failure modes. That is why I would not trust it as a general tutor. (43:46)
Hugo: Anyway, that edtech company wanted a flashy AI tutor, but when we looked into their customer support tickets, we found that twenty percent were simply questions like “Which class is this lesson in?” or “Where can I learn about this?” (43:57)
Hugo: So instead of building a moonshot tutor, if they just built a simple RAG bot with good chunking, embeddings, and a nice interface, they could solve one in five support tickets immediately. It is less flashy, but delivers real business value fast. Helping people see that is important. (44:26)
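A minimal sketch of that kind of simple RAG bot: embed the support content, retrieve the closest chunks, and hand them to the model. The embed and call_llm functions below are placeholders, not a real embedding model or client.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a real embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    return "stub answer"

docs = ["Lesson 3 is part of the Intro to Python class.",
        "Refunds are handled by support@example.com."]
doc_vecs = np.stack([embed(d) for d in docs])

def answer(question: str, k: int = 1) -> str:
    q = embed(question)
    # Cosine similarity between the question and every document chunk.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = "\n".join(docs[i] for i in np.argsort(-sims)[:k])  # top-k chunks
    return call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")

print(answer("Which class is lesson 3 in?"))
```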
Alexey: Is chunking a solved problem, or do we still not know how to do it properly? (44:56)
Hugo: Generally, you can do it relatively well, but it depends on the data. For example, with a transcript like this one, you could chunk by question and answer pairs or by speaker turns. If there were five people talking, you might chunk by topic instead. It depends entirely on the structure of the content. (45:01)
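A small sketch of the transcript chunking Hugo describes, splitting on speaker turns and grouping each question-and-answer pair into one chunk.

```python
# Split a transcript on speaker turns, then group each Q&A pair into one chunk.
transcript = """Alexey: How large should the gold test set be?
Hugo: It depends on the data; it should be representative.
Alexey: Is chunking a solved problem?
Hugo: Generally you can do it well, but it depends on the data."""

turns = [line for line in transcript.splitlines() if line.strip()]
qa_chunks = ["\n".join(turns[i:i + 2]) for i in range(0, len(turns), 2)]

for chunk in qa_chunks:
    print(chunk, "\n---")
```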
Alexey: So there is no one-size-fits-all. For a podcast, it makes sense to chunk by Q&A pairs, but for a book like Game of Thrones you would use a different approach. (45:49)
Hugo: Exactly. Also look at the data you get. Zoom, for example, gives you two transcripts: closed captions without names, and a full transcript with speaker names. The second one is much richer and more useful for chunking. (46:08)
Alexey: Funny enough, ChatGPT can often infer who is speaking even without names. (46:32)
Hugo: True. But sometimes you do not even need to chunk. Context windows are getting large, though that comes with context rot. Jeff Huber and his team at Chroma wrote a great essay about context rot: how giving too much context can reduce precision and relevance. (46:39)
Alexey: Yes, I notice that with long transcripts. At the start it reformats things nicely, but by the end it gets sloppy. (47:10)
Hugo: Exactly. In prompt engineering, if something is really important, say it at the start of your prompt and repeat it at the end. You will often get better results. (47:26)
Alexey: Yes, when I ask it to improve my prompt, it often repeats my intent at the end. That seems to help it follow instructions better. (47:40)
Alexey: So when it comes to chunking, which approach did you use for that tutor? (48:07)
Hugo: That project was a mix of several experiences. But the main rule is: study your data. If you have 30-minute instructional videos, maybe chunk every five minutes or wherever topics shift. (48:20)
Alexey: Would you still use character-based chunking or rely on section splits? (48:50)
Hugo: I would start with fixed character-length chunks and refine from there. (48:57)
Alexey: It seems to work well with a sliding window approach. More complex methods often overcomplicate things. I wonder what the industry consensus is. (49:02)
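For comparison, a minimal sketch of the fixed character-length, sliding-window baseline being described here; the chunk size and overlap are arbitrary starting points.

```python
def sliding_window_chunks(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed character-length chunks with overlap, as a first-pass baseline."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = sliding_window_chunks("some long document " * 200, size=500, overlap=100)
print(len(chunks), len(chunks[0]))
# 10 500
```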
Hugo: You are right. Also, RAG struggles with certain questions like “What is this whole video about?” because it looks for individual chunks. That is where agents and tool calls can help. For example, a summarization tool can answer broad questions that a RAG system cannot. (49:21)
Hugo: Even a few simple sub-agents can make it much more powerful and natural for the kinds of questions humans ask. (49:55)
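A rough sketch of that idea: a tiny router decides whether a question is about a specific detail (retrieve chunks) or about the whole corpus (summarize everything). All three helpers below are hypothetical stand-ins for your own components.

```python
def call_llm(prompt: str) -> str:
    """Placeholder router call; a real model would classify the question."""
    return "BROAD"

def rag_answer(question: str) -> str:
    return "answer from retrieved chunks"

def summarize_corpus() -> str:
    return "summary of the whole video"

def answer(question: str) -> str:
    route = call_llm(
        "Reply BROAD if this question is about the whole document, "
        f"otherwise reply SPECIFIC.\n\nQuestion: {question}"
    )
    # Broad questions go to the summarization tool; specific ones go to RAG.
    return summarize_corpus() if "BROAD" in route.upper() else rag_answer(question)

print(answer("What is this whole video about?"))
```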
Alexey: But the moment we go from RAG to agents, system complexity increases a lot. We must define tool calls carefully, repeat instructions to ensure correct use, and handle more errors. While RAG follows a fixed path and is predictable. So when should someone actually consider using agents? (50:19)
Hugo: If your system works well, there is no need to add complexity. But if users start asking questions like “Tell me about the entire corpus,” then you might add a summarization tool. Tool calls increase complexity but also power and scope. Introduce them mindfully and always evaluate them using tools like Braintrust, Arize, or Logfire. (51:10)
Hugo: You can also vibe code some things to see what is really happening before making things too complex. (52:06)
Alexey: Braintrust? I have not heard about that one. (52:17)
Hugo: Yes, it is braintrust.dev. There are two with similar names, so look for “Braintrust LLM.” (52:23)
Alexey: Got it. I am checking now. It is not open source, right? (52:30)
Hugo: They do have a GitHub, actually. (52:38)
Alexey: Okay, I will check it out. These tools appear faster than mushrooms in Germany in October; it is hard to keep track. (52:44)
Hugo: Yes, and many of them combine open source and commercial parts. (53:01)
Alexey: Like Arize with Phoenix, right? Okay, we should wrap up. Maybe share some advice for people starting with agents. What should they focus on first? (53:09)
Hugo: A few things. First, build something meaningful to you. For example, if you are overwhelmed by emails, connect to the Gmail API and start clustering or classifying emails by priority. Use some basic machine learning and iterate. (53:34)
Alexey: You connected it to Gmail through the API? (54:03)
Hugo: Yes, absolutely. Cluster and classify your emails, then maybe build a model that prioritizes them. You could even connect another LLM to draft suggested responses for you to review and send. (54:03)
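A toy version of that email-triage starting point, using scikit-learn to classify a handful of hand-labeled emails by priority; in practice you would pull real messages via the Gmail API and label your own data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Label a few of your own emails by priority, train a small classifier,
# then score incoming mail. The examples below are invented.
emails = [
    "Invoice overdue, please pay by Friday",
    "Your tax documents are ready for review",
    "Weekly newsletter: top 10 ML papers",
    "20% off all courses this weekend only",
]
labels = ["high", "high", "low", "low"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(emails, labels)

print(model.predict(["Reminder: documents needed for your tax advisor"]))
```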
Alexey: That is exactly what I want. A system that looks at unread emails and suggests two or three replies. (54:31)
Hugo: Fantastic. You should build that. (54:37)
Alexey: Yes, because I procrastinate on so many emails especially when I read them on my phone. If a system generated responses I could just approve, I would probably reply to half of them immediately. (54:44)
Hugo: Exactly. (55:08)
Alexey: If anyone is building this, please reach out to me. (55:09)
Hugo: Totally. And to your point, start building things that solve your own problems or your company’s problems. That is the best way to learn. (55:14)
Hugo: In short, four steps: (56:21)
Hugo: First, find a meaningful problem, like your email example. (56:21)
Hugo: Second, start small: use an LLM and add tools or memory only as needed. (56:21)
Hugo: Third, look at your data carefully. (56:21)
Hugo: Fourth, build a basic evaluation set to guide iteration. (56:21)
Hugo: Often the issues are not embeddings or chunking but OCR or data ingestion. Also, if you want to go deeper, I teach a course on building AI applications for data scientists and engineers. I can offer your community a 20% discount. (56:53)
Alexey: Great. One quick last question when you mentioned adding memory, how do people usually do that? I have only used retrieval-based approaches so far. (57:18)
Hugo: Good question. Many people handle memory poorly. GPT-5 itself has changed how it handles memory; long conversations now work differently. You should first ask if you even need memory. Many systems are single-turn and do not require it at all. (57:41)
Hugo: Memory becomes important once you need multi-turn conversations. (58:59)
Alexey: Right, like in the email assistant idea it would need to remember how I responded before. That could be retrieval-based, though. (59:05)
Hugo: Exactly. In that case, retrieval is your memory. But when I say memory, I mean within an active conversation like when you tell a travel assistant “Find me a hotel,” and later ask, “Which is the cheapest?” It should remember that context. (59:28)
Alexey: I see. I was thinking of memories across conversations. (59:57)
Hugo: That is another challenge entirely. For short conversations, you can just include the entire chat history as input. For longer ones, you might use a sliding window or a separate agent to manage memory. It decides what information is relevant to keep. Evaluating multi-turn conversations is also tricky, so define the desired conversation length early on. (59:59)
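A minimal sketch of the simplest in-conversation memory Hugo mentions: keep recent turns in a sliding window and send them along with each request. call_llm is again a hypothetical stand-in.

```python
def call_llm(messages: list[dict]) -> str:
    """Placeholder for a real chat-completion call."""
    return "stub reply"

class Conversation:
    def __init__(self, system_prompt: str, max_turns: int = 10):
        self.system = {"role": "system", "content": system_prompt}
        self.history: list[dict] = []
        self.max_turns = max_turns

    def ask(self, user_message: str) -> str:
        self.history.append({"role": "user", "content": user_message})
        window = self.history[-2 * self.max_turns:]  # sliding window of recent turns
        reply = call_llm([self.system] + window)
        self.history.append({"role": "assistant", "content": reply})
        return reply

chat = Conversation("You are a travel assistant.")
chat.ask("Find me a hotel in Lisbon.")
print(chat.ask("Which one is the cheapest?"))  # still sees the hotel context
```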
Alexey: Yes, it makes sense. I still have many questions, but we should wrap up. (1:00:55)
Hugo: We should chat again soon. Maybe we can run a workshop or teach something together. That would be fun. (1:01:01)
Alexey: That could be. My son wants attention now. (1:01:13)
Hugo: Then you should definitely go. (1:01:16)
Alexey: Thanks a lot, Hugo. Let’s catch up again soon, no need to wait another three years. (1:01:19)
Hugo: Exactly. Thanks everyone for watching and listening, and thanks Alexey and everyone at DataTalks as well. (1:01:25)
Alexey: All right. Ciao. (1:01:30)