Agent Engineering

Agent engineering across workflow design, tools, retrieval, evaluation, guardrails, and production constraints.

Related Wiki Pages

AI Engineer Role
AI Engineering Roadmap
LLM Production Patterns
Retrieval-Augmented Generation
LLM Evaluation Workflows
Multi-Agent Systems
AI Red Teaming
Responsible AI and Governance
Tools

Agent engineering is the practice of building AI systems that can pursue a goal and act inside a workflow. It extends prompt engineering with orchestration, tool use, retrieval, and memory. It also adds evaluation and production constraints. The cited examples include on-call assistants, email assistants, and coding agents. They also include enterprise search assistants, multi-agent support systems, and workflow automation.^[1]

Agent engineering sits next to AI Engineering and LLM Production Patterns, with Retrieval-Augmented Generation, LLM Evaluation Workflows, and Multi-Agent Systems nearby. It’s narrower than AI engineering as a role, but broader than prompt writing, because an agent can select steps, call tools, and keep task state.^[2]

Workflow Boundary

An agent is an LLM-backed system organized around an objective rather than only a single response. It can choose the next step in a task through orchestration and tool use. It may also use memory or knowledge stores. This differs from reinforcement learning, where an agent learns a policy from rewards inside an environment or simulator.^[1] ^[3]

The boundary with RAG matters because retrieval-only systems stop after they fetch context and generate an answer. Agentic systems add planning and tool calls. Retrieval can be one step inside a larger task.^[1]
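The boundary can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the cited sources: `retrieve`, `generate`, `fake_policy`, and the `open_ticket` tool are hypothetical stand-ins for a retriever, an LLM call, a model-driven planner, and a real action.

```python
from typing import Callable

# Hypothetical knowledge store; a real system would query a retriever.
DOCS = {
    "refund": "Refunds are issued within 5 days.",
    "shipping": "Orders ship in 2 days.",
}

def retrieve(query: str) -> list[str]:
    """Return documents whose key appears in the query."""
    return [text for key, text in DOCS.items() if key in query.lower()]

def generate(query: str, context: list[str]) -> str:
    """Placeholder for an LLM call: it only echoes the supplied context."""
    return " ".join(context) or "No context found."

def rag_answer(query: str) -> str:
    """Retrieval-only: fetch context, generate once, stop."""
    return generate(query, retrieve(query))

def fake_policy(state: dict) -> str:
    """Hypothetical planner: pick the next step from the task state."""
    if "context" not in state:
        return "retrieve"
    if "ticket_id" not in state:
        return "open_ticket"
    return "finish"

def agent_run(query: str, max_steps: int = 5) -> dict:
    """Agentic: retrieval is one step inside a loop that can also act."""
    state: dict = {"query": query, "steps": []}
    tools: dict[str, Callable[[dict], None]] = {
        "retrieve": lambda s: s.update(context=retrieve(s["query"])),
        "open_ticket": lambda s: s.update(ticket_id="T-1"),  # simulated action
    }
    for _ in range(max_steps):
        step = fake_policy(state)
        state["steps"].append(step)
        if step == "finish":
            break
        tools[step](state)
    return state
```

`rag_answer` always makes one retrieval and one generation. `agent_run` keeps task state and decides at each step whether to retrieve, act, or stop.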

Teams often start with RAG.^[4] They add tools and agent behavior when the task needs actions or durable memory. The LLM and RAG Production Roadmap turns that sequence into a production path. Later stages cover RAG rollout, evaluation, agent behavior, and operations.

Hugo Bowne-Anderson’s framing keeps that boundary practical. Start with a specific problem and try the smallest RAG or LLM workflow that can help. Add tools only when retrieval can’t answer.

Tool calls fit tasks that need an API call, current state, or an action.^[5] Broad questions can also force the boundary. A RAG retriever may find relevant chunks, but a summarization tool or sub-agent can still be better for a whole document, inbox, or course.

That extra power comes with more tool descriptions, tests, and traces.

Design Constraints

The shared definition doesn’t force one architecture because each setting has a different first constraint.

Operational agents start from integrations. An on-call agent needs logs and metrics before it can help with real incidents. It also needs permissioned tools and remediation options.

Ranjitha Kulkarni grounds this in Noird.ai’s on-call work. The agents reason over logs and metrics before suggesting or taking remediation steps.^[6] ^[7]

The engineering challenge is making the agent behave like a constrained operator inside the incident workflow, not like a general chatbot with log access.

Teams adopting agents start from a narrow problem and keep the first version small enough to evaluate.

The email assistant starts with Gmail API access and RAG. The team chooses that boundary.^[4] For the personal-productivity version of that boundary, see AI tools for personal productivity.

The four-step agent frame names the constraint. Define the problem, start small, make the data available, and decide how the team will evaluate the result. Without those four pieces, an agent can look impressive in chat while remaining hard to test or improve.^[8] The same four pieces give AI Engineering Portfolios concrete evidence for agent projects.

Coordination style affects debugging, so teams decompose agents. Sequential flows are easier to review than manager-agent orchestration or direct collaboration. Lanham’s game-AI lineage behind that taxonomy is covered in Game AI to LLM Agents.^[3]

Enterprise teams start from governance. They need specialized models, guardrails, and data lineage before broad autonomy is safe. Multi-tenant evaluation, human-label alignment, and deployment risk matter too.^[9]

Agent Design

Agent design starts with the task boundary. A useful agent needs a concrete job, not a vague instruction to “be helpful.” Planning can be single-step, multi-pass, or self-reflective. Whichever style it uses, the system still needs limits on which tools it can call and when it should stop.^[10] As the plan becomes dynamic, evaluation has to cover both final outcomes and tool-use behavior.
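One way to make those limits explicit is a tool allowlist and a step budget checked before every call. The sketch below is illustrative only. `RunLimits`, `RunResult`, and `run_plan` are hypothetical names, and the plan is a fixed list where a real agent would produce it dynamically.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class RunLimits:
    allowed_tools: frozenset[str]  # tools this agent may call
    max_steps: int = 4             # hard stop for the run

@dataclass
class RunResult:
    executed: list[str] = field(default_factory=list)
    stop_reason: str = "completed"

def run_plan(plan: list[str], limits: RunLimits) -> RunResult:
    """Execute planned tool names until a limit is hit or the plan ends."""
    result = RunResult()
    for tool_name in plan:
        if len(result.executed) >= limits.max_steps:
            result.stop_reason = "step_budget_exhausted"
            break
        if tool_name not in limits.allowed_tools:
            result.stop_reason = f"tool_not_allowed:{tool_name}"
            break
        # A real agent would invoke the tool here; this sketch only records it.
        result.executed.append(tool_name)
    return result
```

Recording a `stop_reason` gives evaluation something to assert on beyond the final answer, which matters once plans become dynamic.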

The implementation choice also changes the failure mode. Code agents can expose tool use and state through executable programs, while natural-language agents can be easier to prompt but harder to constrain and debug. That tradeoff links agent design to Software Engineering, Testing, and AI coding tools as much as to prompt writing.^[11]

Start with the smallest workflow that solves the task. Use task decomposition so the agent doesn’t become one broad prompt that owns every decision. Decomposition clarifies the orchestration choice.^[3] Options include sequential pipelines, manager-agent orchestration, and collaboration. That choice matters for Software Engineering because a linear workflow is easier to test and debug than an open-ended multi-agent system. For the narrower search family behind fitness functions, mutation, and candidate selection, see Evolutionary Algorithms.
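A sequential pipeline is the easiest of those options to illustrate. In this minimal sketch each step is a plain function that can be tested on its own. The step names (`classify`, `draft_reply`) and the keyword rule are hypothetical placeholders for model calls.

```python
from typing import Callable

Step = Callable[[dict], dict]

def classify(state: dict) -> dict:
    """Placeholder for a model call that labels the request."""
    label = "billing" if "invoice" in state["text"].lower() else "general"
    return {**state, "label": label}

def draft_reply(state: dict) -> dict:
    """Placeholder for a model call that drafts a reply from the label."""
    return {**state, "draft": f"[{state['label']}] Thanks, we are looking into it."}

def run_pipeline(steps: list[Step], state: dict) -> dict:
    """Run steps in a fixed order; each step returns a new state dict."""
    for step in steps:
        state = step(state)
    return state
```

Because the order is fixed, a failing case can be traced to a single step, which is harder when a manager agent chooses the order at run time.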

Agents create value when they act on documents, APIs, and workflow state. Chat alone isn’t the point. Teams use embedded agents for Slack-style work and proactive assistants. Agent design becomes a product workflow question as much as a model question.^[4]

The email-assistant example shows the same product boundary. Gmail API access plus RAG can answer and act on messages only after the team chooses the inbox state. The team also has to choose the retrieved knowledge and user permissions the assistant may use.^[12]

Tooling and Integration

Tools are part of the agent’s interface with the world, but bad tools create noisy context and unsafe actions. Good tools expose constrained actions, typed inputs, traceable outputs, and enforceable permissions.
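Those four properties can be shown on a single tool. The sketch below is an assumption-laden example, not any SDK’s API: `restart_service`, the `ops:restart` scope, and the in-memory `TRACE` list are hypothetical, and the restart itself is simulated.

```python
from dataclasses import dataclass
from typing import Literal

ALLOWED_SERVICES = ("api", "worker")

@dataclass(frozen=True)
class RestartServiceInput:
    service: Literal["api", "worker"]  # typed input; also checked at run time below
    reason: str

TRACE: list[dict] = []  # stand-in for a real tracing backend

def restart_service(args: RestartServiceInput, caller_scopes: set[str]) -> dict:
    """Constrained action: restart one known service, if the caller is permitted."""
    if "ops:restart" not in caller_scopes:
        raise PermissionError("caller lacks ops:restart")
    if args.service not in ALLOWED_SERVICES:
        # Literal types are not enforced at run time, so validate explicitly.
        raise ValueError(f"unknown service: {args.service}")
    result = {"tool": "restart_service", "service": args.service, "status": "simulated"}
    TRACE.append({"input": args, "output": result})  # traceable output
    return result
```

The permission check lives in the tool rather than in the prompt, so it holds even when the model asks for something it shouldn’t.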

Prompts, SDKs, and tool wrappers pair with integration abstractions for diverse tools, which puts agent tooling inside the broader LLM Tools for Real Products stack. Agent marketplaces and MCP-style protocols make tools discoverable and callable.^[1] These are Tools questions, but the agent page keeps the workflow-specific part. Tools should match the decisions the agent is allowed to make.

The Game AI to LLM Agents discussion places the OpenAI Agent SDK and MCP integration alongside scratchpads. It also mentions internal reasoning servers.^[3] The agent may need private reasoning state. It may also need task state that isn’t shown directly to the user. Engineers still need enough observability to debug the system. Hidden state shouldn’t replace tests.

Production AI engineering covers a browser extension architecture with a backend AI integration. It also covers search-focused assistants and tool selection.^[13] Those examples keep agent engineering close to normal application architecture. A tool call still needs a backend, authentication, latency control, and failure handling.
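Failure handling around a tool call can be as plain as a bounded retry. The sketch below assumes the tool raises `TimeoutError` or `ConnectionError` on transient failures. `call_with_retry` and its defaults are hypothetical, and retrying is only one of several reasonable policies.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retry(fn: Callable[[], T], attempts: int = 3, base_delay: float = 0.1) -> T:
    """Call fn, retrying transient errors with exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Wait longer after each failure: base, 2x base, 4x base, ...
                time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"tool call failed after {attempts} attempts") from last_error
```

Capping the number of attempts keeps a failing dependency from turning one agent step into an unbounded wait.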

Iusztin places agents inside a broader full-stack AI engineer role. The agent is one system piece beside frontend, backend, databases, and RAG. Deployment and LLMOps sit in the same product path. That keeps agent work grounded in product ownership instead of a standalone demo.^[14]

Retrieval, Memory, and Context

Retrieval is one of the main tools agents use. It gives the system access to documents, logs, and emails. It can also expose code, tickets, and other external state. The cited discussions don’t treat retrieval as automatic. They cover chunking, metadata, wrappers, and failure analysis.

Context engineering is the design of effective LLM inputs. The RAG reality check is that latency, cost, and noisy context can break a system. Teams often need to rework retrieval backends, chunking, metadata, and wrappers so retrieved information fits the agent’s job.^[1]

For the broader retrieval architecture, see Retrieval-Augmented Generation.

Chunking strategies include fixed length and sliding windows. Retrieval memory is distinct from multi-turn conversation memory.^[15] That distinction is central to agent engineering. A support assistant may need durable customer facts and ticket history, while a coding agent may need repository context and task state. A short conversation memory isn’t enough for either system.
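The two chunking strategies differ only in how far the window advances. This minimal sketch chunks by character for simplicity. That is an assumption, since production systems often split by tokens or sentences, and both function names are hypothetical.

```python
def fixed_length_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive, non-overlapping chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]

def sliding_window_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Split text into chunks of at most `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

For example, `fixed_length_chunks("abcdefghij", 4)` returns `["abcd", "efgh", "ij"]`, while `sliding_window_chunks("abcdefghij", 4, 2)` returns `["abcd", "cdef", "efgh", "ghij"]`. The overlap keeps text that straddles a boundary intact in at least one chunk, at the cost of storing more chunks.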

The first memory question is whether the workflow needs memory at all. If the agent only answers one request from supplied context, memory may add risk without adding value. If the agent has to remember preferences, prior decisions, or long-running task state, treat memory as a designed data source with its own evaluation cases.

RAG and knowledge management connect to the AI engineer skill stack.^[2] That link matters because many agent failures are knowledge-system failures. If teams don’t give source documents metadata, ownership, or a refresh cadence, the agent will act on weak context.
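A small filter shows how that metadata can gate what the agent sees. The field names (`owner`, `last_refreshed`, `refresh_days`) and the `usable_docs` helper are hypothetical, chosen only to mirror the ownership and refresh-cadence points above.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class SourceDoc:
    doc_id: str
    owner: str            # empty string means nobody owns the document
    last_refreshed: date
    refresh_days: int     # agreed refresh cadence in days

def usable_docs(docs: list[SourceDoc], today: date) -> list[SourceDoc]:
    """Keep documents that have an owner and are within their refresh cadence."""
    return [
        doc for doc in docs
        if doc.owner and today - doc.last_refreshed <= timedelta(days=doc.refresh_days)
    ]
```

Filtering before retrieval means stale or unowned documents never reach the context window, so the failure shows up as a missing source rather than a confident wrong action.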

Evaluation and Testing

Agent evaluation checks whether the system accomplished the goal under realistic conditions. It can’t only compare one final string to a reference answer because multiple valid tool-call paths may exist.

Agent evaluation uses custom datasets, system benchmarks, and mocked tools. Teams also use integration tests and regression tests. The focus is goal-based evaluation and outcome assertions over exact paths.^[1] Those evaluation choices belong with LLM Evaluation Workflows and Testing.
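A mocked tool and an outcome assertion fit in a short test. In this sketch `resolve_ticket` is a hypothetical agent task and the tool interface (`get_ticket`, `close_ticket`) is an assumption. `unittest.mock.Mock` from the Python standard library stands in for the real integration.

```python
from unittest.mock import Mock

def resolve_ticket(ticket_id: str, tools) -> dict:
    """Hypothetical agent task: close a ticket only if it was already answered."""
    ticket = tools.get_ticket(ticket_id)
    if ticket["answered"]:
        tools.close_ticket(ticket_id)
        return {"ticket_id": ticket_id, "status": "closed"}
    return {"ticket_id": ticket_id, "status": "open"}

def test_answered_ticket_gets_closed() -> None:
    tools = Mock()
    tools.get_ticket.return_value = {"answered": True}
    result = resolve_ticket("T-42", tools)
    # Outcome assertions: check the goal state and the side effect,
    # not the exact sequence of intermediate calls.
    assert result["status"] == "closed"
    tools.close_ticket.assert_called_once_with("T-42")

def test_unanswered_ticket_stays_open() -> None:
    tools = Mock()
    tools.get_ticket.return_value = {"answered": False}
    result = resolve_ticket("T-43", tools)
    assert result["status"] == "open"
    tools.close_ticket.assert_not_called()
```

The tests pass for any internal path that reaches the right end state, which is the point of asserting on outcomes rather than on a fixed tool-call trace.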

Teams use generator-evaluator loops.^[4] They still weigh gold test sets against cost and representativeness. Failure analysis can change retrieval. Those evaluation habits are useful before a system becomes agentic, then become more important when the system starts calling tools.
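A generator-evaluator loop can be sketched as a bounded retry in which the evaluator’s feedback is passed back to the generator. Everything here is illustrative. `generate_with_review` is a hypothetical helper, and the toy generator and evaluator stand in for model calls.

```python
from typing import Callable

def generate_with_review(
    generate: Callable[[str, str], str],
    evaluate: Callable[[str], tuple[bool, str]],
    task: str,
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Regenerate until the evaluator passes the draft or the round budget runs out."""
    draft, feedback = "", ""
    for round_number in range(1, max_rounds + 1):
        draft = generate(task, feedback)
        passed, feedback = evaluate(draft)
        if passed:
            return draft, round_number
    return draft, max_rounds

def toy_generate(task: str, feedback: str) -> str:
    """Stand-in generator: adds sources only when the feedback asks for them."""
    return task + (" Sources: [1]" if "cite" in feedback else "")

def toy_evaluate(draft: str) -> tuple[bool, str]:
    """Stand-in evaluator: passes drafts that include a sources section."""
    if "Sources:" in draft:
        return True, ""
    return False, "please cite sources"
```

With the toy functions, `generate_with_review(toy_generate, toy_evaluate, "Summary.")` returns `("Summary. Sources: [1]", 2)`: the first draft fails, and the second uses the feedback. The round budget is what keeps the loop’s cost predictable.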

Enterprise evals cover multi-tenancy and scale.^[9] Human-label alignment matters too.^[9]

Arize Phoenix appears as a monitoring tool.^[3]

These practices make evaluation part of daily agent work rather than a one-time benchmark.

Production and Governance

Production agents need permission checks, audit logs, lineage records, and guardrails. They also need monitoring and human review for high-risk actions. Cost and latency controls matter because tool calls, retrieval, and multi-step reasoning can multiply runtime.
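Permission checks, audit logging, and human review can sit in one wrapper around action execution. The sketch below makes several assumptions: the `HIGH_RISK` set, the `approve` callback, and the in-memory `AUDIT_LOG` are hypothetical stand-ins for a risk policy, a review queue, and durable log storage.

```python
from datetime import datetime, timezone
from typing import Callable

HIGH_RISK = {"send_email", "delete_record"}  # hypothetical risk tier
AUDIT_LOG: list[dict] = []                   # stand-in for durable storage

def execute_action(
    action: str,
    user: str,
    permissions: set[str],
    approve: Callable[[str], bool],
) -> str:
    """Check permission, ask for review on high-risk actions, and log the outcome."""
    if action not in permissions:
        outcome = "denied"
    elif action in HIGH_RISK and not approve(action):
        outcome = "rejected_by_reviewer"
    else:
        outcome = "executed"  # a real system would call the tool here
    AUDIT_LOG.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "outcome": outcome,
    })
    return outcome
```

Every path writes to the audit log, including denials, so the log records what the agent attempted as well as what it did.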

Legal and healthcare agents put reliability, guardrails, and data lineage close to Agent MLOps. User feedback loops, infrastructure risks, and deployment risks belong in the same governance design.^[9]

Security overlaps here too.^[16] For agents, the same retrieval risk can become an action risk. That matters when the system can call tools, write data, send messages, or trigger workflows.

Teams keep agents governed by narrowing tool permissions and tracing the data used for each answer or action. They also keep human review around high-impact decisions and test failures repeatedly. Practical agent work therefore draws on AI Red Teaming, Responsible AI and Governance, Data Governance, and Production.

The operational discipline of monitoring, governing, and deploying agents in production is covered as Agent Ops. The serving and release-ownership side of that work belongs with LLM Deployment.

Production AI engineering covers prompt evaluation and cost tradeoffs. Prompt compression and caching are model-efficiency tools.^[13] These techniques aren’t agent-specific. Agents make them more important because extra steps add tokens, latency, and failure modes.
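The caching idea can be illustrated with an exact-match response cache. This is a deliberately simple variant and not the same thing as provider-side prompt caching. `PromptCache` and the `model_call` parameter are hypothetical names.

```python
import hashlib
from typing import Callable

class PromptCache:
    """Exact-match cache: identical prompts reuse the stored response."""

    def __init__(self, model_call: Callable[[str], str]) -> None:
        self._model_call = model_call
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, prompt: str) -> str:
        # Hash the prompt so the key size is fixed regardless of prompt length.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = self._model_call(prompt)
        return self._store[key]
```

In a multi-step agent, repeated sub-prompts such as the same classification or the same tool-selection question are where a cache like this saves tokens and latency. It should only wrap calls whose answers are safe to reuse.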

Generic agent products miss details that live in each task. Teams need specific integrations, context, datasets, and evaluation.^[1] An agent should be designed around a real workflow and measured against that workflow.

Agent engineering connects these production and workflow pages.

DataTalks.Club