AI Engineering Roadmap

A roadmap for learning AI engineering through software foundations, LLM applications, RAG, evaluation, agents, LLMOps, and production ownership.

Related Wiki Pages

AI Engineering, AI Engineer Role, LLM Production Patterns, Retrieval-Augmented Generation, Agent Engineering, LLM Evaluation Workflows, Agent Ops, AI Infrastructure, MLOps

An AI engineering roadmap gives learners a sequence for building software around models. It also helps them prove that the software behaves well enough for real users. Use AI Engineering for the discipline map and AI Engineer Role for title boundaries. The roadmap owns the order of study, project progression, and career transition path.

The sequence starts with product software before it adds LLMs and retrieval-augmented generation. Later stages add LLM evaluation workflows, agent engineering, and production operation.

Paul Iusztin grounds that order by putting product work and RAG in one shipping path. He also adds agents, evaluation, and LLMOps ^[1]. Ruslan Shchuchkin adds the project signal: usable applications need product discovery and context management, not only model calls ^[2].

From Product App To AI System

The first learning step is still product software. Ship a small interface or API before adding retrieval, agents, deployment, and evaluation. Paul’s skill-stack discussion supports that order because application work comes before the later AI-specific layers ^[1].

Ruslan’s BranchGPT example keeps the same sequence grounded in a concrete application. The product starts as a web application, then adds context management and user behavior ^[2]. Nasser Qadri keeps precision, recall, and accuracy in view when generative AI systems replace older ML workflows ^[3].

That makes the order clear. Learn software engineering first, then prompt engineering and Retrieval-Augmented Generation. After that, add LLM Evaluation Workflows, LLM Production Patterns, and MLOps.

Entry Paths Into The Roadmap

Different learners can enter the same sequence from different strengths:

Full-stack builders can start at Stage 1, then add RAG, agents, evaluation, and monitoring as product features (^[1]).
Product and domain switchers can start with Stage 1 and Stage 2. Then use AI Engineer Role and nontraditional paths to AI engineering for AI product proof (^[2]) (^[4]).
Data-science learners should keep Nasser’s metric and domain-evaluation discipline visible while moving through the LLM, RAG, and evaluation stages (^[3]).
Agent-focused learners should cover prompts, structured outputs, gold tests, traces, and RAG before adding tools and memory (^[5]) (^[6]).
Production-focused learners should deepen the later stages with data trust, pipeline tests, and prompt evaluation. Caching, compression, and latency control belong there too (^[7]).

Stage 1: Build Normal Software

Start with ordinary application engineering by building a small service and one interface or API. Finish this stage with persistence, tests, and deployment. Add a basic monitoring path before complex AI architecture. Paul’s roadmap keeps product shipping and application layers inside the AI engineering stack. Databases, deployment, and monitoring belong there too (^[8]).

Ruslan’s BranchGPT example shows why this stage comes first. The project needed application structure and context-management behavior, not only a model call (^[2]).

For this stage, use Notebook to Production AI Systems and AI Infrastructure. Use machine learning for software engineers for the software foundation. When learners need codebase-aware help, use AI coding tools. Keep the proof in tests, diffs, and working product behavior rather than in the prompt (^[7]).
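A minimal sketch of what "a small service with persistence and tests" can look like before any model is involved. The notes table, the function names, and the use of SQLite are assumptions chosen for illustration, not something the sources prescribe.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a SQLite database and make sure the (hypothetical) notes table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def add_note(conn, body):
    """Validate and persist one note, returning its row id."""
    if not body.strip():
        raise ValueError("note body must not be empty")
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO notes (body) VALUES (?)", (body.strip(),))
    return cur.lastrowid


def list_notes(conn):
    """Return note bodies in insertion order."""
    return [row[0] for row in conn.execute("SELECT body FROM notes ORDER BY id")]


if __name__ == "__main__":
    store = open_store()
    add_note(store, "ship the API before adding retrieval")
    print(list_notes(store))
```

The same functions can sit behind an HTTP route later. The point of the stage is that storage and validation are testable before a model call exists.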

Stage 2: Add LLM Calls and Structured Outputs

After the application shell works, add model calls and make the model output inspectable. Hugo uses everyday LLM tasks and role prompts as an early practical path. He also covers transcript workflows, structured outputs, and traces ^[5].

Build a narrow product, not a generic chatbot. The learner should show the user task, the prompt or message format, the expected output format, and the failure cases. Ruslan’s daily-life project advice and hiring signals support that project-first standard (^[2]). Use AI Engineering Portfolios when this stage needs examples that make those project signals reviewable.
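One way to make the expected output format and the failure cases explicit is to parse every model reply against a small contract. The sketch below assumes a hypothetical ticket-triage task with a JSON reply. The field names, categories, and error class are illustrative, and no model API is called.

```python
import json
from dataclasses import dataclass

# Hypothetical output contract for a narrow task: triaging a support ticket.
ALLOWED_CATEGORIES = {"billing", "bug", "other"}


@dataclass
class TicketTriage:
    category: str
    summary: str


class OutputFormatError(ValueError):
    """Raised when a model reply does not match the expected format."""


def parse_triage(raw_reply: str) -> TicketTriage:
    """Parse a raw model reply and enforce the expected output format."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise OutputFormatError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise OutputFormatError("expected a JSON object")
    category = data.get("category")
    summary = data.get("summary")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise OutputFormatError(f"unknown category: {category!r}")
    if not isinstance(summary, str) or not summary.strip():
        raise OutputFormatError("summary must be a non-empty string")
    return TicketTriage(category=category, summary=summary.strip())


if __name__ == "__main__":
    print(parse_triage('{"category": "bug", "summary": "App crashes on login"}'))
```

Each raised `OutputFormatError` is a named failure case, so the project can count and report them instead of hiding them inside a chat transcript.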

The same project boundary shows up in LLM system design interview practice. Candidates need to explain the user task, source of truth, and context. They also need failure modes and operating constraints. For tool choices, connect this stage to LLM Tools and Prompt Engineering.

Stage 3: Build Evaluation Before More Architecture

Create a representative test set and define pass/fail criteria, then categorize errors before adding retrieval, agents, or fine-tuning. Paul calls evaluation one of the technical pillars for shipping AI products (^[1]). Nasser’s metric framing keeps precision, recall, and accuracy in view for generative systems (^[3]).
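The test set, the pass/fail criteria, and the error categories can start as a few dozen lines of code. The sketch below assumes a binary task, for example "is this message a refund request", and a `predict` callable that stands in for the LLM-backed function. The thresholds and names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class GoldExample:
    text: str
    expected: bool  # True when the example belongs to the positive class


def evaluate(predict, examples):
    """Compare predictions with gold labels; return metrics and categorized failures."""
    tp = fp = fn = tn = 0
    failures = []
    for ex in examples:
        predicted = bool(predict(ex.text))
        if predicted and ex.expected:
            tp += 1
        elif predicted and not ex.expected:
            fp += 1
            failures.append(("false_positive", ex.text))
        elif not predicted and ex.expected:
            fn += 1
            failures.append(("false_negative", ex.text))
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
        "failures": failures,
    }


def passes(report, min_precision=0.8, min_recall=0.8):
    """Pass/fail gate; the thresholds are illustrative, not recommended values."""
    return report["precision"] >= min_precision and report["recall"] >= min_recall


if __name__ == "__main__":
    gold = [GoldExample("refund please", True), GoldExample("hello", False)]
    report = evaluate(lambda text: "refund" in text.lower(), gold)
    print(report, passes(report))
```

The `failures` list is the part to read by hand. Grouping it by category shows whether the next change should target missed cases or false alarms.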

Hugo adds gold tests, traces, and failure analysis to early LLM engineering. They belong there, not only after production launch (^[5]). Use LLM Evaluation Workflows and Evaluation for the detailed evaluation mechanics.

Stage 4: Add Retrieval When Knowledge Is the Bottleneck

Add RAG when the product needs changing knowledge, private documents, citations, or auditable source context. Paul puts RAG and knowledge management inside the AI engineer stack ^[9].

Meryem Arik draws the production boundary between retrieval and fine-tuning. She also compares open-source models with hosted APIs ^[10]. Atita Arora explains RAG as retrieval plus generation. She then covers chunking, citations, and human-in-the-loop evaluation ^[11].

A good roadmap project at this stage includes ingestion, chunking, and metadata. It also includes embeddings, retrieval, citations, and retrieval failure analysis. Use the LLM and RAG production roadmap when the work needs a build sequence from retrieval to evaluation and operating controls. Compare the choices through RAG vs Fine-Tuning, Search and RAG Project Checklist, and RAG Portfolio Projects.
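The project pieces listed above can be sketched end to end in plain Python. To stay self-contained, this sketch replaces embeddings with simple token overlap, which is a stand-in and not a recommended retriever. The chunk size, the citation format, and all names are assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str   # metadata: which source document the chunk came from
    index: int    # metadata: position of the chunk inside that document
    text: str

    @property
    def citation(self) -> str:
        return f"{self.doc_id}#chunk-{self.index}"


def chunk_document(doc_id, text, max_words=40):
    """Split a document into fixed-size word windows that keep their source metadata."""
    words = text.split()
    return [
        Chunk(doc_id, start // max_words, " ".join(words[start:start + max_words]))
        for start in range(0, len(words), max_words)
    ]


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, chunks, top_k=2):
    """Rank chunks by shared tokens with the query (stand-in for embedding similarity)."""
    query_tokens = tokens(query)
    scored = [(len(query_tokens & tokens(chunk.text)), chunk) for chunk in chunks]
    scored = [(score, chunk) for score, chunk in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


if __name__ == "__main__":
    corpus = chunk_document(
        "handbook",
        "Refunds are issued within five days. Shipping takes two weeks.",
        max_words=6,
    )
    for hit in retrieve("How long does shipping take?", corpus):
        print(hit.citation, "->", hit.text)
```

An empty result from `retrieve` is itself a retrieval failure worth logging, because it separates "nothing relevant was found" from "the model answered badly with good context".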

Stage 5: Add Agents Only for Tool-Using Work

Move from RAG to agents when the user task needs planning or tools. Agents can also fit tasks that need memory or multi-step action. Ranjitha defines agents through autonomy and objectives. She then adds tools, memory, and knowledge stores. Her discussion also covers context engineering, planning, and outcome-based tests ^[6].

Micheal Lanham gives a more minimal engineering rule. Decompose the task and avoid unnecessary complexity when a simpler workflow works (^[12]). His game-AI path also points to evolutionary algorithms as adjacent background for search, feedback, and agent behavior, not as a required first step. For this stage, use Agent Engineering, AI Agents, and Multi-Agent Systems.

An agent project should show tool contracts, typed inputs, and permissions. It should also show timeouts, traces, mocked-tool tests, and outcome assertions. Ranjitha’s testing guidance supports outcome-based checks rather than brittle exact-path tests (^[6]).
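A small sketch of a tool contract with a typed input, a permission check, a mocked tool, and an outcome assertion. The weather tool, the permission string, and `run_workflow` (a one-step stand-in for an agent loop) are all hypothetical. Timeouts and traces are left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set


@dataclass(frozen=True)
class WeatherInput:
    """Typed input for the (hypothetical) weather tool."""
    city: str


@dataclass
class Tool:
    name: str
    required_permission: str
    run: Callable[[WeatherInput], str]


class PermissionDenied(Exception):
    pass


def call_tool(tools: Dict[str, Tool], granted: Set[str], name: str,
              payload: WeatherInput) -> str:
    """Check the caller's permissions before running the named tool."""
    tool = tools[name]
    if tool.required_permission not in granted:
        raise PermissionDenied(f"{name} requires {tool.required_permission}")
    return tool.run(payload)


def mocked_weather(payload: WeatherInput) -> str:
    """Deterministic replacement for a real weather API in tests."""
    return f"sunny in {payload.city}"


def run_workflow(tools, granted, city):
    """Stand-in for an agent loop: one tool call, then a final answer."""
    observation = call_tool(tools, granted, "weather", WeatherInput(city))
    return f"Forecast: {observation}"


if __name__ == "__main__":
    registry = {"weather": Tool("weather", "read:weather", mocked_weather)}
    answer = run_workflow(registry, {"read:weather"}, "Oslo")
    # Outcome assertion: check what the user receives, not the exact call path.
    assert "sunny" in answer and "Oslo" in answer
    print(answer)
```

The assertion checks the final answer rather than the sequence of internal steps, so the test keeps passing if the workflow later reaches the same outcome through a different path.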

Stage 6: Operate the System

Operate the product by versioning prompts, retrieval data, examples, and traces. Add monitoring, feedback capture, and cost checks when the product has users. Add latency work, safety tests, and rollback paths too. Bartosz connects production AI to data pipeline tests and prompt evaluation. He also covers compression, caching, and latency ^[7].
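Prompt versioning, caching, and latency logging can be combined in one thin wrapper. In this sketch the version id is a hash of the prompt template, and `model_fn` is a placeholder for whatever client the product uses. The class name, the log fields, and the in-memory cache are assumptions.

```python
import hashlib
import time


def prompt_version(template: str) -> str:
    """Derive a short, stable version id from the prompt text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


class CachedModel:
    """Wrap a model callable with a prompt version, a cache, and a latency log."""

    def __init__(self, model_fn, template):
        self.model_fn = model_fn      # placeholder for a real model client
        self.template = template      # must contain an {input} placeholder
        self.version = prompt_version(template)
        self.cache = {}
        self.log = []

    def ask(self, user_input):
        # The version is part of the key, so editing the prompt invalidates old answers.
        key = (self.version, user_input)
        start = time.perf_counter()
        hit = key in self.cache
        if not hit:
            self.cache[key] = self.model_fn(self.template.format(input=user_input))
        latency_ms = (time.perf_counter() - start) * 1000
        self.log.append(
            {"version": self.version, "cache_hit": hit, "latency_ms": latency_ms}
        )
        return self.cache[key]


if __name__ == "__main__":
    model = CachedModel(lambda prompt: prompt.upper(), "Summarize: {input}")
    model.ask("hello")
    model.ask("hello")
    print(model.version, [entry["cache_hit"] for entry in model.log])
```

Because every log entry carries the prompt version, a regression after a prompt change can be traced to that version and rolled back.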

Mariano Semelman ties end-to-end AI ownership to requirements and deployment. He also covers monitoring and feedback ^[13]. Aditya Gautam adds agent guardrails and data lineage. He also covers feedback iteration and LLM judge alignment ^[14].

For deployed agents, that operating layer is Agent Ops. For this stage, use AI Red Teaming, Security, and Responsible AI and Governance.

Portfolio Project Sequence

Start with a focused model-backed task assistant for a specific user task. Include deployment, logs, structured input and output, and tests. Paul’s full-stack framing makes this the first portfolio step ^[1]. Ruslan’s BranchGPT example shows the same choice ^[2].

Then build an evaluation harness with representative examples and pass/fail criteria. Add failure categories, cost notes, and latency notes. Hugo’s gold-test workflow anchors this stage ^[5]. Nasser’s metric discipline adds precision, recall, and accuracy ^[3].

Next, build a RAG assistant with ingestion, chunking, and metadata. Add embeddings, retrieval, citations, and failure analysis. Meryem’s deployment tradeoffs define the retrieval and fine-tuning boundary ^[10]. Atita’s search-grounded RAG discussion adds chunking, citations, and human review ^[11].

After that, build a constrained tool-using workflow with permissions, timeouts, and traces. Add mocked tools and outcome assertions. Ranjitha’s agent testing guidance explains why outcome assertions belong in the project ^[6]. Micheal’s minimal workflow advice keeps the project constrained ^[12].

Build the capstone as a production-style AI product with versioned prompts and evaluation history. Add monitoring and feedback capture. Include cost controls, caching, rollback notes, and an operating note.

Bartosz ties production AI work to pipeline tests, prompt evaluation, and latency control ^[7]. Mariano’s notebook-to-production framing adds requirements and deployment. He also covers monitoring and feedback ^[13].

Domain-specific projects need user context, data limits, and evaluation criteria instead of a generic demo (^[3]). Revathy’s telecom capstone supports the same standard ^[4]. Use AI engineering portfolio projects when this sequence needs concrete project shapes, review signals, and README evidence.

Study-Build Boundary

Start building when you can write a small service and call an LLM API. You should also be able to parse structured output, store data, and write tests around expected behavior. Paul describes AI engineering through shipped applications rather than passive study (^[1]). Ruslan’s BranchGPT discussion gives the same signal ^[2].

Study the next technique when the project exposes that constraint. Add RAG when source knowledge, citations, or freshness block a useful answer. Add agents when the task needs tools, planning, or multi-step action.

Add LLMOps and platform work when releases or traces become necessary. Cost controls, monitoring, and rollback paths can justify the same move. Meryem covers retrieval and deployment tradeoffs ^[10]. Ranjitha covers the agent-readiness boundary ^[6]. Bartosz covers production constraints ^[7].

Career Readiness Milestones

Entry-level readiness means you can ship a small LLM application. You can also create a representative evaluation set, explain failures, and deploy a usable prototype. Paul’s full-stack AI skills define the application side of this milestone (^[1]). Hugo’s early evaluation work adds gold tests and traces (^[5]).

Mid-level readiness means you can own a RAG or constrained agent workflow. You can choose models and retrieval strategies, debug bad outputs, and track cost and latency. You can also work with domain experts. Meryem’s deployment choices cover the model, retrieval, and serving decisions behind this stage (^[10]). Candidates use the same model, retrieval, and serving choices in LLM system design interview preparation.

Atita’s retrieval-quality discussion adds search-system judgment (^[11]). Ranjitha’s agent tests add the agent side (^[6]). Nasser’s domain-knowledge framing adds the domain side (^[3]).

Senior readiness means you can design the AI product architecture and set evaluation standards. You can also manage security and governance tradeoffs. Model choices, data dependencies, and MLOps platforms become part of the same work. A staff AI engineer operates at that level when the work crosses teams and standards without requiring a manager title (^[15]). Bartosz’s production discipline defines the reliability side of this stage (^[7]).

Aditya’s agent-governance framing adds guardrails and lineage. It also adds LLM judge alignment (^[14]). Mariano’s end-to-end ownership adds requirements and deployment. He also covers monitoring and feedback (^[13]).

DataTalks.Club