Wiki

LLMs

Large language models with links to retrieval, agents, evaluation, production, and security pages.

Related Wiki Pages

AI AI Engineering AI Tooling LLM Production Patterns Retrieval-Augmented Generation LLM Evaluation Workflows Long-Context LLM Evaluation Agent Engineering Generative AI Multimodal LLMs NLP

Large language models are machine learning models trained to process and generate language. Teams use them for text generation, summarization, translation, and information extraction. They also show up in retrieval-backed question answering, agents, and developer tools.

An LLM is rarely a finished product on its own. The model sits inside a larger system with prompts, retrieval, data pipelines, and evaluation ^[1]. Neighboring pages cover serving, release, monitoring, and guardrails. Start with LLM Production Patterns, LLMOps, and LLM Evaluation Workflows.

That places LLMs near AI and NLP. It also links them to generative AI, agent engineering, and LLM production patterns.

Capabilities and Prompting

An LLM is a general language model that teams can prompt for many language tasks, then adapt with context, examples, and retrieval. Teams further tune it with fine-tuning or tools when prompting isn’t enough. Build a Large Language Model (From Scratch) grounds that definition by walking through a transformer-based model from the ground up.

Generative and non-generative language models are distinct. Modern LLMs use transformers because they handle unstructured text at scale ^[2].

The traditional NLP pipeline labels data, designs the task, tests behavior, and deploys the system. GPT-3-style prompting contrasts with that pipeline: a model can produce useful behavior from a prompt instead of a task-specific training pipeline ^[3]. The GPT-3 book by Sandra Kublik and Shubham Saboo collects the early practitioner stories behind that prompt-driven shift.

Everyday uses include summaries, translation, and CSV workflows. Prompting practice adds role prompts, structured output, and timestamps ^[1].

Model Boundaries

LLMs are useful, but different failure modes call for drawing different boundaries first.

For deployment, open-source and API models differ in control, privacy, and fine-tuning. Model-drift risk appears when an API provider changes behavior behind the scenes ^[2]. That makes LLM adoption an AI infrastructure and production decision. Making models smaller and faster for deployment is the practice of Model Optimization. The serving and reliability patterns live in LLM Production Patterns.

For NLP team design, GPT-3 has limits around cost and control plus bias and privacy risks. It’s useful for MVPs, but it doesn’t replace in-house pipelines when the team needs control ^[3].

For applied research, long-context evaluation reveals performance drops around 32k-64k context in a financial benchmark. That ties LLM quality to empirical tests rather than advertised context length ^[4].

For trust and safety, hallucinations, legal exposure, and financial incidents drive layered defenses. Non-LLM classifiers can help when a generative model is too easy to manipulate ^[5].

LLM Use Cases

Practical language work includes summaries, translation, and CSV handling. Transcript automation uses tools such as Gemini, Descript, and Loom. Developer assistants include GitHub Copilot, Cursor, and IDE agents ^[1]. Use AI tools for personal productivity for the personal workflow version of these examples. Screenshots, diagrams, audio, and video move the same product boundary toward multimodal LLMs.

LLMs also appear as product interfaces for chatbots, controlled machine translation, and moderation support. In high-risk workflows, people review the model output ^[5].

Agents are a separate use case because the model does more than answer once. Agents combine LLM autonomy with objectives and tool use. They may also use memory and knowledge stores, and the same design questions extend to Multi-Agent Systems ^[6].

Micheal Lanham connects game AI state, actions, and feedback to LLM agent workflows in Game AI to LLM Agents ^[7]. That bridge also touches evolutionary algorithms when prompts or behaviors are treated as candidates to test against feedback. The AI agents page separates agent workflow design from ordinary prompting.

RAG and Fine-Tuning

Changing model behavior and adding current knowledge are separate jobs.

Fine-tuning handles specialization, domain adaptation, and tone and format control. Retrieval handles changing knowledge: the team indexes documents and retrieves relevant passages without retraining the model for every fact update ^[2].

RAG vs Fine-Tuning uses the same split. Retrieval helps when the system needs fresh documents, citations, proprietary knowledge, or reviewable evidence. Fine-tuning helps when the model should behave differently. It can also help with a repeated output style or a repeated task.

On the implementation path, RAG with chunking and embeddings can be a quick business win. Chunk size, sliding windows, and context rot decide what the model sees ^[1].

Chunking, retrieval, and summarization for large documents are the research reason to prefer retrieval in many long-document settings, given long-context performance limits ^[4].

Quality Questions

LLM evaluation is task-specific because a model’s general benchmark score doesn’t prove its workflow.

Generator-evaluator runs provide automated quality control. Gold tests raise questions of cost, representativeness, and test-set size. Failure analysis decides whether retrieval, prompts, or data should change ^[1].

For agents, quality checks extend to custom datasets and mocked tools. They also use integration tests and regression tests. Outcome assertions fit better than exact path matching because valid agent runs may take different tool-call paths ^[6].

Long-context models need tests that match the document task. In one financial setting, applied research checks long-context behavior instead of relying on context-window size alone ^[4]. That makes long-context LLM evaluation part of the evaluation path for document-heavy systems.

Use LLM Evaluation Workflows for the evaluation workflow and Production Search Evaluation for retrieval quality when RAG depends on search.

Production Neighbors

Production LLM systems need normal software and ML operations around the model. Teams have to plan deployment, latency, and cost. They also need monitoring, rollback, and ownership. Serving and reliability concerns belong on LLM Production Patterns. Release discipline, traces, evaluation sets, and feedback loops belong on LLMOps.

Teams may prototype with hosted APIs before choosing open-source LLMs for production. That choice brings latency and cost questions. It also brings self-hosting, hardware, and provider-drift questions because hidden model changes can shift product behavior ^[2]. For the staged version of that choice, use the LLM and RAG Production Roadmap.

Context engineering is the concept bridge from this page into production. RAG brings latency, cost, and noisy inputs. Chunking, metadata, and wrappers decide what useful context reaches the model ^[6].

Operational feedback loops rest on logging, traces, and debuggable MVPs ^[1]. Those practices connect LLM work to MLOps, software engineering, and AI engineering.

Security and Trust

Security isn’t a late-stage add-on for LLM systems because the model receives instructions from users and sometimes from retrieved documents. That creates new attack paths around prompt injection, data exfiltration, hallucinated answers, and overconfident users.

A large-scale chatbot hacking exercise exposes data exfiltration through prompt overload and knowledge-base retrieval. The defenses include output validation, query analysis, layered defenses, and non-LLM classifiers where they’re harder to manipulate than generative models ^[5]. Use prompt injection and chatbot risk management for the narrower chatbot control model.

GPT-3 risks include concerns around cost, control, bias, and privacy ^[3]. Those concerns connect LLMs to AI red teaming, security, and privacy engineering for ML.

Security also affects retrieval because a RAG system may retrieve confidential or poisoned documents. The LLM can then expose or amplify them, so validation, query analysis, and layered checks need to surround generation ^[5].

LLM work connects most often to production, retrieval, and evaluation. Agent and security topics sit nearby too.

DataTalks.Club