RM-002 · Intermediate · 4–6 months

AI Engineer

Build production LLM systems: prompting, RAG, evals, agents, fine-tuning, and the ops layer underneath.

An AI engineer is the bridge between research-grade models and shipping product. You don't need to train models from scratch — you need to compose them, evaluate them, and run them reliably in production.

S01

Foundations

Just enough ML and transformer intuition to reason about how LLMs actually behave.

  1. 01

    ML Basics

    core

    Training vs. inference, loss, gradient descent. Skip the calculus; keep the intuition.

  2. 02

    Transformer Intuition

    core

    Attention, embeddings, autoregressive generation. Read the paper once, then move on.

  3. 03

    Tokenization & Context

    core

    Tokens, BPE, context windows, why your prompt got truncated.

  4. 04

    Sampling Parameters

    core

    Temperature, top-p, top-k, repetition penalty — what they actually do.
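A minimal sketch of what temperature and top-p (nucleus) sampling do, over hand-made toy logits — not any provider's actual implementation:

```python
import math
import random

def sample_token(logits: dict, temperature: float = 1.0,
                 top_p: float = 1.0, rng=None) -> str:
    """Toy next-token sampler: temperature rescales, top-p truncates."""
    rng = rng or random.Random()
    # Temperature divides logits before softmax: <1 sharpens, >1 flattens.
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(l - m) for t, l in scaled.items()}
    total = sum(exps.values())
    probs = {t: e / total for t, e in exps.items()}
    # Top-p: keep the smallest set of tokens whose cumulative mass >= top_p.
    kept, mass = {}, 0.0
    for t, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[t] = p
        mass += p
        if mass >= top_p:
            break
    # Renormalize the surviving tokens and draw one.
    z = sum(kept.values())
    r, acc = rng.random(), 0.0
    for t, p in kept.items():
        acc += p / z
        if r <= acc:
            return t
    return t  # float-rounding fallback: last kept token
```

Low temperature makes the sampler near-deterministic; a small top-p cuts the long tail regardless of temperature.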

S02

Prompt Engineering

The patterns that consistently lift quality without fine-tuning anything.

  1. 01

    Prompt Patterns

    core

    Zero-shot, few-shot, role prompting, chain-of-thought, and when each helps.

  2. 02

    Structured Output

    core

    JSON modes, response schemas, tool-call coercion — making LLMs return clean data.

  3. 03

    Prompt Caching

    recommended

    Cache long system prompts and shared context. Pay once, reuse for hours.

  4. 04

    Context Engineering

    core

    Treat the context window as a UX surface. Order matters; recency matters; relevance matters.
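The context-engineering idea above can be sketched as a budgeted assembly step: system prompt first, then relevant chunks, then the most recent history. `count_tokens` is a stand-in heuristic — real systems use the model's tokenizer:

```python
def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough ~4 chars/token heuristic, not a real tokenizer

def build_context(system: str, chunks: list, history: list,
                  budget: int = 1000) -> str:
    """Assemble a prompt under a token budget, dropping the least
    valuable pieces first: old history before relevant chunks."""
    parts = [system]
    remaining = budget - count_tokens(system)
    for chunk in chunks:  # assumed pre-sorted by relevance
        cost = count_tokens(chunk)
        if cost > remaining:
            break
        parts.append(chunk)
        remaining -= cost
    kept = []
    for msg in reversed(history):  # walk back from the newest turn
        cost = count_tokens(msg)
        if cost > remaining:
            break
        kept.append(msg)
        remaining -= cost
    return "\n\n".join(parts + list(reversed(kept)))
```

When the budget is tight, the oldest history falls off first — which is usually the right trade.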

S03

Retrieval-Augmented Generation

Give the model access to your data without retraining.

  1. 01

    Embeddings

    core

    Vector representations of text. Pick a model, normalize, store.

  2. 02

    Vector Databases

    core

    pgvector, Pinecone, Chroma, Weaviate. Pick by ops profile, not benchmarks.

  3. 03

    Chunking Strategies

    core

    Fixed, semantic, hierarchical. Bad chunking ruins good retrieval.

  4. 04

    Hybrid Search & Reranking

    recommended

    BM25 + dense retrieval + a cross-encoder reranker beats any single method.

  5. 05

    RAG Evaluation

    core

    Faithfulness, answer relevance, context precision — measure before you tune.
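The core retrieval step above reduces to cosine similarity over embeddings. A sketch with hand-made toy vectors — in practice the vectors come from an embedding model and live in one of the stores listed above:

```python
import math

def cosine(a, b) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, corpus, k: int = 2) -> list:
    """corpus: list of (chunk_text, embedding) pairs.
    Returns the k chunks most similar to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]
```

This is exactly what a vector database does at scale, plus indexing (HNSW, IVF) so it doesn't have to scan every vector.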

S04

Agents & Tool Use

Let the model take actions in the real world — and survive when it goes off-script.

  1. 01

    Tool Use (Function Calling)

    core

    Define tools with JSON schemas; let the model pick. The foundation of every agent.

  2. 02

    Agent Loops

    core

    ReAct, plan-then-execute, self-critique. Most production agents are simpler than you think.

  3. 03

    Multi-Agent Systems

    recommended

    Planner/worker, supervisor/swarm. When more agents help vs. when they just add cost.

  4. 04

    MCP (Model Context Protocol)

    recommended

    The emerging standard for letting agents talk to your tools.
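The tool-use loop above can be sketched in a few lines. Here `model` is any callable that returns either a tool request or a final answer; the `add` tool and the message shapes are illustrative, not any provider's real API:

```python
import json

# Hypothetical tool registry. In production the model picks tools via
# JSON-schema function calling; here the decision dict stands in for that.
TOOLS = {
    "add": lambda args: args["a"] + args["b"],
}

def agent_loop(model, user_msg: str, max_steps: int = 5):
    """Run model -> tool -> observation until the model answers."""
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        decision = model(messages)
        if "answer" in decision:
            return decision["answer"]
        # Execute the chosen tool, feed the result back as an observation.
        result = TOOLS[decision["tool"]](decision["args"])
        messages.append({"role": "tool",
                         "content": json.dumps({"result": result})})
    raise RuntimeError("agent exceeded max_steps")
```

The `max_steps` cap is the part people forget: it's what keeps an off-script agent from looping forever.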

S05

Evaluations

If you can't measure it, you can't ship it. Evals are the unsexy moat.

  1. 01

    Eval Design

    core

    Start with 20 hand-graded examples. Scale from there. Don't skip this step.

  2. 02

    LLM-as-Judge

    recommended

    Use a model to grade outputs. Calibrate against a human-labeled subset.

  3. 03

    Regression Evals in CI

    core

    Run your eval set on every prompt change. Block merges on quality drops.
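A regression-eval gate can be this small to start. `generate` stands in for your real model call, and the cases and threshold are illustrative; substring checks are the crudest scorer, but they catch real regressions:

```python
# Hypothetical eval set; real ones start as ~20 hand-graded examples.
EVAL_SET = [
    {"input": "2+2", "must_contain": "4"},
    {"input": "capital of France", "must_contain": "Paris"},
]

def run_evals(generate, cases, threshold: float = 0.9) -> float:
    """Score each case; fail the process (blocking the merge in CI)
    if the pass rate drops below threshold."""
    passed = sum(1 for c in cases
                 if c["must_contain"] in generate(c["input"]))
    rate = passed / len(cases)
    if rate < threshold:
        raise SystemExit(f"eval pass rate {rate:.0%} below {threshold:.0%}")
    return rate
```

Wire this into CI so every prompt change runs it, the same way unit tests gate code changes.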

S06

Fine-tuning

When prompting plateaus. Usually you can avoid it.

  1. 01

    When to Fine-tune

    core

Almost never first. Exhaust prompting + RAG + tool design before reaching for fine-tuning.

  2. 02

    LoRA & Adapter Methods

    recommended

    Cheap, fast, reversible. The default if you actually need to fine-tune.

  3. 03

    Dataset Curation

    recommended

    1000 great examples beat 100k mediocre ones. Spend the time on data.
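A first-pass curation step can be sketched as exact dedup plus length filtering — real pipelines layer near-dedup, quality scoring, and decontamination on top. Field names and thresholds here are illustrative:

```python
import hashlib

def curate(examples, min_len: int = 20, max_len: int = 4000) -> list:
    """Drop exact duplicates (case-insensitive) and degenerate lengths."""
    seen, kept = set(), []
    for ex in examples:
        text = (ex["prompt"] + "\n" + ex["completion"]).strip()
        if not (min_len <= len(text) <= max_len):
            continue  # too short to teach anything, or too long to trust
        h = hashlib.sha256(text.lower().encode()).hexdigest()
        if h in seen:
            continue  # exact duplicate
        seen.add(h)
        kept.append(ex)
    return kept
```

Most of the value in fine-tuning datasets comes from what you throw away, not what you keep.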

S07

Production

Latency, cost, observability, safety — what makes a demo a product.

  1. 01

    Inference & Streaming

    core

    TTFT vs. total latency, streaming tokens, batching, queueing.

  2. 02

    Cost Controls

    core

    Model tiering, caching, prompt compression, fallbacks. Costs compound fast.

  3. 03

    Observability

    core

    Trace every call. Log inputs, outputs, latencies, token counts, costs.

  4. 04

    Safety & Guardrails

    recommended

    Input validation, output filters, jailbreak resistance. Plan for adversarial users.

  5. 05

    A/B Testing Prompts

    optional

    Roll prompt changes like code changes — feature-flagged, measured, reversible.
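Model tiering, the first cost control above, can be sketched as a routing rule plus a cost function. The tier names, prices, and thresholds below are made up for illustration — not any provider's real pricing:

```python
# Illustrative per-1M-token prices; substitute your providers' real rates.
PRICES = {
    "small": {"in": 0.15, "out": 0.60},
    "large": {"in": 3.00, "out": 15.00},
}

def pick_tier(prompt: str, needs_reasoning: bool) -> str:
    """Route easy traffic to the cheap model; escalate only when needed."""
    return "large" if needs_reasoning or len(prompt) > 8000 else "small"

def cost_usd(tier: str, tokens_in: int, tokens_out: int) -> float:
    """Estimated cost of one call, given token counts from the response."""
    p = PRICES[tier]
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000
```

Logging this per call (see Observability above) is what turns "costs compound fast" from a surprise into a dashboard.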