RM-002 · Intermediate · 4–6 months

AI Engineer

Build production LLM systems: prompting, RAG, evals, agents, fine-tuning, and the ops layer underneath.

An AI engineer is the bridge between research-grade models and shipping product. You don't need to train models from scratch — you need to compose them, evaluate them, and run them reliably in production.

S01

Foundations

Just enough ML and transformer intuition to reason about how LLMs actually behave.

  1. 01

    ML Basics

    core

    Training vs. inference, loss, gradient descent. Skip the calculus; keep the intuition.

  2. 02

    Transformer Intuition

    core

    Attention, embeddings, autoregressive generation. Read the paper once, then move on.

  3. 03

    Tokenization & Context

    core

    Tokens, BPE, context windows, why your prompt got truncated.

  4. 04

    Sampling Parameters

    core

    Temperature, top-p, top-k, repetition penalty — what they actually do.
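A minimal sketch of what temperature and top-p (nucleus) sampling do, over hand-made toy logits — not any provider's actual implementation:

```python
import math
import random

def sample_token(logits: dict, temperature: float = 1.0,
                 top_p: float = 1.0, rng=None) -> str:
    """Toy next-token sampler: temperature rescales, top-p truncates."""
    rng = rng or random.Random()
    # Temperature divides logits before softmax: <1 sharpens, >1 flattens.
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(l - m) for t, l in scaled.items()}
    total = sum(exps.values())
    probs = {t: e / total for t, e in exps.items()}
    # Top-p: keep the smallest set of tokens whose cumulative mass >= top_p.
    kept, mass = {}, 0.0
    for t, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[t] = p
        mass += p
        if mass >= top_p:
            break
    # Renormalize the surviving tokens and draw one.
    z = sum(kept.values())
    r, acc = rng.random(), 0.0
    for t, p in kept.items():
        acc += p / z
        if r <= acc:
            return t
    return t  # float-rounding fallback: last kept token
```

Low temperature makes the sampler near-deterministic; a small top-p cuts the long tail regardless of temperature.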

S02

Prompt Engineering

The patterns that consistently lift quality without fine-tuning anything.

  1. 01

    Prompt Patterns

    core

    Zero-shot, few-shot, role prompting, chain-of-thought, and when each helps.

  2. 02

    Structured Output

    core

    JSON modes, response schemas, tool-call coercion — making LLMs return clean data.

  3. 03

    Prompt Caching

    recommended

    Cache long system prompts and shared context. Pay once, reuse for hours.

  4. 04

    Context Engineering

    core

    Treat the context window as a UX surface. Order matters; recency matters; relevance matters.
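The context-engineering idea above can be sketched as a budgeted assembly step: system prompt first, then relevant chunks, then the most recent history. `count_tokens` is a stand-in heuristic — real systems use the model's tokenizer:

```python
def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough ~4 chars/token heuristic, not a real tokenizer

def build_context(system: str, chunks: list, history: list,
                  budget: int = 1000) -> str:
    """Assemble a prompt under a token budget, dropping the least
    valuable pieces first: old history before relevant chunks."""
    parts = [system]
    remaining = budget - count_tokens(system)
    for chunk in chunks:  # assumed pre-sorted by relevance
        cost = count_tokens(chunk)
        if cost > remaining:
            break
        parts.append(chunk)
        remaining -= cost
    kept = []
    for msg in reversed(history):  # walk back from the newest turn
        cost = count_tokens(msg)
        if cost > remaining:
            break
        kept.append(msg)
        remaining -= cost
    return "\n\n".join(parts + list(reversed(kept)))
```

When the budget is tight, the oldest history falls off first — which is usually the right trade.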

S03

Retrieval-Augmented Generation

Give the model access to your data without retraining.

  1. 01

    Embeddings

    core

    Vector representations of text. Pick a model, normalize, store.

  2. 02

    Vector Databases

    core

    pgvector, Pinecone, Chroma, Weaviate. Pick by ops profile, not benchmarks.

  3. 03

    Chunking Strategies

    core

    Fixed, semantic, hierarchical. Bad chunking ruins good retrieval.

  4. 04

    Hybrid Search & Reranking

    recommended

    BM25 + dense retrieval + a cross-encoder reranker beats any single method.

  5. 05

    RAG Evaluation

    core

    Faithfulness, answer relevance, context precision — measure before you tune.
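The core retrieval step above reduces to cosine similarity over embeddings. A sketch with hand-made toy vectors — in practice the vectors come from an embedding model and live in one of the stores listed above:

```python
import math

def cosine(a, b) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, corpus, k: int = 2) -> list:
    """corpus: list of (chunk_text, embedding) pairs.
    Returns the k chunks most similar to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]
```

This is exactly what a vector database does at scale, plus indexing (HNSW, IVF) so it doesn't have to scan every vector.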

S04

Agents & Tool Use

Let the model take actions in the real world — and survive when it goes off-script.

  1. 01

    Tool Use (Function Calling)

    core

    Define tools with JSON schemas; let the model pick. The foundation of every agent.

  2. 02

    Agent Loops

    core

    ReAct, plan-then-execute, self-critique. Most production agents are simpler than you think.

  3. 03

    Multi-Agent Systems

    recommended

    Planner/worker, supervisor/swarm. When more agents help vs. when they just add cost.

  4. 04

    MCP (Model Context Protocol)

    recommended

    The emerging standard for letting agents talk to your tools.
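The tool-use loop above can be sketched in a few lines. Here `model` is any callable that returns either a tool request or a final answer; the `add` tool and the message shapes are illustrative, not any provider's real API:

```python
import json

# Hypothetical tool registry. In production the model picks tools via
# JSON-schema function calling; here the decision dict stands in for that.
TOOLS = {
    "add": lambda args: args["a"] + args["b"],
}

def agent_loop(model, user_msg: str, max_steps: int = 5):
    """Run model -> tool -> observation until the model answers."""
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        decision = model(messages)
        if "answer" in decision:
            return decision["answer"]
        # Execute the chosen tool, feed the result back as an observation.
        result = TOOLS[decision["tool"]](decision["args"])
        messages.append({"role": "tool",
                         "content": json.dumps({"result": result})})
    raise RuntimeError("agent exceeded max_steps")
```

The `max_steps` cap is the part people forget: it's what keeps an off-script agent from looping forever.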

S05

Evaluations

If you can't measure it, you can't ship it. Evals are the unsexy moat.

  1. 01

    Eval Design

    core

    Start with 20 hand-graded examples. Scale from there. Don't skip this step.

  2. 02

    LLM-as-Judge

    recommended

    Use a model to grade outputs. Calibrate against a human-labeled subset.

  3. 03

    Regression Evals in CI

    core

    Run your eval set on every prompt change. Block merges on quality drops.
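A regression-eval gate can be this small to start. `generate` stands in for your real model call, and the cases and threshold are illustrative; substring checks are the crudest scorer, but they catch real regressions:

```python
# Hypothetical eval set; real ones start as ~20 hand-graded examples.
EVAL_SET = [
    {"input": "2+2", "must_contain": "4"},
    {"input": "capital of France", "must_contain": "Paris"},
]

def run_evals(generate, cases, threshold: float = 0.9) -> float:
    """Score each case; fail the process (blocking the merge in CI)
    if the pass rate drops below threshold."""
    passed = sum(1 for c in cases
                 if c["must_contain"] in generate(c["input"]))
    rate = passed / len(cases)
    if rate < threshold:
        raise SystemExit(f"eval pass rate {rate:.0%} below {threshold:.0%}")
    return rate
```

Wire this into CI so every prompt change runs it, the same way unit tests gate code changes.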

S06

Fine-tuning

When prompting plateaus. Usually you can avoid it.

  1. 01

    When to Fine-tune

    core

Almost never first. Exhaust prompting + RAG + tool design before reaching for fine-tuning.

  2. 02

    LoRA & Adapter Methods

    recommended

    Cheap, fast, reversible. The default if you actually need to fine-tune.

  3. 03

    Dataset Curation

    recommended

    1000 great examples beat 100k mediocre ones. Spend the time on data.
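A first-pass curation step can be sketched as exact dedup plus length filtering — real pipelines layer near-dedup, quality scoring, and decontamination on top. Field names and thresholds here are illustrative:

```python
import hashlib

def curate(examples, min_len: int = 20, max_len: int = 4000) -> list:
    """Drop exact duplicates (case-insensitive) and degenerate lengths."""
    seen, kept = set(), []
    for ex in examples:
        text = (ex["prompt"] + "\n" + ex["completion"]).strip()
        if not (min_len <= len(text) <= max_len):
            continue  # too short to teach anything, or too long to trust
        h = hashlib.sha256(text.lower().encode()).hexdigest()
        if h in seen:
            continue  # exact duplicate
        seen.add(h)
        kept.append(ex)
    return kept
```

Most of the value in fine-tuning datasets comes from what you throw away, not what you keep.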

S07

Production

Latency, cost, observability, safety — what makes a demo a product.

  1. 01

    Inference & Streaming

    core

    TTFT vs. total latency, streaming tokens, batching, queueing.

  2. 02

    Cost Controls

    core

    Model tiering, caching, prompt compression, fallbacks. Costs compound fast.

  3. 03

    Observability

    core

    Trace every call. Log inputs, outputs, latencies, token counts, costs.

  4. 04

    Safety & Guardrails

    recommended

    Input validation, output filters, jailbreak resistance. Plan for adversarial users.

  5. 05

    A/B Testing Prompts

    optional

    Roll prompt changes like code changes — feature-flagged, measured, reversible.
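Model tiering, the first cost control above, can be sketched as a routing rule plus a cost function. The tier names, prices, and thresholds below are made up for illustration — not any provider's real pricing:

```python
# Illustrative per-1M-token prices; substitute your providers' real rates.
PRICES = {
    "small": {"in": 0.15, "out": 0.60},
    "large": {"in": 3.00, "out": 15.00},
}

def pick_tier(prompt: str, needs_reasoning: bool) -> str:
    """Route easy traffic to the cheap model; escalate only when needed."""
    return "large" if needs_reasoning or len(prompt) > 8000 else "small"

def cost_usd(tier: str, tokens_in: int, tokens_out: int) -> float:
    """Estimated cost of one call, given token counts from the response."""
    p = PRICES[tier]
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000
```

Logging this per call (see Observability above) is what turns "costs compound fast" from a surprise into a dashboard.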