Knowledge library for AI engineering practice. Complements the governance spine (AI-Governance, AI-security, AI-assurance) with practitioner profiles, tool landscape analysis, agent memory research, and cross-cutting synthesis.
Looking for a specific document?
NAVIGATION.md is the complete file-level index: every document in one place, plus a question-driven finder and reading paths by goal. For an interactive version with live search and filters, open navigation.html in a browser.
Sections
| Directory | What it contains |
|---|---|
| foundations/ | 5 researcher profiles: theory, interpretability, benchmarks, quantization |
| frameworks/ | 5 researcher profiles: LangChain, LlamaIndex, fastai, HuggingFace, Instructor |
| applied/ | 4 researcher profiles: AI systems design, composable tooling, agentic research, community |
| synthesis/ | Cross-cutting meta-analysis of all 14 profiles |
| evals/ | Evaluation as a discipline: agentic benchmarks, LLM-as-judge reliability, eval design methodology |
| safety-evals/ | Frontier safety frameworks, dangerous-capability evals, and the elicitation/validity problem for assurance |
| reliability/ | Reliability as a measurable discipline: calibration & uncertainty, hallucination & factuality, robustness |
| interpretability/ | Interpretability as assurance evidence: methods & maturity, faithfulness & the verification gap, explainability vs. regulation |
| multimodal/ | The capability surface beyond text: multimodal perception, computer-use/GUI agents, embodied action |
| capability-trajectory/ | Where capability is going: scaling laws, the emergence debate, and capability forecasting for assurance |
| coding-agents/ | Coding-agent landscape, harness design, and productivity evidence |
| models/ | Model landscape: post-training and reasoning, open-weight ecosystem, inference economics |
| agent-protocols/ | Agent interoperability standards: MCP, A2A, ACP, AGENTS.md, payment protocols |
| memory-agents/ | Deep-dive on agent memory architectures, MCP hosting, evaluation frameworks |
| tools/ | Options analyses for 18 Responsible AI tool categories (security tooling surveys live in ../AI-security/) |
| claude-ecosystem/ | Review of 23 Claude/Claude Code ecosystem repositories |
| platforms/ | Cloud AI platform release intelligence (AWS Bedrock, GCP Vertex, Azure AI) |
Practitioner Profiles
Foundations — theory, research, interpretability
| Researcher | Cluster | Key work |
|---|---|---|
| Alec Radford | Foundation models | GPT series, CLIP, Whisper |
| Chris Olah | Interpretability | Mechanistic interpretability, circuits, feature visualization |
| François Chollet | Benchmarks | ARC-AGI, Keras, intelligence vs. skill |
| Tim Dettmers | Efficiency | LLM.int8(), QLoRA, democratising fine-tuning |
| Lilian Weng | Theory & taxonomy | Agent taxonomy, RLHF, prompting techniques |
Frameworks — framework builders and tooling
| Researcher | Cluster | Key work |
|---|---|---|
| Harrison Chase | Orchestration | LangChain, LangGraph, LangSmith |
| Jerry Liu | Retrieval | LlamaIndex, RAG architecture patterns |
| Jeremy Howard | Education | fastai, ULMFiT, nbdev |
| Thomas Wolf | Distribution | HuggingFace Transformers, Hub, PEFT |
| Jason Liu | Structured outputs | Instructor, Pydantic-based LLM validation |
Applied — practitioners, systems design, community
| Researcher | Cluster | Key work |
|---|---|---|
| Chip Huyen | Systems | AI systems design for production, evaluation |
| Simon Willison | Practice | Composable tooling, Datasette, prompt injection security |
| Andrej Karpathy | Agentic | Autonomous research automation, LLM-as-curator |
| Shawn Wang (swyx) | Community | AI Engineer identity, Latent Space podcast |
Cross-cutting
synthesis/ — Start here for the meta-level view. Seven documents asking what
the 14 profiles collectively reveal about AI engineering as a field: the stack, the abstraction
debate, the evaluation crisis, convergent principles, hard choices, and the empirical record.
See synthesis/README.md for the reading guide.
evals/ — Evaluation as a discipline. The synthesis layer names evaluation the field’s central unsolved problem; this branch covers the agentic benchmark landscape and its failure modes, LLM-as-judge reliability, and eval design methodology (capability vs. propensity, statistical rigour, evals as assurance evidence).
safety-evals/ — The capability-side foundation for AI assurance. Frontier safety frameworks (RSP, Preparedness, FSF) as capability-threshold criteria, the dangerous-capability taxonomy (cyber, CBRN uplift, autonomy, persuasion, deception) and its evaluators, and the elicitation/validity problem that decides whether a safety-capability result is admissible evidence or merely testimony.
reliability/ — The quality attributes assurance attests to, treated as measurable quantities rather than adjectives: calibration and uncertainty (does the model know what it knows), hallucination and factuality (the faithfulness-vs-factuality split and how each is measured), and robustness (adversarial, distribution-shift, and prompt-sensitivity regimes). The throughline: every reliability number is conditional on a distribution and carries an expiry.
interpretability/ — Whether looking inside a model can serve as assurance
evidence, treated limits-first: the method toolkit and its maturity (probing, sparse autoencoders,
circuits, post-hoc attribution), the faithfulness problem and verification gap that decide whether an
explanation is admissible, and the gap between a regulator-satisfying explanation and a faithful one.
Distinct from the foundations/olah-research/ research profile — this
is the evidentiary view.
multimodal/ — The capability surface beyond text-in/text-out: perception (vision, audio, video) and its new attack surfaces (cross-modal injection, synthetic-media provenance), computer-use/GUI agents and the sharply expanded action-risk surface, and embodied action. The throughline: each step beyond text widens the input and action surface, and a safety result proven on text does not transfer to pixels, screens, or actuators.
capability-trajectory/ — Where the frontier is going, not just where it is: the scaling laws that make loss predictable, the emergence debate over what stays unpredictable (mirage or real), and capability forecasting. The throughline: loss is predictable, specific capability is not — so forecasts are decision-support with wide error bars, good enough to pre-position controls before a threshold is crossed but never to certify it won’t be. This is the time dimension behind every other branch’s staleness requirement.
coding-agents/ — The AI-assisted SDLC: cross-vendor coding-agent landscape organised by interaction model, agent harness design (the loop, tools, context, permissions), and the empirical record on productivity impact — including the negative results.
models/ — The layer below the harness: how post-training and the reasoning turn shape deployed behaviour, the open-weight ecosystem and its licensing spectrum, and inference economics (why unit prices fall while agent bills grow).
agent-protocols/ — The standards layer forming around agents: MCP in working detail, A2A, ACP, AGENTS.md, llms.txt, and the emerging payment protocols — who governs each, adoption status, and the security properties that follow.
memory-agents/ — Deep-dive on how agents remember across sessions:
landscape across enterprise platforms and open-source libraries, deployment patterns, evaluation
frameworks, MCP server hosting, and the MemPalace architecture. For security controls over
those agents see ../AI-security/agentic/.
Disambiguation
| If you want… | Go to |
|---|---|
| Practitioner research profiles | research/ (here) |
| AI governance control templates | AI-Governance/ |
| International AI standards (ISO, NIST, EU AI Act) | AI-Governance/standards/ |
| Threat frameworks (MITRE ATLAS, NCSC/CISA) | AI-security/ |
| Agentic security libraries and controls | AI-security/agentic/ |
| Assurance and audit standards | AI-assurance/ |
| AI governance tooling ideas and PRDs | ai-governance-tooling/ |