Knowledge library for AI engineering practice. Complements the governance spine (AI-Governance, AI-security, AI-assurance) with practitioner profiles, tool landscape analysis, agent memory research, and cross-cutting synthesis.

Looking for a specific document? NAVIGATION.md is the complete file-level index — every document in one place, plus a question-driven finder and reading paths by goal. For an interactive version with live search and filters, open navigation.html in a browser.

Sections

| Directory | What it contains |
|---|---|
| `foundations/` | 5 researcher profiles — theory, interpretability, benchmarks, quantization |
| `frameworks/` | 5 researcher profiles — LangChain, LlamaIndex, fastai, HuggingFace, Instructor |
| `applied/` | 4 researcher profiles — AI systems design, composable tooling, agentic research, community |
| `synthesis/` | Cross-cutting meta-analysis of all 14 profiles |
| `evals/` | Evaluation as a discipline — agentic benchmarks, LLM-as-judge reliability, eval design methodology |
| `safety-evals/` | Frontier safety frameworks, dangerous-capability evals, and the elicitation/validity problem for assurance |
| `reliability/` | Reliability as a measurable discipline — calibration & uncertainty, hallucination & factuality, robustness |
| `interpretability/` | Interpretability as assurance evidence — methods & maturity, faithfulness & the verification gap, explainability vs. regulation |
| `multimodal/` | The capability surface beyond text — multimodal perception, computer-use/GUI agents, embodied action |
| `capability-trajectory/` | Where capability is going — scaling laws, the emergence debate, and capability forecasting for assurance |
| `coding-agents/` | Coding-agent landscape, harness design, and productivity evidence |
| `models/` | Model landscape — post-training and reasoning, open-weight ecosystem, inference economics |
| `agent-protocols/` | Agent interoperability standards — MCP, A2A, ACP, AGENTS.md, payment protocols |
| `memory-agents/` | Deep-dive on agent memory architectures, MCP hosting, evaluation frameworks |
| `tools/` | Options analyses for 18 Responsible AI tool categories (security tooling surveys live in `../AI-security/`) |
| `claude-ecosystem/` | Review of 23 Claude/Claude Code ecosystem repositories |
| `platforms/` | Cloud AI platform release intelligence (AWS Bedrock, GCP Vertex, Azure AI) |

Practitioner Profiles

Foundations — theory, research, interpretability

| Researcher | Cluster | Key work |
|---|---|---|
| Alec Radford | Foundation models | GPT series, CLIP, Whisper |
| Chris Olah | Interpretability | Mechanistic interpretability, circuits, feature visualization |
| François Chollet | Benchmarks | ARC-AGI, Keras, intelligence vs. skill |
| Tim Dettmers | Efficiency | LLM.int8(), QLoRA, democratising fine-tuning |
| Lilian Weng | Theory & taxonomy | Agent taxonomy, RLHF, prompting techniques |

Frameworks — framework builders and tooling

| Researcher | Cluster | Key work |
|---|---|---|
| Harrison Chase | Orchestration | LangChain, LangGraph, LangSmith |
| Jerry Liu | Retrieval | LlamaIndex, RAG architecture patterns |
| Jeremy Howard | Education | fastai, ULMFiT, nbdev |
| Thomas Wolf | Distribution | HuggingFace Transformers, Hub, PEFT |
| Jason Liu | Structured outputs | Instructor, Pydantic-based LLM validation |

Applied — practitioners, systems design, community

| Researcher | Cluster | Key work |
|---|---|---|
| Chip Huyen | Systems | AI systems design for production, evaluation |
| Simon Willison | Practice | Composable tooling, Datasette, prompt injection security |
| Andrej Karpathy | Agentic | Autonomous research automation, LLM-as-curator |
| Shawn Wang (swyx) | Community | AI Engineer identity, Latent Space podcast |

Cross-cutting

synthesis/ — Start here for the meta-level view. Seven documents asking what the 14 profiles collectively reveal about AI engineering as a field: the stack, the abstraction debate, the evaluation crisis, convergent principles, hard choices, and the empirical record. See synthesis/README.md for the reading guide.

evals/ — Evaluation as a discipline. The synthesis layer names evaluation the field’s central unsolved problem; this branch covers the agentic benchmark landscape and its failure modes, LLM-as-judge reliability, and eval design methodology (capability vs. propensity, statistical rigour, evals as assurance evidence).

safety-evals/ — The capability-side foundation for AI assurance. Frontier safety frameworks (RSP, Preparedness, FSF) as capability-threshold criteria, the dangerous-capability taxonomy (cyber, CBRN uplift, autonomy, persuasion, deception) and its evaluators, and the elicitation/validity problem that decides whether a safety-capability result is admissible evidence or merely testimony.

reliability/ — The quality attributes assurance attests to, treated as measurable quantities rather than adjectives: calibration and uncertainty (does the model know what it knows), hallucination and factuality (the faithfulness-vs-factuality split and how each is measured), and robustness (adversarial, distribution-shift, and prompt-sensitivity regimes). The throughline: every reliability number is conditional on a distribution and carries an expiry.
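
As a concrete handle on the calibration question above, one widely used summary is expected calibration error (ECE): group predictions by stated confidence and take the size-weighted average gap between mean confidence and observed accuracy in each group. The sketch below is illustrative only. The function name, the equal-width binning, and the toy inputs are assumptions made for this example, not something taken from the reliability/ documents, which may use other metrics.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted mean of |avg confidence - accuracy| over equal-width bins.

    confidences: floats in [0, 1], the model's stated confidence per answer.
    correct:     booleans, whether each answer was actually right.
    Hypothetical helper for illustration; equal-width binning is one choice
    among several.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to a confidence bin; 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A model that says 0.9 on four answers and gets three right has a gap of 0.15 on that sample. In keeping with the throughline above, the figure describes only the evaluated distribution.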

interpretability/ — Whether looking inside a model can serve as assurance evidence, treated limits-first: the method toolkit and its maturity (probing, sparse autoencoders, circuits, post-hoc attribution), the faithfulness problem and verification gap that decide whether an explanation is admissible, and the gap between a regulator-satisfying explanation and a faithful one. Distinct from the foundations/olah-research/ research profile — this is the evidentiary view.

multimodal/ — The capability surface beyond text-in/text-out: perception (vision, audio, video) and its new attack surfaces (cross-modal injection, synthetic-media provenance), computer-use/GUI agents and the sharply expanded action-risk surface, and embodied action. The throughline: each step beyond text widens the input and action surface, and a safety result proven on text does not transfer to pixels, screens, or actuators.

capability-trajectory/ — Where the frontier is going, not just where it is: the scaling laws that make loss predictable, the emergence debate over what stays unpredictable (mirage or real), and capability forecasting. The throughline: loss is predictable, specific capability is not — so forecasts are decision-support with wide error bars, good enough to pre-position controls before a threshold is crossed but never to certify it won’t be. This is the time dimension behind every other branch’s staleness requirement.

coding-agents/ — The AI-assisted SDLC: cross-vendor coding-agent landscape organised by interaction model, agent harness design (the loop, tools, context, permissions), and the empirical record on productivity impact — including the negative results.

models/ — The layer below the harness: how post-training and the reasoning turn shape deployed behaviour, the open-weight ecosystem and its licensing spectrum, and inference economics (why unit prices fall while agent bills grow).
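
One plausible mechanism behind falling unit prices and growing agent bills is that an agent loop re-sends an ever larger context on every step, so billed tokens grow much faster than prices drop. The arithmetic sketch below assumes exactly that. The function name, the token counts, and the per-million-token prices are all hypothetical, and the models/ documents may attribute the effect to other factors.

```python
def agent_task_cost(steps, base_context_tokens, tool_tokens_per_step,
                    output_tokens_per_step, input_price_per_mtok,
                    output_price_per_mtok):
    """Cost of one task when each step re-sends the accumulated context.

    Hypothetical cost model for illustration: input is billed on the full
    context at every step, and each step's output plus tool result is
    appended to the context for the next step.
    """
    total_input = 0
    total_output = 0
    context = base_context_tokens
    for _ in range(steps):
        total_input += context                  # whole context billed again
        total_output += output_tokens_per_step
        context += output_tokens_per_step + tool_tokens_per_step
    return (total_input * input_price_per_mtok
            + total_output * output_price_per_mtok) / 1_000_000


# Single chat call at older, higher prices (all numbers are made up).
single_call = agent_task_cost(1, 2000, 0, 500, 10.0, 30.0)

# 20-step agent run at half the unit price.
agent_run = agent_task_cost(20, 2000, 1000, 500, 5.0, 15.0)

print(f"single call: ${single_call:.3f}, agent run: ${agent_run:.3f}")
```

Under these assumed numbers the single call costs $0.035 and the agent run $1.775, about 50 times more even though every token is half the price.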

agent-protocols/ — The standards layer forming around agents: MCP in working detail, A2A, ACP, AGENTS.md, llms.txt, and the emerging payment protocols — who governs each, adoption status, and the security properties that follow.

memory-agents/ — Deep-dive on how agents remember across sessions: landscape across enterprise platforms and open-source libraries, deployment patterns, evaluation frameworks, MCP server hosting, and the MemPalace architecture. For security controls over those agents see ../AI-security/agentic/.

Disambiguation

| If you want… | Go to |
|---|---|
| Practitioner research profiles | `research/` (here) |
| AI governance control templates | `AI-Governance/` |
| International AI standards (ISO, NIST, EU AI Act) | `AI-Governance/standards/` |
| Threat frameworks (MITRE ATLAS, NCSC/CISA) | `AI-security/` |
| Agentic security libraries and controls | `AI-security/agentic/` |
| Assurance and audit standards | `AI-assurance/` |
| AI governance tooling ideas and PRDs | `ai-governance-tooling/` |