Knowledge library for AI engineering practice. Complements the governance spine (AI-Governance, AI-security, AI-assurance) with practitioner profiles, tool landscape analysis, agent memory research, and cross-cutting synthesis.

Looking for a specific document? NAVIGATION.md is the complete file-level index — every document in one place, plus a question-driven finder and reading paths by goal. For an interactive version with live search and filters, open navigation.html in a browser.


Sections

| Directory | What it contains |
| --- | --- |
| foundations/ | 5 researcher profiles: theory, interpretability, benchmarks, quantization |
| frameworks/ | 5 researcher profiles: LangChain, LlamaIndex, fastai, HuggingFace, Instructor |
| applied/ | 4 researcher profiles: AI systems design, composable tooling, agentic research, community |
| synthesis/ | Cross-cutting meta-analysis of all 14 profiles |
| evals/ | Evaluation as a discipline: agentic benchmarks, LLM-as-judge reliability, eval design methodology |
| safety-evals/ | Frontier safety frameworks, dangerous-capability evals, and the elicitation/validity problem for assurance |
| reliability/ | Reliability as a measurable discipline: calibration & uncertainty, hallucination & factuality, robustness |
| interpretability/ | Interpretability as assurance evidence: methods & maturity, faithfulness & the verification gap, explainability vs. regulation |
| multimodal/ | The capability surface beyond text: multimodal perception, computer-use/GUI agents, embodied action |
| capability-trajectory/ | Where capability is going: scaling laws, the emergence debate, and capability forecasting for assurance |
| coding-agents/ | Coding-agent landscape, harness design, and productivity evidence |
| models/ | Model landscape: post-training and reasoning, open-weight ecosystem, inference economics |
| agent-protocols/ | Agent interoperability standards: MCP, A2A, ACP, AGENTS.md, payment protocols |
| memory-agents/ | Deep-dive on agent memory architectures, MCP hosting, evaluation frameworks |
| tools/ | Options analyses for 18 Responsible AI tool categories (security tooling surveys live in ../AI-security/) |
| claude-ecosystem/ | Review of 23 Claude/Claude Code ecosystem repositories |
| platforms/ | Cloud AI platform release intelligence (AWS Bedrock, GCP Vertex, Azure AI) |

Practitioner Profiles

Foundations — theory, research, interpretability

| Researcher | Cluster | Key work |
| --- | --- | --- |
| Alec Radford | Foundation models | GPT series, CLIP, Whisper |
| Chris Olah | Interpretability | Mechanistic interpretability, circuits, feature visualization |
| François Chollet | Benchmarks | ARC-AGI, Keras, intelligence vs. skill |
| Tim Dettmers | Efficiency | LLM.int8(), QLoRA, democratising fine-tuning |
| Lilian Weng | Theory & taxonomy | Agent taxonomy, RLHF, prompting techniques |

Frameworks — framework builders and tooling

| Researcher | Cluster | Key work |
| --- | --- | --- |
| Harrison Chase | Orchestration | LangChain, LangGraph, LangSmith |
| Jerry Liu | Retrieval | LlamaIndex, RAG architecture patterns |
| Jeremy Howard | Education | fastai, ULMFiT, nbdev |
| Thomas Wolf | Distribution | HuggingFace Transformers, Hub, PEFT |
| Jason Liu | Structured outputs | Instructor, Pydantic-based LLM validation |

Applied — practitioners, systems design, community

| Researcher | Cluster | Key work |
| --- | --- | --- |
| Chip Huyen | Systems | AI systems design for production, evaluation |
| Simon Willison | Practice | Composable tooling, datasette, prompt injection security |
| Andrej Karpathy | Agentic | Autonomous research automation, LLM-as-curator |
| Shawn Wang (swyx) | Community | AI Engineer identity, Latent Space podcast |

Cross-cutting

synthesis/ — Start here for the meta-level view. Seven documents asking what the 14 profiles collectively reveal about AI engineering as a field: the stack, the abstraction debate, the evaluation crisis, convergent principles, hard choices, and the empirical record. See synthesis/README.md for the reading guide.

evals/ — Evaluation as a discipline. The synthesis layer names evaluation the field’s central unsolved problem; this branch covers the agentic benchmark landscape and its failure modes, LLM-as-judge reliability, and eval design methodology (capability vs. propensity, statistical rigour, evals as assurance evidence).

safety-evals/ — The capability-side foundation for AI assurance. Frontier safety frameworks (RSP, Preparedness, FSF) as capability-threshold criteria, the dangerous-capability taxonomy (cyber, CBRN uplift, autonomy, persuasion, deception) and its evaluators, and the elicitation/validity problem that decides whether a safety-capability result is admissible evidence or merely testimony.

reliability/ — The quality attributes assurance attests to, treated as measurable quantities rather than adjectives: calibration and uncertainty (does the model know what it knows), hallucination and factuality (the faithfulness-vs-factuality split and how each is measured), and robustness (adversarial, distribution-shift, and prompt-sensitivity regimes). The throughline: every reliability number is conditional on a distribution and carries an expiry.

interpretability/ — Whether looking inside a model can serve as assurance evidence, treated limits-first: the method toolkit and its maturity (probing, sparse autoencoders, circuits, post-hoc attribution), the faithfulness problem and verification gap that decide whether an explanation is admissible, and the gap between a regulator-satisfying explanation and a faithful one. Distinct from the foundations/olah-research/ research profile — this is the evidentiary view.

multimodal/ — The capability surface beyond text-in/text-out: perception (vision, audio, video) and its new attack surfaces (cross-modal injection, synthetic-media provenance), computer-use/GUI agents and the sharply expanded action-risk surface, and embodied action. The throughline: each step beyond text widens the input and action surface, and a safety result proven on text does not transfer to pixels, screens, or actuators.

capability-trajectory/ — Where the frontier is going, not just where it is: the scaling laws that make loss predictable, the emergence debate over what stays unpredictable (mirage or real), and capability forecasting. The throughline: loss is predictable, specific capability is not — so forecasts are decision-support with wide error bars, good enough to pre-position controls before a threshold is crossed but never to certify it won’t be. This is the time dimension behind every other branch’s staleness requirement.

coding-agents/ — The AI-assisted SDLC: cross-vendor coding-agent landscape organised by interaction model, agent harness design (the loop, tools, context, permissions), and the empirical record on productivity impact — including the negative results.

models/ — The layer below the harness: how post-training and the reasoning turn shape deployed behaviour, the open-weight ecosystem and its licensing spectrum, and inference economics (why unit prices fall while agent bills grow).

agent-protocols/ — The standards layer forming around agents: MCP in working detail, A2A, ACP, AGENTS.md, llms.txt, and the emerging payment protocols — who governs each, adoption status, and the security properties that follow.

memory-agents/ — Deep-dive on how agents remember across sessions: landscape across enterprise platforms and open-source libraries, deployment patterns, evaluation frameworks, MCP server hosting, and the MemPalace architecture. For security controls over those agents see ../AI-security/agentic/.


Disambiguation

| If you want… | Go to |
| --- | --- |
| Practitioner research profiles | research/ (here) |
| AI governance control templates | AI-Governance/ |
| International AI standards (ISO, NIST, EU AI Act) | AI-Governance/standards/ |
| Threat frameworks (MITRE ATLAS, NCSC/CISA) | AI-security/ |
| Agentic security libraries and controls | AI-security/agentic/ |
| Assurance and audit standards | AI-assurance/ |
| AI governance tooling ideas and PRDs | ai-governance-tooling/ |