AI-agents-playbook
Model training and evaluation
Landing page for notes on:
evaluation methods (offline, online, red-team)
reliability and calibration for factual outputs
multi-step agent evaluation (tool-use regressions, state carryover)
Notes
Fluency Is Not Factuality
Theory of mind in LLMs — what benchmarks test (and what they don’t)
Orders of intentionality and recursive mindreading in LLM evaluation