Orders of Intentionality and Recursive Mindreading: Definitions and Use in LLM Evaluation


This page is a terminology + measurement reference. It defines what “orders of intentionality” / “recursive mindreading” mean in the cognitive-science literature and summarizes how these constructs are operationalized in task protocols used with humans and (more recently) LLMs.

It does not claim that LLMs implement human Theory of Mind mechanisms; it only describes task families and the nesting structure that those tasks require.

Core terms (use consistently)

Propositional attitude: a mental-state relation (believes, knows, wants, intends) between an agent and a proposition.
Order of intentionality: the number of nested propositional attitudes in a statement.
Recursive mindreading: representing a mental state about a mental state (order ≥ 2).
Mentalising: answering a question that requires representing an agent's mental state rather than only the state of the world.

Counting conventions (avoid off-by-one confusion)

“Order” is a structural property of the representation being queried, but papers can count levels differently.

Two common conventions:

1) Structural convention (used in this page): Level N means N nested attitude operators around the base proposition P, so Level 0 is a bare fact, Level 1 is Att(A, P), and Level 2 is Att(A, Att(B, P)).

2) Task convention used in some experimental designs: the participant's own stance ("I believe that ...") is counted as the first order, so a statement that is Level N in the structural convention may be labeled order N+1 in those designs.

Best practice: when you say “Level N”, always state your counting convention and give the corresponding formal form.
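Stating the convention together with its formal form can be automated. A minimal sketch under the structural convention used in this page (the `formal_form` helper and the agent letters are illustrative, not taken from any cited paper):

```python
def formal_form(n: int, agents: str = "ABCDE") -> str:
    """Structural convention: Level n wraps n attitude operators around P."""
    form = "P"
    for agent in reversed(agents[:n]):
        form = f"Att({agent}, {form})"
    return form

print(formal_form(2))  # Att(A, Att(B, P))
```

Reporting the output of such a helper alongside "Level N" removes any ambiguity about which counting convention is in use.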

Concept (what “orders” measure)

Orders of intentionality quantify the depth of nested propositional attitudes. For example, "A believes P" involves one order, while "A believes that B believes P" involves two.

Lewis et al. define order as the number of distinct mind states involved in the statement and use this to parameterize task difficulty and response times across levels. (Lewis et al., 2017)

Levels 0–5 (structural definition + what each level requires)

Notation: Att(X, Q) denotes a propositional attitude (believes, knows, wants, intends) held by agent X toward content Q; P is a base proposition about the world; A, B, C, D, E name distinct agents.

These are structural descriptions of nesting depth, not claims about cognitive mechanism.
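The nesting forms below can be made concrete as a small recursive data structure. A sketch under the structural convention (the `Att` class and `order` function are hypothetical illustrations, not any published formalism):

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Att:
    """One propositional attitude: `agent` holds `kind` toward `content`."""
    agent: str
    content: Union["Att", str]  # a nested attitude, or a base proposition P
    kind: str = "believes"

def order(x: Union[Att, str]) -> int:
    """Structural order = number of attitude operators stacked above P."""
    return 0 if isinstance(x, str) else 1 + order(x.content)

level2 = Att("A", Att("B", "the keys are in the drawer"))
print(order(level2))  # 2
```

The base case (a bare string) is Level 0; each `Att` wrapper adds exactly one order, which is the counting rule this page uses throughout.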

Figure 1 — Structural nesting of orders of intentionality (levels 0–5): nesting forms and example sentences.

Level 0 — No mental-state embedding (baseline facts)

Formal form: P
Task requirement (structural): track story/world propositions without representing any agent’s belief/knowledge state.
Why it appears in experiments: matched factual-memory controls can be used to separate narrative memory demands from mentalising demands. (Lewis et al., 2017)

Example (illustrative): “The keys are in the drawer.”

Level 1 — First-order (one attitude about P)

Formal form: Att(A, P) (e.g., “A believes P”).
Task requirement (structural): answer relative to a single agent-indexed mental state rather than objective reality when the two diverge.

Classic paradigm link: false-belief tasks test whether an agent’s belief can be tracked when it diverges from reality. (Wimmer & Perner, 1983; Apperly, 2011)
Known confound (humans): “curse of knowledge” effects—knowledge of reality can bias belief attribution under sensitive measures. (Birch & Bloom, 2007)

Example (illustrative): “A believes the keys are in the drawer.”
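The Level-1 requirement (answer from A's belief state, not from reality) can be illustrated with a toy state tracker in which an agent's belief updates only when the agent witnesses the event. This is a deliberately simplified sketch, not any published task's implementation:

```python
# Toy false-belief setup: beliefs update only for witnessed events
# (a hypothetical simplification for illustration).
reality = {"keys": "drawer"}
beliefs = {"A": {"keys": "drawer"}}       # A saw the keys put away

def move(obj: str, dest: str, witnesses: list) -> None:
    """Move an object; only listed witnesses update their belief."""
    reality[obj] = dest
    for agent in witnesses:
        beliefs[agent][obj] = dest

move("keys", "shelf", witnesses=[])       # the move happens while A is away
print(beliefs["A"]["keys"])               # drawer — the correct Level-1 answer
```

Answering the Level-1 question from `reality` instead of `beliefs["A"]` is exactly the reality-bias error that false-belief designs are built to detect.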

Level 2 — Second-order (one embedding)

Formal form: Att(A, Att(B, P)) (e.g., “A believes that B believes P”).
Task requirement (structural): keep two bound agent models and preserve scope: the question is about A’s representation of B’s attitude toward P.

Empirical note (humans): higher orders tend to increase task demands (e.g., longer response times and/or reduced accuracy under controlled designs), relative to matched factual controls. (Lewis et al., 2017)

Example (illustrative): “A believes that B believes the keys are in the drawer.”
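Scope preservation at Level 2 means reading A's model of B, never B's actual belief. A minimal sketch with a hypothetical two-level belief store:

```python
# beliefs[X] holds X's own world model; beliefs[X]["model_of"][Y] holds
# X's model of Y's beliefs (a hypothetical nested store).
beliefs = {
    "B": {"keys": "shelf"},                           # B saw the keys moved
    "A": {"keys": "drawer",
          "model_of": {"B": {"keys": "drawer"}}},     # A's model of B is stale
}

def second_order(a: str, b: str, obj: str) -> str:
    """Answer Att(A, Att(B, P)) from A's model of B, not from B itself."""
    return beliefs[a]["model_of"][b][obj]

print(second_order("A", "B", "keys"))  # drawer, though B actually believes shelf
```

The scope error the task probes for would be returning `beliefs["B"]["keys"]` (B's actual belief) when asked about A's representation of B.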

Level 3 — Third-order (two embeddings)

Formal form: Att(A, Att(B, Att(C, P)))
Task requirement (structural): maintain three nested, agent-bound attitudes with disciplined updates about who had access to which information.

Empirical note (humans): Lewis et al. report reaction-time increases with intentionality order for mentalising questions, using matched factual questions as controls. (Lewis et al., 2017)

Example (illustrative): “A believes that B believes that C believes P.”

Level 4 — Fourth-order (three embeddings)

Formal form: Att(A, Att(B, Att(C, Att(D, P))))
Task requirement (structural): maintain four nested attitudes with strict identity binding and scope control.

Capacity note (humans): increasing order is treated as progressively more demanding, and performance is sensitive to task design and controls. (Lewis et al., 2017; Wilson et al., 2023)

Example (illustrative): “A believes that B believes that C believes that D believes P.”

Level 5 — Fifth-order (four embeddings)

Formal form: Att(A, Att(B, Att(C, Att(D, Att(E, P)))))
Task requirement (structural): maintain five nested attitudes with accurate scope control across agents.

What is supported about human limits: Lewis et al. state that, in normal adults, capacity “reaches an asymptotic limit at around fifth order intentionality,” with relatively few individuals performing successfully at higher orders (citing Kinderman et al., 1998; Stiller & Dunbar, 2007; Powell et al., 2010). (Lewis et al., 2017)

Recursive-task sensitivity: Wilson et al. show that performance on recursive mindreading tasks depends strongly on task design and controls and provide evidence against the idea that recursive mindreading is broadly exempt from limitations on recursive thinking. (Wilson et al., 2023)

Example (illustrative): “A believes that B believes that C believes that D believes that E believes P.”
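The example sentences at every order follow one recursive template, so they can be generated programmatically. A sketch under the structural convention (the `sentence` helper is illustrative):

```python
def sentence(n: int, prop: str = "the keys are in the drawer",
             agents: str = "ABCDE") -> str:
    """Build an order-n example sentence under the structural convention."""
    s = prop
    for agent in reversed(agents[:n]):
        s = f"{agent} believes that {s}"
    return s

print(sentence(5))
# A believes that B believes that C believes that D believes that
# E believes that the keys are in the drawer
```

Generating items this way keeps wording maximally parallel across orders, which matters when comparing difficulty between levels.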

How these constructs are operationalized (what the tasks actually do)

A common approach is to present short narratives/vignettes and then ask true/false (or multiple-choice) questions whose statements vary in intentionality order, alongside matched factual-memory controls.

Lewis et al. operationalize this by contrasting mentalising vs factual questions and analyzing accuracy and reaction time as a function of level/order. They report that mentalising questions are slower than factual questions and that reaction times increase with intentionality level for mentalising items. (Lewis et al., 2017)
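The mentalising-vs-factual contrast can be scripted for scoring. A sketch with hypothetical item fields, not Lewis et al.'s actual materials:

```python
# Hypothetical item list: mentalising statements at a given order, plus
# matched factual-memory controls (illustrative content only).
items = [
    {"type": "mentalising", "order": 2,
     "statement": "A believes that B believes the keys are in the drawer",
     "answer": True},
    {"type": "factual", "order": 0,
     "statement": "The keys were moved from the drawer to the shelf",
     "answer": True},
]

def score(responses: dict) -> dict:
    """Accuracy per item type, given {statement: True/False} responses."""
    by_type = {}
    for item in items:
        correct = responses.get(item["statement"]) == item["answer"]
        by_type.setdefault(item["type"], []).append(correct)
    return {t: sum(v) / len(v) for t, v in by_type.items()}

print(score({i["statement"]: True for i in items}))
```

Splitting accuracy by item type (and, in a fuller version, by `order`) is what lets the factual controls separate narrative-memory demands from mentalising demands.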

How this connects to LLM evaluation (behavioral protocols, not mechanism)

What some LLM papers do (examples)

A typical behavioral protocol places a vignette in the prompt and asks the model true/false (or multiple-choice) questions whose statements vary in intentionality order, mirroring the human designs above; stricter variants add matched factual-memory controls to limit surface-level shortcuts.

What you can safely claim (publication-accurate wording)

Safe claims are behavioral and task-indexed: for example, "the model answered Level-3 mentalising items above chance under matched factual controls," not "the model has third-order Theory of Mind." Task success on these protocols does not establish that a system implements human mentalising mechanisms. (Jones et al., 2024)

Reporting checklist (best practice for writeups)

When describing “Level N” performance for any system (humans or LLMs), state:

1) Task family (vignette true/false, false-belief prompts, revised recursive tasks, etc.).
2) Counting convention (formal nesting form used).
3) Controls used to limit shortcuts/confounds (matched factual-memory items, stricter task variants).
4) Whether your claim is behavioral (task performance) or mechanistic (which these tasks generally do not determine). (Jones et al., 2024)
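The four checklist fields can be carried in one structured record so none is omitted from a writeup. A hypothetical schema (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class LevelNReport:
    """One reported 'Level N' result carrying all four checklist fields."""
    task_family: str    # 1) e.g. "vignette true/false"
    formal_form: str    # 2) counting convention, e.g. "Att(A, Att(B, P))"
    controls: list      # 3) e.g. ["matched factual-memory items"]
    claim_type: str     # 4) "behavioral" or "mechanistic"
    accuracy: float

report = LevelNReport(
    task_family="vignette true/false",
    formal_form="Att(A, Att(B, P))",
    controls=["matched factual-memory items"],
    claim_type="behavioral",
    accuracy=0.81,
)
print(report.claim_type)
```

Recording `formal_form` rather than a bare level number bakes the counting convention into every reported result.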


References (primary / canonical)