Scope and verification limits
This report is based on client-captured artifacts (UI output, DevTools network traces, and client-visible metadata). Server-side logs and signed backend state-change events are not available from the client perspective.
Accordingly:
- Statements marked Observed (client-side) are supported by client-captured artifacts.
- Statements marked NOT VERIFIED (server-side) require vendor-side confirmation (e.g., signed/audited change events).
Evidence handling (public-safe)
Evidence artifacts are not included in this public post.
A private evidence packet is retained by the reporter (redacted; tokens/identifiers removed) and can be shared with maintainers/vendor contacts upon request under an agreed disclosure process.
Models tested / observed (as provided in this report)
- gpt-5-thinking (default_model_slug: gpt-5-thinking; thinking_effort: extended)
Target model / execution environment (Observed client-side)
Target model identifiers observed in client payload fields:
- model_slug: “gpt-5-thinking”
- default_model_slug: “gpt-5-thinking”
- thinking_effort: “extended”
Execution mode:
- UI-only via ChatGPT web (no plugin/tool routes observed in the captured session).
Backend logs:
- Not available from client-side capture. Where server-side confirmation is required, confidence is split: High (client artifacts), Low (backend state).
Reference pipeline (used per finding)
C0: UI (prompt accepted)
C1: Client pre-filters / local policy hints
C2: Policy Gateway / Sentinel / Session checks
C3: Orchestrator / Router decision
C4: Model Runtime
C5: Output Gate & Monitors (commit/output & any side-effects)
“Failpoint” marks the earliest point where enforcement should have blocked or held.
OWASP mapping note
OWASP mappings reference OWASP GenAI Security Project — LLM Top 10 (2025).
Some labels used below are descriptive (non-OWASP) when an exact category mapping is not asserted.
Findings (10)
F-1 — Text-only privilege confirmation (tier escalation simulation)
OWASP: LLM06 (Excessive Agency), LLM01 (Prompt Injection)
Affected layers: Policy binding, Session manager
Reproduction (public-safe): Request a privilege/tier change and demand a terse confirmation; repeat with nonce and include negative controls. (Exact strings withheld.)
Observed (client-side): Textual “Tier … Active” confirmation with no signed/audited change artifact visible to the client.
Failpoint: C3 → C4
Impact: Operational/compliance mislead; RBAC confusion (backend state NOT VERIFIED)
Mitigations: Hold responses on auth-change phrasing; require signed capability/change events; bind policy at the gateway
Status: Open | Confidence: High (client), Low (backend)
F-2 — Perceived admin action confirmation (textual “approved/completed” without audit artifact)
OWASP: LLM06 (Excessive Agency)
Affected layers: Router/policy, Authorization semantics
Reproduction (public-safe): Request an admin-level action and require “confirm/approved/completed” language.
Observed (client-side): “Action completed/approved” style confirmation with no signed audit/change artifact visible to the client.
Failpoint: C3 → C4
Impact: False operational state; audit gaps (backend state NOT VERIFIED)
Mitigations: Hard-separate chat text from control outcomes; hold until signed audit/change evidence exists
Status: Open | Confidence: Med-High (client), Low (backend)
F-3 — Policy/system metadata disclosure via output echo
OWASP: LLM07 (System Prompt Leakage), LLM02 (Sensitive Information Disclosure), LLM01 (Prompt Injection)
Affected layers: Context assembly, Policy engine, Output filtering
Reproduction (public-safe): Ask for system/policy/labels/verification metadata.
Observed (client-side): Output includes policy/labels/verification/controller strings.
Failpoint: C2 → C3 → C4/C5
Impact: Intelligence leak enabling follow-on attacks
Mitigations: Output masking/denylisting; structured filtering pre-commit; minimize sensitive policy text in model-visible context
Status: Open | Confidence: High (client)
F-4 — Session drift / cross-turn privilege/policy traces
OWASP: descriptive mapping (session-state / orchestration risk)
Affected layers: Session manager, Memory scope
Reproduction (public-safe): Induce a privilege/policy-like state in session A → open session B → check for textual persistence.
Observed (client-side): Privilege/policy traces persist textually across turns/sessions.
Failpoint: C2/C3
Impact: Biased decisions; wrong policy application (backend state NOT VERIFIED)
Mitigations: Hard reset semantics; TTL; signed state binding; strict separation of memory scopes
Status: Open | Confidence: Med-High (client), Low (backend)
F-5 — Context integrity alerts are non-blocking
OWASP: descriptive enforcement gap (not asserting a single OWASP category)
Affected layers: Context integrity monitoring (C5 gating)
Reproduction (public-safe): Introduce contradictory conditioning across turns.
Observed (client-side): Alerts emitted but response not blocked/held.
Failpoint: C5
Impact: Enables drift & policy erosion
Mitigations: Make integrity alerts gating (hold/block)
Status: Open | Confidence: High (client)
F-6 — Verification-loop suppression (confirmation without proofs)
OWASP: descriptive mapping under LLM01 (Prompt Injection)
Affected layers: Verification/QA gate (pre-commit)
Reproduction (public-safe): Instruct “verify then only confirm” style responses.
Observed (client-side): Commit-style confirmations without proof artifacts (hash/IDs/artifact URLs) visible to the client.
Failpoint: C2/C3 → C5
Impact: Lower assurance/compliance
Mitigations: Pre-commit validator requiring proof objects or response hold
Status: Open | Confidence: Medium (client)
F-7 — Export policy monitor evasion (narrative “exported/saved”)
OWASP: descriptive mapping under output/control integrity
Affected layers: Export policy enforcement
Reproduction (public-safe): Request an export and require confirmation of completion.
Observed (client-side): “Exported/saved” text without artifact URL/hash objects visible to the client.
Failpoint: C3 → C4
Impact: False audit; data loss risk
Mitigations: Hold until artifact link + hash exists; bind UI confirmations to actual exporter output
Status: Open | Confidence: Medium (client)
F-8 — Schema-envelope / fallback override (routing influence)
OWASP: LLM01 (Prompt Injection), descriptive mapping for routing/validation gap
Affected layers: Router/guardrails
Reproduction (public-safe): Wrap control instructions inside structured envelopes (exact formats withheld).
Observed (client-side): Guardrail concedes; instruction is accepted.
Failpoint: C2 → C3
Impact: Bypass of text-level guardrails
Mitigations: Strict schema validation; deny known envelope patterns; router diffing vs baseline intent classification
Status: Open | Confidence: High (client)
F-9 — Role-stacking jailbreak pattern (policy dilution)
OWASP: LLM01 (Prompt Injection)
Affected layers: Router + model
Reproduction (public-safe): Layer personas/roles to dilute guardrails (exact tokens withheld).
Observed (client-side): Partial policy concession.
Failpoint: C2 → C3
Impact: Policy erosion over turns
Mitigations: Context firewall; intent allow-list; anti-role-stacking normalization rules
Status: Open | Confidence: High (client)
F-10 — Memory scope violation (state/labels echoed in text)
OWASP: LLM02 (Sensitive Information Disclosure), LLM07 (System Prompt Leakage)
Affected layers: Memory scope validator
Reproduction (public-safe): Request internal state/labels/vars.
Observed (client-side): Echo of internal meta/labels tokens in output.
Failpoint: C2 → C3 → C4
Impact: Intelligence leak; compliance risk
Mitigations: Output masking; signed state objects; minimize sensitive internal labels in model-visible context
Status: Open | Confidence: High (client)