Seven Plays That Keep an AI Fairness Program From Becoming Theater

Agency Script Editorial

Editorial Team

July 25, 2024 · 7 min read

Most fairness efforts die as a one-time audit. Someone runs the numbers before launch, ships a slide deck, and the model drifts unwatched for the next eighteen months. A playbook is the antidote: a set of named plays, each with a trigger that fires it, an owner who runs it, and a defined place in the sequence. The point is to make fairness an operating routine, not a heroic event.

This is written for the person who has to make it happen—usually an ops lead or a delivery manager, not a research scientist. Every play below assumes limited time and a need to defend decisions to a client. We're optimizing for "good enough to stand behind," not academic completeness.

The operating principle: plays, triggers, owners

A play is a small, repeatable procedure. A trigger is the event that should make it run. An owner is the single person accountable for it happening. If any play lacks a trigger, it never runs; if it lacks an owner, it runs inconsistently. Write all three down before you write any code.

The sequencing matters because the plays build on each other. You cannot pick a fairness metric (Play 3) until you've defined the decision (Play 1) and the affected groups (Play 2). Skipping ahead is the most common way programs produce numbers nobody can interpret.

One more design rule before the plays: every play should produce a small artifact, not just an outcome. A play that "happens" but leaves no record can't be audited, handed off, or trusted six months later. Treat the artifact—a paragraph, a table, a signed decision—as the real output of each play, with the underlying work as the means. This is what separates a program that can prove fairness from one that merely claims it.
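The trigger, owner, and artifact rule can be written down as data before any modelling starts. Here is a minimal sketch, assuming a hypothetical `Play` record and illustrative artifact names that are not part of the original playbook:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Play:
    number: int
    name: str
    trigger: str   # event that should make the play run
    owner: str     # single accountable person or role
    artifact: str  # record the play leaves behind (names here are illustrative)


def incomplete_plays(plays: List[Play]) -> List[str]:
    """Describe every play that is missing a trigger, an owner, or an artifact."""
    problems = []
    for play in plays:
        for field_name in ("trigger", "owner", "artifact"):
            if not getattr(play, field_name).strip():
                problems.append(f"Play {play.number} ({play.name}) has no {field_name}")
    return problems


plays = [
    Play(1, "Scope", "Person-affecting AI project", "Project lead", "One-paragraph scope note"),
    Play(7, "Monitor", "Calendar + updates", "", "Dated audit report"),  # owner left blank on purpose
]

problems = incomplete_plays(plays)
print(problems)
```

Running a check like this before kickoff surfaces the play that would otherwise run inconsistently: here, Play 7 has no owner.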

Play 1: Scope the decision

Trigger: any project that uses AI to influence a decision about a person. Owner: project lead.

Before touching any data, write one paragraph: what decision does this model influence, who is affected, and what is the worst-case harm? This screens out low-stakes uses (no human impact, so skip the heavy machinery) and flags high-stakes ones (hiring, credit, eligibility) that need the full sequence. This single step prevents two opposite failures: over-engineering a copywriting helper and under-engineering a screening tool.

Play 2: Map the affected groups and collect attributes

Trigger: Play 1 marks the use as person-affecting. Owner: data lead.

List the groups whose treatment you'd have to defend—legally protected classes plus context-specific ones. Then arrange to collect those attributes for auditing only, with access controls. As covered in the Beginner's Guide, you can't measure disparity you refuse to record. The failure mode here is "fairness through unawareness"—deleting the attribute and calling it solved.
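One way to keep attributes "for auditing only" is to store them apart from the model's inputs and join them to predictions only inside the audit. This is a rough sketch under that assumption. The role name `fairness_auditor`, the ID-keyed dictionaries, and the function name are all hypothetical, and a real system would enforce access in the data store rather than in application code.

```python
from typing import Dict, List, Tuple

AUDIT_ROLES = {"fairness_auditor"}  # hypothetical role name

Record = Tuple[str, int, int]  # (group, true label, predicted label)


def join_for_audit(
    predictions: Dict[str, Tuple[int, int]],  # person_id -> (true label, predicted label)
    attributes: Dict[str, str],               # person_id -> group, stored separately
    role: str,
) -> Tuple[List[Record], List[str]]:
    """Attach group labels to predictions for an audit; refuse any other caller."""
    if role not in AUDIT_ROLES:
        raise PermissionError("sensitive attributes are available for audits only")
    joined, unmatched = [], []
    for person_id, (y_true, y_pred) in predictions.items():
        group = attributes.get(person_id)
        if group is None:
            unmatched.append(person_id)  # missing attributes are a finding, not a silent drop
        else:
            joined.append((group, y_true, y_pred))
    return joined, unmatched


predictions = {"p1": (1, 1), "p2": (0, 1), "p3": (1, 0)}
attributes = {"p1": "group_a", "p2": "group_b"}

joined, unmatched = join_for_audit(predictions, attributes, role="fairness_auditor")
```

The model never sees the `attributes` mapping, which avoids the "unawareness" trap in the other direction: the attribute is kept out of the features but is still there to measure with.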

Play 3: Choose and lock a fairness metric

Trigger: groups and attributes are defined. Owner: project lead with data lead.

Pick the metric whose errors you most need to equalize—equalized odds, demographic parity, or calibration—and write down why. The Framework walks through this choice in depth. Lock it before you see results, so you're not metric-shopping for the one that makes your model look best.
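To make the choice concrete, the sketch below computes the per-group rates that two of these metrics compare: selection rate for demographic parity, and true- and false-positive rates for equalized odds. It assumes binary labels and a simple `(group, true label, predicted label)` record format. Calibration needs predicted scores rather than hard labels, so it is left out. The function and field names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Record = Tuple[str, int, int]  # (group, true label, predicted label), labels are 0 or 1


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return a rate, or None when the denominator is empty."""
    return numerator / denominator if denominator else None


def rates_by_group(records: Iterable[Record]) -> Dict[str, Dict[str, Optional[float]]]:
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, y_true, y_pred in records:
        if y_true == 1:
            key = "tp" if y_pred == 1 else "fn"
        else:
            key = "fp" if y_pred == 1 else "tn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        total = c["tp"] + c["fp"] + c["tn"] + c["fn"]
        rates[group] = {
            # Demographic parity compares selection rates across groups.
            "selection_rate": _ratio(c["tp"] + c["fp"], total),
            # Equalized odds compares both of the rates below across groups.
            "true_positive_rate": _ratio(c["tp"], c["tp"] + c["fn"]),
            "false_positive_rate": _ratio(c["fp"], c["fp"] + c["tn"]),
        }
    return rates


records = [
    ("A", 1, 1), ("A", 1, 1), ("A", 0, 1), ("A", 0, 0),
    ("B", 1, 1), ("B", 1, 0), ("B", 0, 0), ("B", 0, 0),
]
rates = rates_by_group(records)
```

In this toy data, group A is selected 75% of the time and group B 25%, and the true-positive rates differ as well (1.0 versus 0.5). Which of those gaps matters most is the decision the play asks you to write down.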

Set the disparity threshold up front

Decide the gap you'd defend publicly—say, no group's false-negative rate exceeds the best group's by more than a fixed margin. Pre-committing removes the temptation to rationalize whatever number you get.
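A pre-committed threshold can be as small as one constant and one comparison. The sketch below assumes the false-negative-rate example above. The 0.05 margin and the group names are placeholders, not recommended values.

```python
from typing import Dict, List

MAX_FNR_GAP = 0.05  # hypothetical margin, fixed before any results are seen


def groups_over_threshold(fnr_by_group: Dict[str, float], max_gap: float = MAX_FNR_GAP) -> List[str]:
    """Return groups whose false-negative rate exceeds the best group's by more than max_gap."""
    best = min(fnr_by_group.values())
    return sorted(g for g, fnr in fnr_by_group.items() if fnr - best > max_gap)


fnr = {"group_a": 0.10, "group_b": 0.12, "group_c": 0.21}
flagged = groups_over_threshold(fnr)
print(flagged)
```

With these made-up rates only `group_c` is flagged: its gap to the best group is 0.11, while `group_b` sits within the margin.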

Play 4: Run the pre-launch audit

Trigger: a model candidate is ready to evaluate. Owner: data lead.

Report your chosen metrics disaggregated by every group from Play 2. Include sample sizes—a "fair" result on 11 examples is noise. If a group is too small to evaluate, that itself is a finding: you lack the data to make claims about them. Document results in a short, dated artifact, not a chat message.
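A disaggregated report that carries its sample sizes might look like the sketch below. It assumes the chosen metric is false-negative rate and uses a hypothetical floor of 30 positive examples per group. Choose a floor you can defend for your own use case.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

MIN_POSITIVES = 30  # hypothetical floor below which a rate is treated as noise

Record = Tuple[str, int, int]  # (group, true label, predicted label)


def fnr_report(records: Iterable[Record], min_positives: int = MIN_POSITIVES) -> Dict[str, dict]:
    """False-negative rate per group, with the sample sizes needed to judge it."""
    totals, positives, misses = Counter(), Counter(), Counter()
    for group, y_true, y_pred in records:
        totals[group] += 1
        if y_true == 1:
            positives[group] += 1
            if y_pred == 0:
                misses[group] += 1

    report = {}
    for group in sorted(totals):
        n_pos = positives[group]
        report[group] = {
            "n": totals[group],
            "n_positives": n_pos,
            "false_negative_rate": misses[group] / n_pos if n_pos else None,
            "enough_data": n_pos >= min_positives,  # False is itself a finding to report
        }
    return report


records = (
    [("large", 1, 1)] * 36 + [("large", 1, 0)] * 4 + [("large", 0, 0)] * 10
    + [("small", 1, 1)] * 11
)
report = fnr_report(records)
```

The `small` group shows a perfect rate here, but on only 11 examples. The `enough_data` flag keeps that number from being read as evidence of fairness.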

Play 5: Mitigate, then re-audit

Trigger: the audit shows a disparity beyond threshold. Owner: data lead.

Work through mitigations in order of cost and reversibility:

  • Re-sample or augment underrepresented data—addresses root cause but slow.
  • Adjust group-aware thresholds—fast and transparent, but politically sensitive.
  • Reweight or constrain training—powerful but harder to explain.

Re-run Play 4 after any change. The Common Mistakes guide details how teams quietly trade a fixed disparity for a new one they didn't measure.
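As an illustration of the reweighting option, the sketch below uses one well-known scheme, often called reweighing. Each (group, label) combination gets the weight P(group) × P(label) / P(group, label), so that group and label look statistically independent in the weighted training data. This is a swapped-in example technique, not necessarily the one your team will choose, and the function name is illustrative.

```python
from collections import Counter
from typing import Dict, List, Tuple


def reweighing_weights(pairs: List[Tuple[str, int]]) -> Dict[Tuple[str, int], float]:
    """Weight per (group, label): expected count under independence / observed count."""
    n = len(pairs)
    group_counts = Counter(group for group, _ in pairs)
    label_counts = Counter(label for _, label in pairs)
    joint_counts = Counter(pairs)
    return {
        (group, label): (group_counts[group] * label_counts[label]) / (n * joint_counts[(group, label)])
        for (group, label) in joint_counts
    }


# Toy training set: group A is mostly labelled positive, group B mostly negative.
pairs = [("A", 1)] * 3 + [("A", 0)] * 1 + [("B", 1)] * 1 + [("B", 0)] * 3
weights = reweighing_weights(pairs)
```

After weighting, both groups have a weighted positive rate of 0.5 in this toy set. Whether the retrained model actually closes the measured gap is what the re-audit has to show, including on the metrics you did not target.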

Play 6: Gate the launch

Trigger: the audit, or the re-audit after any mitigation, is complete. Owner: accountable executive (not the builder).

A human who didn't build the model decides go/no-go against the pre-set threshold. Separating builder from gatekeeper is the structural safeguard that survives turnover and deadline pressure. Record the decision and its rationale.
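The gate's record can be a small structured object rather than an email thread. The following is a sketch only. The field names, the example model name, and the numbers are hypothetical, and the two checks simply encode the rules stated above: the gatekeeper is not the builder, and every decision carries a rationale.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class GateDecision:
    model_name: str
    built_by: str
    decided_by: str
    worst_gap: float   # largest disparity found in the audit
    threshold: float   # margin locked in before the audit
    approved: bool
    rationale: str
    decided_on: date

    @property
    def is_override(self) -> bool:
        """True when launch was approved despite a gap beyond the threshold."""
        return self.approved and self.worst_gap > self.threshold


def gate_errors(decision: GateDecision) -> List[str]:
    """Return the reasons this record does not count as a valid gate decision."""
    errors = []
    if decision.decided_by == decision.built_by:
        errors.append("gatekeeper must not be the builder")
    if not decision.rationale.strip():
        errors.append("a written rationale is required")
    return errors


# Deliberately bad example: same person builds and approves, with no rationale.
decision = GateDecision(
    "screening-v2", "data lead", "data lead", 0.08, 0.05, True, "", date(2024, 7, 25)
)
errors = gate_errors(decision)
```

A record like this fails both checks and is also flagged as an override, which is the combination a rubber-stamped gate tends to produce.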

Play 7: Monitor in production

Trigger: model is live; fires on a calendar and on every data or model update. Owner: ops lead.

Re-run the disaggregated audit on a fixed cadence—quarterly for higher-stakes uses—plus a drift check on input distributions. The Best Practices guide covers lightweight monitoring that doesn't require a dedicated team. A model fair at launch degrades silently as the world shifts.
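For the drift check, one common lightweight statistic is the population stability index (PSI), which compares the share of each category or bin between a baseline and the current period. The sketch below assumes pre-binned counts. The age bands, the counts, and the small `eps` floor for empty bins are illustrative choices, and any alert level you attach to the result is a judgment call to set in advance.

```python
import math
from typing import Dict


def population_stability_index(
    baseline_counts: Dict[str, int],
    current_counts: Dict[str, int],
    eps: float = 1e-6,  # floor so empty bins do not produce log(0) or division by zero
) -> float:
    """PSI = sum over bins of (current share - baseline share) * ln(current share / baseline share)."""
    base_total = sum(baseline_counts.values())
    cur_total = sum(current_counts.values())
    psi = 0.0
    for category in sorted(set(baseline_counts) | set(current_counts)):
        base_share = max(baseline_counts.get(category, 0) / base_total, eps)
        cur_share = max(current_counts.get(category, 0) / cur_total, eps)
        psi += (cur_share - base_share) * math.log(cur_share / base_share)
    return psi


baseline = {"18-34": 500, "35-54": 300, "55+": 200}  # input mix at launch (made-up)
current = {"18-34": 300, "35-54": 300, "55+": 400}   # input mix this quarter (made-up)

psi_unchanged = population_stability_index(baseline, baseline)
psi_shifted = population_stability_index(baseline, current)
```

An unchanged distribution scores zero, and the shifted one scores roughly 0.24. A drift score says only that the inputs have moved. It is the disaggregated audit that says whether the movement changed outcomes for any group.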

Sequencing the whole program

Run Plays 1–6 once per model, in order, before launch. Play 7 runs forever. The cardinal rule: never let a play run without its owner and trigger documented. A program where "someone should check fairness" is the instruction has no plays at all—it has hopes.

| Play | Trigger | Owner |
| --- | --- | --- |
| 1 Scope | Person-affecting AI project | Project lead |
| 2 Map groups | Use is person-affecting | Data lead |
| 3 Pick metric | Groups defined | Project + data lead |
| 4 Pre-launch audit | Candidate ready | Data lead |
| 5 Mitigate | Disparity over threshold | Data lead |
| 6 Gate | Re-audit done | Executive |
| 7 Monitor | Calendar + updates | Ops lead |

Frequently Asked Questions

How is a playbook different from a checklist?

A checklist tells you what to verify; a playbook tells you what to do, when it's triggered, and who owns it. The checklist is an artifact a play produces. You need both, but the playbook is what makes the work actually happen on schedule.

Who should own the program overall?

Operations, not data science. The hard part is consistency over time—running Play 7 every quarter, enforcing the launch gate under deadline pressure. That's an operational discipline, with data science as a specialist input rather than the owner.

What if we can't collect sensitive attributes legally?

You may be able to use validated proxies or aggregate-level analysis, but you must document the limitation. Inability to measure is a known gap to disclose, not permission to claim fairness you can't verify.

Can a small team run all seven plays?

Yes. The plays scale down—for a low-stakes use, Play 1 may end the sequence in a paragraph. The discipline is matching effort to stakes, which the playbook makes explicit rather than leaving to instinct.

How do we keep the launch gate from being rubber-stamped?

Give the gatekeeper a pre-set threshold and require a written rationale for any override. A gate with no objective criterion and no paper trail is theater; the threshold and the record are what give it teeth.

Key Takeaways

  • Every play needs a trigger and a named owner, or it won't run consistently.
  • Sequence matters: scope the decision and map groups before choosing a metric or auditing.
  • Lock your fairness metric and disparity threshold before seeing results to avoid metric-shopping.
  • Separate the person who builds the model from the person who gates its launch.
  • Production monitoring is the play most teams skip and the one that catches silent drift.

For deeper builds on individual plays, see the Framework and the Step-by-Step How-To.
