Seven Plays That Keep an AI Fairness Program From Becoming Theater

Most fairness efforts die as a one-time audit. Someone runs the numbers before launch, ships a slide deck, and the model drifts unwatched for the next eighteen months. A playbook is the antidote: a set of named plays, each with a trigger that fires it, an owner who runs it, and a defined place in the sequence. The point is to make fairness an operating routine, not a heroic event.

This is written for the person who has to make it happen—usually an ops lead or a delivery manager, not a research scientist. Every play below assumes limited time and a need to defend decisions to a client. We're optimizing for "good enough to stand behind," not academic completeness.

The operating principle: plays, triggers, owners

A play is a small, repeatable procedure. A trigger is the event that should make it run. An owner is the single person accountable for it happening. If any play lacks a trigger, it never runs; if it lacks an owner, it runs inconsistently. Write all three down before you write any code.

The sequencing matters because the plays build on each other. You cannot pick a fairness metric (Play 3) until you've defined the decision and affected groups (Play 1). Skipping ahead is the most common way programs produce numbers nobody can interpret.

One more design rule before the plays: every play should produce a small artifact, not just an outcome. A play that "happens" but leaves no record can't be audited, handed off, or trusted six months later. Treat the artifact—a paragraph, a table, a signed decision—as the real output of each play, with the underlying work as the means. This is what separates a program that can prove fairness from one that merely claims it.
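One way to make "write all three down" enforceable is to keep the plays in a small registry and refuse to treat a play as runnable while any field is blank. The sketch below is illustrative only: the `Play` class, the `missing_fields` helper, and the artifact descriptions are assumptions made for this example, not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Play:
    number: int
    name: str
    trigger: str   # event that should make the play run
    owner: str     # single accountable role or person
    artifact: str  # record the play leaves behind (example wording is hypothetical)


def missing_fields(play: Play) -> list[str]:
    """Return the names of required fields that are blank for this play."""
    required = ("trigger", "owner", "artifact")
    return [field for field in required if not getattr(play, field).strip()]


plays = [
    Play(1, "Scope the decision", "Person-affecting AI project",
         "Project lead", "One-paragraph scope note"),
    # Owner deliberately left blank to show what the check reports.
    Play(7, "Monitor in production", "Calendar + updates",
         "", "Dated audit report"),
]

for play in plays:
    gaps = missing_fields(play)
    if gaps:
        print(f"Play {play.number} is not runnable yet, missing: {', '.join(gaps)}")
```

A spreadsheet with the same columns does the same job; the point is that an empty trigger, owner, or artifact cell is treated as a blocker rather than a detail to fill in later.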

Play 1: Scope the decision

Trigger: any project that uses AI to influence a decision about a person. Owner: project lead.

Before data, write one paragraph: what decision does this model influence, who is affected, and what's the worst-case harm? This screens out low-stakes uses (no human impact, skip the heavy machinery) and flags high-stakes ones (hiring, credit, eligibility) that need the full sequence. This single step prevents the two opposite failures: over-engineering a copywriting helper and under-engineering a screening tool.

Play 2: Map the affected groups and collect attributes

Trigger: Play 1 marks the use as person-affecting. Owner: data lead.

List the groups whose treatment you'd have to defend—legally protected classes plus context-specific ones. Then arrange to collect those attributes for auditing only, with access controls. As covered in the Beginner's Guide, you can't measure disparity you refuse to record. The failure mode here is "fairness through unawareness"—deleting the attribute and calling it solved.

Play 3: Choose and lock a fairness metric

Trigger: groups and attributes are defined. Owner: project lead with data lead.

Pick the metric whose errors you most need to equalize—equalized odds, demographic parity, or calibration—and write down why. The Framework walks through this choice in depth. Lock it before you see results, so you're not metric-shopping for the one that makes your model look best.
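If "locking" needs to be more than a promise, one lightweight option is to store the metric choice with a date and a digest, so a later edit is detectable. This is a minimal sketch under assumptions: the config keys, the example reason text, and the date are hypothetical, and the digest only helps if the lock record is kept somewhere the builder cannot quietly rewrite.

```python
import hashlib
import json
from datetime import date


def lock_metric(config: dict, locked_on: date) -> dict:
    """Return a lock record containing a SHA-256 digest of the canonical config."""
    # sort_keys makes the digest independent of key order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {"config": config, "locked_on": locked_on.isoformat(), "sha256": digest}


def is_unchanged(config: dict, lock: dict) -> bool:
    """True if the config still matches the digest stored at lock time."""
    return lock_metric(config, date.today())["sha256"] == lock["sha256"]


# Hypothetical example values.
metric_config = {
    "metric": "equalized_odds",
    "reason": "False negatives deny a qualified person the outcome.",
    "groups": ["group_a", "group_b"],
}
lock = lock_metric(metric_config, date(2024, 1, 15))

print(is_unchanged(metric_config, lock))
print(is_unchanged({**metric_config, "metric": "demographic_parity"}, lock))
```

The first check passes and the second fails, which is the behavior you want when someone swaps the metric after seeing results.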

Set the disparity threshold up front

Decide the gap you'd defend publicly—say, no group's false-negative rate exceeds the best group's by more than a fixed margin. Pre-committing removes the temptation to rationalize whatever number you get.
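As a concrete reading of that example rule, the sketch below compares each group's false-negative rate with the best (lowest) group's rate and reports any group over the margin. The function name, the group labels, the rates, and the 0.05 margin are all assumptions chosen for illustration; your own margin is whatever you committed to in advance.

```python
def fnr_gap_violations(fnr_by_group: dict[str, float], max_gap: float) -> dict[str, float]:
    """Return groups whose FNR exceeds the lowest group's FNR by more than max_gap.

    The returned value for each group is its gap to the best group.
    """
    best = min(fnr_by_group.values())
    return {
        group: round(fnr - best, 4)
        for group, fnr in fnr_by_group.items()
        if fnr - best > max_gap
    }


MAX_FNR_GAP = 0.05  # hypothetical pre-committed margin

# Hypothetical audit output: false-negative rate per group.
observed = {"group_a": 0.10, "group_b": 0.12, "group_c": 0.19}

print(fnr_gap_violations(observed, MAX_FNR_GAP))
```

Here only `group_c` is flagged, because its gap to the best group is 0.09 while `group_b` sits at 0.02. An empty result means the model is within the margin on this metric, nothing more.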

Play 4: Run the pre-launch audit

Trigger: a model candidate is ready to evaluate. Owner: data lead.

Report your chosen metrics disaggregated by every group from Play 2. Include sample sizes—a "fair" result on 11 examples is noise. If a group is too small to evaluate, that itself is a finding: you lack the data to make claims about them. Document results in a short, dated artifact, not a chat message.
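A disaggregated report can be quite small. The sketch below computes false-negative rate per group and refuses to report a rate when a group has too few positive cases, marking it as not evaluable instead. The record layout, the function name, and the floor of 30 are assumptions for this example; the right floor depends on your data and the claims you need to make.

```python
from collections import defaultdict

MIN_SAMPLE = 30  # hypothetical floor on actual positives per group


def audit_fnr(records, min_sample=MIN_SAMPLE):
    """Compute false-negative rate per group from (group, y_true, y_pred) tuples.

    Labels are 0/1. A group with fewer than min_sample actual positives is
    reported with fnr=None and evaluable=False rather than a noisy rate.
    """
    groups = set()
    positives = defaultdict(int)
    misses = defaultdict(int)
    for group, y_true, y_pred in records:
        groups.add(group)
        if y_true == 1:
            positives[group] += 1
            if y_pred == 0:
                misses[group] += 1

    report = {}
    for group in sorted(groups):
        n = positives[group]
        evaluable = n > 0 and n >= min_sample
        report[group] = {
            "positives": n,
            "fnr": misses[group] / n if evaluable else None,
            "evaluable": evaluable,
        }
    return report


# Hypothetical data: group_a has 40 positives (4 missed), group_b only 11.
records = (
    [("group_a", 1, 0)] * 4
    + [("group_a", 1, 1)] * 36
    + [("group_b", 1, 1)] * 11
)

for group, row in audit_fnr(records).items():
    print(group, row)
```

Note that `group_b` shows zero misses but is still reported as not evaluable: with 11 cases, the absence of errors is not evidence of fairness.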

Play 5: Mitigate, then re-audit

Trigger: the audit shows a disparity beyond threshold. Owner: data lead.

Work through mitigations in order of cost and reversibility:

Re-sample or augment underrepresented data—addresses root cause but slow.
Adjust group-aware thresholds—fast and transparent, but politically sensitive.
Reweight or constrain training—powerful but harder to explain.

Re-run Play 4 after any change. The Common Mistakes guide details how teams quietly trade a fixed disparity for a new one they didn't measure.

Play 6: Gate the launch

Trigger: re-audit complete. Owner: accountable executive (not the builder).

A human who didn't build the model decides go/no-go against the pre-set threshold. Separating builder from gatekeeper is the structural safeguard that survives turnover and deadline pressure. Record the decision and its rationale.
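The decision record itself can carry the safeguards. The following sketch shows one possible shape: it rejects a record where the gatekeeper is also the builder or where the rationale is blank, and it flags approvals that go over the threshold as overrides. The class, field names, and example values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GateDecision:
    model: str
    built_by: str
    decided_by: str
    observed_gap: float   # disparity measured in the latest audit
    threshold: float      # margin committed to before results were seen
    approved: bool
    rationale: str
    decided_on: date

    @property
    def is_override(self) -> bool:
        """Approved even though the observed gap exceeds the pre-set threshold."""
        return self.approved and self.observed_gap > self.threshold


def problems(decision: GateDecision) -> list[str]:
    """Return reasons this record should not be accepted as a valid gate decision."""
    issues = []
    if decision.decided_by.strip().lower() == decision.built_by.strip().lower():
        issues.append("gatekeeper is also the builder")
    if not decision.rationale.strip():
        issues.append("rationale is missing")
    return issues


# Hypothetical example: an approval over threshold with no written rationale.
decision = GateDecision(
    model="screening-model-v2",
    built_by="data lead",
    decided_by="accountable executive",
    observed_gap=0.09,
    threshold=0.05,
    approved=True,
    rationale="",
    decided_on=date(2024, 3, 1),
)

print(decision.is_override, problems(decision))
```

In this example the record is an override and fails validation because nothing was written down, which is exactly the case a gate needs to catch.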

Play 7: Monitor in production

Trigger: model is live; fires on a calendar and on every data or model update. Owner: ops lead.

Re-run the disaggregated audit on a fixed cadence—quarterly for higher-stakes uses—plus a drift check on input distributions. The Best Practices guide covers lightweight monitoring that doesn't require a dedicated team. A model fair at launch degrades silently as the world shifts.
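The text does not prescribe a drift statistic. One common choice for a categorical or pre-binned input is the population stability index (PSI), which compares the share of each category at launch with the share seen in production. The sketch below is an illustration under assumptions: the function names, the example data, and the 0.2 alert level are all hypothetical, and the alert level should be set for your own context.

```python
import math
from collections import Counter


def proportions(values, categories):
    """Share of each category in a non-empty list of values."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


def psi(expected, actual, eps=1e-6):
    """Population stability index between two lists of category proportions."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    total = 0.0
    for e, a in zip(expected, actual):
        # Floor at eps so an empty category does not produce log(0).
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


ALERT_LEVEL = 0.2  # hypothetical alert level

# Hypothetical input feature: category mix at launch vs. this quarter.
categories = ["a", "b"]
baseline = ["a"] * 50 + ["b"] * 50
live = ["a"] * 70 + ["b"] * 30

score = psi(proportions(baseline, categories), proportions(live, categories))
print(f"PSI = {score:.3f}, alert = {score > ALERT_LEVEL}")
```

With a shift from a 50/50 to a 70/30 mix, the score comes out at roughly 0.17, just under the hypothetical alert level. A drift alert is a prompt to re-run the disaggregated audit, not a fairness finding on its own.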

Sequencing the whole program

Run Plays 1–6 in order for each model before launch, repeating Plays 4 and 5 until the audit clears the threshold. Play 7 runs for as long as the model is live. The cardinal rule: never let a play run without its owner and trigger documented. A program where "someone should check fairness" is the instruction has no plays at all; it has hopes.

| Play | Trigger | Owner |
| --- | --- | --- |
| 1 Scope | Person-affecting AI project | Project lead |
| 2 Map groups | Use is person-affecting | Data lead |
| 3 Pick metric | Groups defined | Project + data lead |
| 4 Pre-launch audit | Candidate ready | Data lead |
| 5 Mitigate | Disparity over threshold | Data lead |
| 6 Gate | Re-audit done | Executive |
| 7 Monitor | Calendar + updates | Ops lead |

Frequently Asked Questions

How is a playbook different from a checklist?

A checklist tells you what to verify; a playbook tells you what to do, when it's triggered, and who owns it. The checklist is an artifact a play produces. You need both, but the playbook is what makes the work actually happen on schedule.

Who should own the program overall?

Operations, not data science. The hard part is consistency over time—running Play 7 every quarter, enforcing the launch gate under deadline pressure. That's an operational discipline, with data science as a specialist input rather than the owner.

What if we can't collect sensitive attributes legally?

You may be able to use validated proxies or aggregate-level analysis, but you must document the limitation. Inability to measure is a known gap to disclose, not permission to claim fairness you can't verify.

Can a small team run all seven plays?

Yes. The plays scale down—for a low-stakes use, Play 1 may end the sequence in a paragraph. The discipline is matching effort to stakes, which the playbook makes explicit rather than leaving to instinct.

How do we keep the launch gate from being rubber-stamped?

Give the gatekeeper a pre-set threshold and require a written rationale for any override. A gate with no objective criterion and no paper trail is theater; the threshold and the record are what give it teeth.

Key Takeaways

Every play needs a trigger and a named owner, or it won't run consistently.
Sequence matters: scope the decision and map groups before choosing a metric or auditing.
Lock your fairness metric and disparity threshold before seeing results to avoid metric-shopping.
Separate the person who builds the model from the person who gates its launch.
Production monitoring is the play most teams skip and the one that catches silent drift.

For deeper builds on individual plays, see the Framework and the Step-by-Step How-To.

Agency Script Editorial
