Theory about AI fairness is everywhere. A concrete procedure you can actually run on a Tuesday is rare. This article is that procedure. It assumes you have a model, some data, and the ability to slice results into groups. It does not assume you have a dedicated ethics team or a budget. The goal is to take you from "we should probably check for bias" to "here is the measured disparity and here is what we are going to do about it."
Work through these steps in order. Skipping ahead is a common way for bias audits to go wrong, because each step produces the input the next one needs. By the end you will have a documented result that someone could defend in a review.
Step 1: Decide Which Groups You Are Protecting
Before measuring anything, name the populations you care about. This is a judgment call, not a technical one.
How to choose
- List the protected attributes relevant to your domain: race, gender, age, disability, language, or others specific to your context.
- Decide on the subgroups within each. "Age" might split into bands; "language" might split into the top five in your user base.
- Confirm you actually have, or can recover, this attribute for your evaluation data. If you cannot, you cannot measure fairness, only guess at it.
If you are still shaky on these terms, the Beginner's Guide defines protected attributes and proxies plainly.
A common stumbling block at this step is intersectionality. A model can look fair when you check gender alone and fair when you check race alone, yet fail badly for the intersection of the two. If your groups are large enough to measure reliably, define at least a few intersectional subgroups, not just one attribute at a time. The cost of ignoring this is a model that passes every single-axis test while quietly failing the people who sit at the overlap.
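As a rough illustration of the intersectionality check, the sketch below counts evaluation examples per combination of attributes and separates the cells that are large enough to measure from those that are not. The record schema, the attribute names, and the `min_size` cutoff of 30 are all assumptions made for the example, not recommendations.

```python
from collections import Counter

def intersectional_groups(records, attributes, min_size=30):
    """Count records per attribute combination and flag small cells.

    records: list of dicts, one per evaluation example (hypothetical schema).
    attributes: keys of the protected attributes to cross.
    min_size: illustrative cutoff below which a cell is treated as too small.
    """
    counts = Counter(tuple(r[a] for a in attributes) for r in records)
    measurable = {g: n for g, n in counts.items() if n >= min_size}
    too_small = {g: n for g, n in counts.items() if n < min_size}
    return measurable, too_small

# Toy evaluation set; the attribute values and counts are made up.
records = (
    [{"gender": "F", "race": "A"}] * 50
    + [{"gender": "F", "race": "B"}] * 8
    + [{"gender": "M", "race": "A"}] * 60
    + [{"gender": "M", "race": "B"}] * 40
)

measurable, too_small = intersectional_groups(records, ["gender", "race"])
print("measurable:", measurable)
print("too small: ", too_small)
```

In this toy set the ("F", "B") cell has only 8 examples, so it lands in `too_small`. That is a finding to record, not a group to drop silently.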
Step 2: Pick One Fairness Definition Up Front
Do not measure first and choose a definition later; that invites cherry-picking the metric that makes you look good.
The decision
- If equal selection rates across groups matter most for your use case, choose demographic parity.
- If equal error rates matter most, choose equalized odds.
- If a consistent score meaning across groups matters most, choose predictive parity.
Write down your choice and the reason. When base rates differ across groups, an imperfect model generally cannot satisfy all three at once, so committing now keeps you honest. The trade-offs behind this are explained in the main guide.
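To make the three options concrete, here is a minimal sketch of how each definition could be turned into a single gap number for a binary classifier. It assumes 0/1 labels and predictions, and it reads predictive parity as equal precision (positive predictive value) across groups, which is one common interpretation. The function names and toy data are hypothetical.

```python
def rate(numerator, denominator):
    """Return a ratio, or None when the denominator is zero."""
    return numerator / denominator if denominator else None

def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, TPR, FPR, and PPV for 0/1 labels."""
    out = {}
    for g in sorted(set(groups)):
        rows = [(t, p) for t, p, gg in zip(y_true, y_pred, groups) if gg == g]
        tp = sum(1 for t, p in rows if t == 1 and p == 1)
        fp = sum(1 for t, p in rows if t == 0 and p == 1)
        fn = sum(1 for t, p in rows if t == 1 and p == 0)
        tn = sum(1 for t, p in rows if t == 0 and p == 0)
        out[g] = {
            "selection_rate": rate(tp + fp, len(rows)),
            "tpr": rate(tp, tp + fn),
            "fpr": rate(fp, fp + tn),
            "ppv": rate(tp, tp + fp),
        }
    return out

def spread(rates, key):
    """Largest minus smallest value of one metric across groups."""
    values = [r[key] for r in rates.values()]
    if any(v is None for v in values):
        raise ValueError(f"{key} is undefined for at least one group")
    return max(values) - min(values)

def fairness_gap(rates, definition):
    """Map the chosen definition to the gap it asks you to close."""
    if definition == "demographic_parity":
        return spread(rates, "selection_rate")
    if definition == "equalized_odds":
        # Both error-rate gaps must be small; report the larger one.
        return max(spread(rates, "tpr"), spread(rates, "fpr"))
    if definition == "predictive_parity":
        return spread(rates, "ppv")
    raise ValueError(f"unknown definition: {definition}")

# Toy data: two groups of eight examples each.
y_true = [1, 1, 1, 1, 0, 0, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0] + [1, 1, 1, 0, 0, 0, 0, 0]
groups = ["a"] * 8 + ["b"] * 8

rates = group_rates(y_true, y_pred, groups)
for name in ("demographic_parity", "equalized_odds", "predictive_parity"):
    print(name, round(fairness_gap(rates, name), 3))
```

On this toy data the three definitions give three different gaps, which is why the choice has to be made before looking at the results.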
Step 3: Compute Metrics Per Group
Now run the numbers, but never in aggregate only.
What to calculate
- For each group, compute the model's selection rate, true positive rate, false positive rate, and calibration.
- Build a per-group confusion matrix. Lay them side by side.
- Calculate the gap between the best-performing and worst-performing group for each metric. That gap is your headline number.
A model at 91 percent aggregate accuracy that runs 96 percent for one group and 70 percent for another does not have a 91 percent problem. It has a 26-point problem.
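The arithmetic behind that example can be reproduced with a small sketch. The confusion-matrix counts below are invented so that the totals match the 91, 96, and 70 percent figures; the group names and helper functions are hypothetical.

```python
from collections import defaultdict

def confusion_by_group(y_true, y_pred, groups):
    """Build one confusion matrix (as a dict of counts) per group."""
    cells = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # "t"/"f" says whether the prediction was right; "p"/"n" is the prediction.
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        cells[g][key] += 1
    return dict(cells)

def accuracy_of(cell):
    return (cell["tp"] + cell["tn"]) / sum(cell.values())

def expand(group, tp, fp, fn, tn):
    """Turn confusion counts into per-example (truth, prediction, group) rows."""
    rows = [(1, 1)] * tp + [(0, 1)] * fp + [(1, 0)] * fn + [(0, 0)] * tn
    return [(t, p, group) for t, p in rows]

# Invented counts: group A has 1050 examples, group B has 250.
rows = expand("A", tp=500, fp=22, fn=20, tn=508) + expand("B", tp=90, fp=40, fn=35, tn=85)
y_true, y_pred, groups = zip(*rows)

matrices = confusion_by_group(y_true, y_pred, groups)
accuracy = {g: accuracy_of(m) for g, m in matrices.items()}
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
gap = max(accuracy.values()) - min(accuracy.values())

for g, m in sorted(matrices.items()):
    print(g, m, "accuracy:", round(accuracy[g], 2))
print("overall:", round(overall, 2), "gap:", round(gap, 2))
```

The aggregate comes out at 0.91 while the per-group accuracies are 0.96 and 0.70, giving the 0.26 gap that the aggregate hides.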
One practical warning: when a group is small in your evaluation set, its metrics will be noisy. A 70 percent accuracy on a subgroup of forty examples has a standard error of about seven points, so it could easily land ten points higher or lower from sampling chance alone. Before you act on a gap, sanity-check that the group is large enough to trust the number. If it is not, the finding is not "this group is treated unfairly" but "we do not have enough data to know," which is itself a fairness problem worth recording.
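One way to run that sanity check is to put a confidence interval around each per-group rate. The sketch below uses the Wilson score interval, a standard interval for a proportion; choosing it, and the 95 percent level implied by `z=1.96`, are assumptions of this example and not something the procedure prescribes.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95 percent)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 28 correct out of 40 is the 70 percent subgroup from the text.
low, high = wilson_interval(28, 40)
print(f"70% on n=40:  {low:.2f} to {high:.2f}")

# The same observed rate on a larger subgroup gives a much tighter range.
low_big, high_big = wilson_interval(700, 1000)
print(f"70% on n=1000: {low_big:.2f} to {high_big:.2f}")
```

For forty examples the interval runs from roughly 55 to 82 percent, wide enough that a seemingly large gap against another group may not be distinguishable from noise.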
Step 4: Trace the Gap to Its Source
A measured disparity is a symptom. Find the cause before you reach for a fix.
Diagnostic questions
- Is the smaller group underrepresented in training data? Thin data produces unreliable predictions.
- Were labels generated by a process that treated groups differently?
- Is a proxy feature carrying the protected attribute's signal?
- Did the metric you optimized reward majority performance at the minority's expense?
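Each cause points to a different remedy. Reweighting fixes representation problems; relabeling fixes label problems; removing or transforming a feature addresses proxy problems; changing the training objective, or adding a fairness constraint to it, addresses optimization problems.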
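Two of the diagnostic questions lend themselves to a quick numerical screen. The sketch below reports each group's share of the training data, and estimates how well a single categorical feature guesses the protected attribute compared with always guessing the majority group. The feature name, the data, and both helper functions are hypothetical, and the proxy check is computed in-sample, so treat it as a rough screen and not as a test.

```python
from collections import Counter, defaultdict

def representation_share(groups):
    """Fraction of training examples belonging to each group."""
    counts = Counter(groups)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()}

def proxy_strength(feature_values, groups):
    """Compare guessing the group from a feature against a majority baseline.

    Returns (baseline_accuracy, feature_accuracy). A feature accuracy well
    above the baseline suggests the feature carries group information.
    """
    total = len(groups)
    baseline = Counter(groups).most_common(1)[0][1] / total
    by_value = defaultdict(Counter)
    for value, group in zip(feature_values, groups):
        by_value[value][group] += 1
    # For each feature value, guess the group seen most often with it.
    hits = sum(c.most_common(1)[0][1] for c in by_value.values())
    return baseline, hits / total

# Toy training set: 80 examples from group A, 20 from group B.
groups = ["A"] * 80 + ["B"] * 20
region = ["north"] * 75 + ["south"] * 5 + ["north"] * 2 + ["south"] * 18

share = representation_share(groups)
baseline, with_feature = proxy_strength(region, groups)
print("share:", share)
print("baseline:", baseline, "with feature:", with_feature)
```

Here the hypothetical `region` feature identifies the group correctly 93 percent of the time against an 80 percent baseline, which would make it worth a closer look as a possible proxy.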
Step 5: Apply the Cheapest Fix That Works
Mitigation has three points of intervention. Try them in order of how much control and cost each requires.
The order to try
- Pre-processing (data): reweight or resample to balance representation. Start here if you own the data.
- In-processing (training): add a fairness constraint to the objective. Use this when you control training and pre-processing was not enough.
- Post-processing (output): adjust thresholds per group. Reserve this for black-box models you cannot retrain, and be aware that, legally, it can resemble explicit group-based treatment.
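For the pre-processing option, one widely used scheme (often called reweighing) gives each example a weight so that group membership and label become statistically independent in the weighted data. The sketch below is one possible version of that idea, not the only way to reweight; the function name and toy counts are made up.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Weight each example by P(group) * P(label) / P(group, label).

    Combinations that are rarer than independence would predict get
    weights above 1; over-represented combinations get weights below 1.
    """
    n = len(labels)
    n_group = Counter(groups)
    n_label = Counter(labels)
    n_joint = Counter(zip(groups, labels))
    return [
        n_group[g] * n_label[y] / (n * n_joint[(g, y)])
        for g, y in zip(groups, labels)
    ]

# Toy training set: group A is 75 percent positive, group B only 25 percent.
groups = ["A"] * 80 + ["B"] * 20
labels = [1] * 60 + [0] * 20 + [1] * 5 + [0] * 15
weights = reweighing_weights(groups, labels)

def weighted_positive_rate(group):
    rows = [(w, y) for w, y, g in zip(weights, labels, groups) if g == group]
    return sum(w for w, y in rows if y == 1) / sum(w for w, _ in rows)

for g in ("A", "B"):
    print(g, round(weighted_positive_rate(g), 3))
```

After weighting, both groups show the same positive rate as the dataset overall. Many training APIs accept per-example weights, so the list can usually be passed straight to training, but check how your own framework expects them.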
Re-measure after every change. A fix that closes one gap can open another, which is exactly the kind of failure documented in the common mistakes article.
Resist the temptation to jump straight to post-processing because it is the fastest. Adjusting thresholds per group changes the output without touching the cause, so the underlying weakness, often thin data or bad labels, remains. It also raises the question of treating groups differently at the decision point, which in some regulated domains can be precisely what the law forbids. Use it only when you genuinely cannot retrain, and when you do, write down exactly why.
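For the cases where post-processing really is the only option, the mechanics look roughly like the sketch below. It picks, for each group, the highest score cutoff that still reaches a target true positive rate, which is one possible criterion among several. The target of 0.75, the scores, and the function names are illustrative assumptions, and the legal caveats above apply in full.

```python
import math

def threshold_for_tpr(scores, labels, target_tpr):
    """Highest cutoff at which predicting 1 for score >= cutoff reaches target_tpr."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("group has no positive examples")
    # Number of positives that must score at or above the cutoff.
    # The small epsilon guards against floating-point overshoot in the product.
    needed = max(1, math.ceil(target_tpr * len(positives) - 1e-9))
    return positives[needed - 1]

def per_group_thresholds(scores, labels, groups, target_tpr):
    thresholds = {}
    for g in sorted(set(groups)):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        thresholds[g] = threshold_for_tpr(
            [scores[i] for i in idx], [labels[i] for i in idx], target_tpr
        )
    return thresholds

def tpr_at(scores, labels, groups, group, cutoff):
    pos = [s for s, y, g in zip(scores, labels, groups) if y == 1 and g == group]
    return sum(1 for s in pos if s >= cutoff) / len(pos)

# Toy scores: the model scores group B's positives lower across the board.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1,
          0.6, 0.5, 0.45, 0.2, 0.4, 0.3, 0.1, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 2
groups = ["A"] * 8 + ["B"] * 8

thresholds = per_group_thresholds(scores, labels, groups, target_tpr=0.75)
print(thresholds)
for g, cutoff in thresholds.items():
    print(g, "TPR:", tpr_at(scores, labels, groups, g, cutoff))
```

With a single shared cutoff of 0.7, group B's true positive rate in this toy data would be zero. The per-group cutoffs equalize it, but the model still scores group B's positives lower, which is the unaddressed cause the paragraph above warns about.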
Step 6: Document and Set a Monitor
An audit that lives in your head is worthless next quarter.
What to record and watch
- Write a one-page summary: groups, chosen definition, measured gaps before and after, and the fix applied.
- Add per-group metrics to your production monitoring, not just aggregate accuracy.
- Define a threshold that triggers a re-audit when subgroup performance drifts.
- Name the person accountable for the next review.
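The re-audit trigger can be as simple as comparing current per-group metrics with the values recorded at audit time. The sketch below is one hypothetical way to do that; the `max_drop` and `max_gap` values are placeholders you would set for your own use case, and the metric could be whichever one your chosen fairness definition depends on.

```python
def reaudit_reasons(baseline, current, max_drop=0.03, max_gap=0.10):
    """Return a list of reasons to re-audit; an empty list means no trigger.

    baseline: per-group metric recorded at the last audit.
    current: per-group metric from production monitoring.
    max_drop, max_gap: illustrative thresholds, not recommended values.
    """
    reasons = []
    for group, base_value in baseline.items():
        value = current.get(group)
        if value is None:
            reasons.append(f"{group}: no current metric reported")
        elif base_value - value > max_drop:
            reasons.append(f"{group}: dropped by {base_value - value:.2f}")
    observed = [v for v in current.values() if v is not None]
    if observed and max(observed) - min(observed) > max_gap:
        reasons.append(f"gap between groups is {max(observed) - min(observed):.2f}")
    return reasons

baseline = {"A": 0.95, "B": 0.90}

stable = reaudit_reasons(baseline, {"A": 0.95, "B": 0.89})
drifted = reaudit_reasons(baseline, {"A": 0.95, "B": 0.84})
print("stable: ", stable)
print("drifted:", drifted)
```

In the drifted case both conditions fire: group B has fallen by more than the allowed drop, and the gap between groups has crossed the limit. Either reason alone would be enough to send the model back to Step 3.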
The documentation step feels like overhead until the first time someone questions the model. A written record of "we chose equalized odds, measured these gaps, applied this fix, and accepted this trade-off" turns a defensive scramble into a calm answer. It also protects whoever inherits the model after you, who will otherwise have no idea which fairness decisions were deliberate and which were accidents.
Frequently Asked Questions
How long does a first bias audit actually take?
For a model you already understand with accessible evaluation data, the measurement steps can be done in a few hours. The slow parts are getting the protected attribute for your evaluation set and agreeing on which fairness definition to use. Budget more time for the conversation than for the math.
What if I do not have the protected attribute in my data?
Then you cannot directly measure group fairness, and you should be honest about that limitation. Options include collecting the attribute for a representative evaluation sample, using a carefully validated proxy purely for auditing, or partnering with a team that holds the data. Guessing without it is not auditing.
Should I fix bias even if it lowers accuracy?
Often yes, because a slightly less accurate model that distributes errors fairly can be more defensible and more useful than a marginally more accurate one that fails a group badly. The right call depends on the stakes of the decision the model drives. Document the trade-off you accepted rather than pretending there was none.
Can I automate this whole process?
Parts of it. Metric computation and monitoring can be automated with fairness libraries. The judgment steps (choosing groups, picking a definition, and accepting trade-offs) cannot be automated and should not be. The tools survey covers what can be handed to software.
Key Takeaways
- Choose your protected groups and a single fairness definition before you measure anything.
- Always compute metrics per group and report the gap between best and worst, not the aggregate.
- A measured disparity is a symptom; trace it to data, labels, proxies, or the optimization metric.
- Apply mitigation in order of cost: data, then training, then output thresholds, re-measuring each time.
- Document the audit and add per-group monitoring with a drift threshold and a named owner.