Fairness in Machine Learning is Tricky
April 8, 2020
Non-experts and experts alike have trouble even understanding popular definitions of fairness in machine learning — let alone agreeing on which definitions, if any, should be used in practice.
Human decision-making processes are known to be biased. Look at the promotions process at a typical large company. In the words of Tomas Chamorro-Premuzic, the Chief Talent Scientist at one of the world’s largest staffing firms, “… most companies focus on the wrong traits, hiring on confidence rather than competence, charisma rather than humility, and narcissistic tendencies rather than integrity.”
This observation isn’t new, and myriad other examples exist. In Australia, if the name on your resume “sounds” Middle Eastern or Chinese, you are less likely to be hired; in the US, replacing a male-sounding name with a female-sounding one on an otherwise identical CV can result in a lower chance of offer, as well as a lower starting salary. When giving a negative medical evaluation, doctors exhibit similar levels of implicit bias based on race and gender as the general public might — which is to say, quite a bit. And, while open and explicit redlining for banking and insurance in the US is no longer legal, there is evidence that its impacts are still implicitly felt today.
Can we just take humans out of the loop and rely on cold, hard data? Initially, some proponents of automated decisioning techniques stemming from the data mining and machine learning (ML) communities pushed for this. But algorithms take data as input, along with whatever biases come along with how those data were — or weren’t — sampled. This includes: which features were stored, which humans made those decisions, and disparities in sample sizes of different subgroups of inputs, among others. Worse still, often any resultant discrimination is an emergent, i.e. learned, property of the system (rather than a hand-coded rule written by an algorithm designer), making identification and any partial mitigation of the issue more difficult.
Remember our earlier examples of bias in human decision-making systems for promotions, hiring, healthcare, and banking? Their counterparts have all been found in the automated versions of those systems as well. So, bias in ML-based decisioning processes is ubiquitous, just like bias in human decision-making processes. Can we do anything about it? Yes, and we should — but it’s important to set our goals realistically, and keep both stakeholders and domain experts heavily and continually involved.
One approach taken by the machine learning community is explicitly “defining fairness” — that is, proposing different metrics of fairness as well as approaches to encoding that into machine learning pipelines.
Some of these definitions are explicitly or implicitly based on existing legal doctrine. For example, channeling Title VII of the Civil Rights Act of 1964, an algorithm is said to result in disparate impact if it adversely affects one group of people of a protected characteristic (aka “sensitive attribute”) over another. Similarly, that algorithm is said to result in disparate treatment if its decisioning is performed in part based on membership in a group. Then, one goal that a fairness in machine learning practitioner might have is to mathematically certify that an algorithm does not suffer from disparate treatment or disparate impact, perhaps given some expected use case or input distribution.
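To make the disparate impact idea concrete, here is a minimal sketch of one common way to quantify it: compare the rate of favorable decisions across groups. This is an illustration and not a legal test. The function names, the toy data, and the choice of a "lowest rate over highest rate" ratio are my assumptions; the 0.8 cutoff mentioned in the comments echoes the "four-fifths" rule of thumb from US employment guidelines and is only one possible threshold.

```python
from collections import defaultdict


def selection_rates(groups, decisions):
    """Return the fraction of favorable decisions (1) within each group."""
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for g, d in zip(groups, decisions):
        totals[g] += 1
        favorable[g] += d
    return {g: favorable[g] / totals[g] for g in totals}


def disparate_impact_ratio(groups, decisions):
    """Lowest group selection rate divided by the highest group selection rate."""
    rates = selection_rates(groups, decisions)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was selected, so the rates do not differ
    return min(rates.values()) / highest


# Hypothetical toy data: 10 people in group "a", 10 in group "b".
groups = ["a"] * 10 + ["b"] * 10
decisions = [1] * 6 + [0] * 4 + [1] * 3 + [0] * 7  # 1 = favorable outcome

ratio = disparate_impact_ratio(groups, decisions)
# A ratio below 0.8 is often treated as a signal worth investigating
# (an assumed rule of thumb, not a certification of unfairness).
flagged = ratio < 0.8
print(selection_rates(groups, decisions), ratio, flagged)
```

Note that this only measures outcomes. It says nothing about disparate treatment, which concerns how the decision was made.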
Toward that end, let’s dive a bit deeper into disparate treatment — specifically, what does it mean to make a decision based on membership in a group? Well, the algorithm could formally discriminate, that is, take as input explicit membership in a group, and then use that in some way to determine its output. This is often illegal to do, so many systems already do not do this. Yet, given a rich enough dataset, membership in a protected group is almost surely redundantly encoded, at least to some extent, by other features. Indeed, unless the target is completely uncorrelated with membership in a protected group, given enough data a well-trained model will completely recapture the protected group membership’s impact on the target — without ever having explicit access to that particular (protected) feature!
A standard example treats “race” as a protected group in a dataset including features such as “zip code.” Here, observing “zip code” alone often provides strong signal about “race” — even without explicit access to the “race” feature. So, do we remove “zip code” from the input as well? Maybe not, because it’s quite likely some other set of features correlate with “zip code,” too, and we’re back to the drawing board. Indeed, and in general, it’s not immediately clear how to write down a formal set of rules to enforce accepted legal definitions of fairness in decisioning systems.
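The "zip code" point can be checked directly on a dataset: how well can the protected attribute be guessed from the supposedly innocuous feature alone? Below is a minimal sketch using only a per-value majority vote. The records, zip codes, and group labels are synthetic and hypothetical, and the accuracy is measured in-sample, so treat it as a rough diagnostic and not a rigorous leakage test.

```python
from collections import Counter, defaultdict


def proxy_accuracy(proxy_values, protected_values):
    """In-sample accuracy of guessing the protected attribute from the proxy,
    by predicting the most common protected value seen for each proxy value."""
    by_proxy = defaultdict(Counter)
    for p, a in zip(proxy_values, protected_values):
        by_proxy[p][a] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_proxy.values())
    return correct / len(protected_values)


def baseline_accuracy(protected_values):
    """Accuracy of always guessing the overall most common protected value."""
    counts = Counter(protected_values)
    return counts.most_common(1)[0][1] / len(protected_values)


# Synthetic, hypothetical records: (zip_code, group).
records = (
    [("00001", "x")] * 45 + [("00001", "y")] * 5
    + [("00002", "x")] * 5 + [("00002", "y")] * 45
)
zips = [z for z, _ in records]
protected = [g for _, g in records]

base = baseline_accuracy(protected)      # guessing without the proxy
leak = proxy_accuracy(zips, protected)   # guessing with the proxy
print(base, leak)
```

A large jump from the baseline to the proxy-based accuracy suggests that dropping the protected column alone does little to hide it from a model.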
It’s not the point of this article to completely overview the state-of-the-art in definitions of fairness in machine learning; many definitions have been proposed,¹ and many people have written² about them. My point so far is that it’s tough or impossible to write down agreed-upon legal rules and definitions using formal mathematics — even for “simple” systems performing binary classification on relatively small, well-defined inputs. For the sake of discussion, though, let’s say we have decided on a definition of fairness,³ and we have been able to write it down using formal mathematics, ready to be put into our ML pipeline. Now what?
Given a precisely specified definition of fairness implemented in a machine-learning-based system, it is natural to ask what the people impacted by that system (i) understand about the system itself and (ii) think about the rules under which it is operating.
Ditto with the operators of the system, as well as other stakeholders (e.g., policymakers, lawyers, domain experts). And, when different classes of stakeholder have different opinions about what “fairness” means, how should we manage that?
Let’s start with a simpler setting: asking one class of stakeholder if they comprehend well-known definitions of fairness. In joint work⁴ with researchers at Maryland and Berkeley ICSI, we recently did just this: we created a metric to measure comprehension of three common definitions: demographic parity, equal opportunity, and equalized odds, and then evaluated it using an online survey with the goal of investigating relationships between demographics, comprehension, and sentiment.
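For readers who have not seen these three definitions written down, here is a minimal sketch of how they are commonly measured for a binary classifier. It uses the standard textbook formulations (equal selection rates, equal true positive rates, and equal true and false positive rates across groups), not the wording shown to our survey participants. The function names and toy data are hypothetical.

```python
def rate(values):
    """Mean of a list of 0/1 values; NaN if the list is empty."""
    return sum(values) / len(values) if values else float("nan")


def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, true positive rate, and false positive rate."""
    out = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        out[g] = {
            "selection_rate": rate([y_pred[i] for i in idx]),
            "tpr": rate([y_pred[i] for i in idx if y_true[i] == 1]),
            "fpr": rate([y_pred[i] for i in idx if y_true[i] == 0]),
        }
    return out


def gap(rates, key):
    """Largest difference in one rate between any two groups."""
    vals = [r[key] for r in rates.values()]
    return max(vals) - min(vals)


def fairness_gaps(y_true, y_pred, groups):
    """A gap of 0 means the corresponding definition holds exactly on this data."""
    rates = group_rates(y_true, y_pred, groups)
    tpr_gap = gap(rates, "tpr")
    return {
        "demographic_parity": gap(rates, "selection_rate"),
        "equal_opportunity": tpr_gap,
        # Equalized odds needs both the TPR and the FPR to match.
        "equalized_odds": max(tpr_gap, gap(rates, "fpr")),
    }


# Hypothetical toy data: four applicants in each of two groups.
groups = ["f"] * 4 + ["m"] * 4
y_true = [1, 1, 0, 0, 1, 1, 0, 0]  # 1 = actually qualified
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]  # 1 = received an offer

gaps = fairness_gaps(y_true, y_pred, groups)
print(gaps)
```

In practice an exact gap of 0 is rare, so any use of these numbers needs an agreed-upon tolerance, which is itself a judgment call.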
In our study, fairness definitions were presented in multiple real-world scenarios (e.g., in our vignette on hiring, demographic parity was described as “[t]he fraction of applicants who receive job offers that are female should equal the fraction of applicants that are female. Similarly, the fraction of applicants who receive job offers that are male should equal the fraction of applicants that are male”). Then, comprehension and sentiment questions were asked. Some takeaways:
- Education is a strong predictor of comprehension, at least for the accessible explanations of fairness used in our study. The negative impacts of ML-based systems are expected to disproportionately impact some segments of society, for example by displacing employment opportunities for those with the least education. Thus, that already-at-risk group’s ability to effectively advocate for its members may be adversely impacted by lower comprehension.
- Weaker comprehension correlates with less negative sentiment toward fairness rules. One way to interpret this is that those with the lowest comprehension of fairness concerns in ML systems would also be the least likely to protest against them.
One promising direction is to learn stakeholders’ views about fairness via simulation or observation of actions over time. Some research has been done in this space. For example, researchers at ETH Zürich fit functions to users’ preferences over a finite and pre-determined feature space using pairwise comparisons of simulated profiles. They found that the well-known notion of demographic parity aligned reasonably well with their human subjects’ responses. One outstanding issue in this study, and in most preference and moral value judgment aggregation studies in this space, is the lack of consideration of different classes of stakeholder. For example, how can we combine the input of a layperson with that of a (domain expert) doctor in a healthcare setting, especially when people’s judgments often disagree?
So what do stakeholders of the same type want, in general? Perhaps unsurprisingly, leading technology firms such as Microsoft and Google have taken steps in this direction with respect to the applied production settings that their engineers encounter. For example:
- Researchers at Microsoft Research surveyed practitioners from 25 ML product teams in 10 major technology firms, and found (broadly) that the “fair ML” research literature focuses too specifically on methods to assess biases, and would benefit from focusing more broadly on the full machine learning pipeline. A striking example from their paper involved an image labeler in a computer vision application systematically labeling female doctors as “nurses,”⁵ which would then serve as “gold standard” input to any downstream algorithms.
- Researchers from Google recently published a case study of a fielded classification system where adverse actions are taken against examples predicted to be in the positive class (e.g., “if you are predicted to be a spammer, then I will deactivate your account”). They found that it is difficult to even measure a direct-from-the-literature definition (equality of opportunity), and then give a series of steps they took to build a more applicable tweak to that definition into their models.
Research suggests that: (i) laypeople largely do not understand the accepted definitions of fairness in machine learning; (ii) those who do understand those definitions do not like them; (iii) those who do not understand them could be further marginalized; and (iv) practitioners are not being served well by the current focus of the fairness in ML community. It sounds negative, but there are explicit next steps to take to help mitigate these issues. Read on!
Earlier, I asked if we could “do anything about it” when it comes to the difficulties — and often, the impossibilities — of deciding on, and enforcing, fairness in decisioning systems. The answer is, still, that we can — but responsibly, with input from all appropriate parties, and with an understanding that there is no panacea.
Below, I give a non-exhaustive list of action items for researchers and practitioners in the “fair ML” space.
- We need to understand what (lay)people perceive to be fair decision making. Given an explicit definition of fairness, is it understood, and is it acceptable to a wide audience? If particular subgroups of the general population do not comprehend parts of automated systems that impact them, then they will be more easily disadvantaged.
- Dovetailing with the above, we need to understand what specialists perceive to be fair decision making as well — and what tools they would need to help them do their jobs. This could mean developing tools to help audit systems, or to better curate high-quality and well-sampled input datasets, or to permit faster exploratory data analysis (EDA) to help find holes in the input and output of prototype or deployed systems.
- We need techniques that, given a definition of fairness or of bias, can measure at enterprise-scale whether or not an ML-based system is adhering to that definition or those definitions — and, if not, (i) describe by how much and (ii) alert humans, when appropriate, if the system deviates beyond an acceptable level.
- Additionally, effective UI/UX will be required to allow stakeholders of all walks to comprehend the state-of-the-art in various fielded automated systems. Fielded systems ingest (incomplete) high-dimensional data and output high-dimensional data, over time. Communicating the state of a system vis à vis particular definitions of fairness and bias in a human-understandable way is paramount.
- Quoting directly from the Microsoft Research study discussed earlier, “[a]nother rich area for future research is the development of processes and tools for fairness-focused debugging.” Debugging tools with a fairness focus would help practitioners identify, e.g., under-sampled portions of an input dataset, or previously overlooked subgroups being adversely impacted by new decisioning rules.
- Finally, we need to develop shared languages between all involved parties, but particularly engineers, laypeople, and policymakers. Engineers implement, policymakers make society-wide rules — and laypeople are impacted by the interaction between the two. All three need to understand the wants, incentives, and limitations of the others through open and continuous communication.
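As a rough illustration of the measurement, alerting, and debugging items in the list above, here is a minimal sketch of a per-batch audit. It reports how far a system's decisions deviate from equal selection rates, raises an alert flag past a tolerance, and lists subgroups that are too small to measure reliably. The function names, the choice of metric, and the thresholds (`max_gap`, `min_count`) are all hypothetical placeholders that stakeholders would need to set.

```python
from collections import Counter


def selection_rate_gap(groups, decisions):
    """Difference between the highest and lowest group selection rates."""
    totals, favorable = Counter(), Counter()
    for g, d in zip(groups, decisions):
        totals[g] += 1
        favorable[g] += d
    rates = [favorable[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


def under_sampled_groups(groups, min_count):
    """Groups with fewer than min_count examples in this batch."""
    counts = Counter(groups)
    return sorted(g for g, n in counts.items() if n < min_count)


def audit_batch(groups, decisions, max_gap=0.1, min_count=30):
    """Summarize one batch: the size of the gap, whether to alert a human,
    and which groups are too small for the gap to be trusted."""
    gap = selection_rate_gap(groups, decisions)
    return {
        "gap": gap,
        "alert": gap > max_gap,
        "under_sampled": under_sampled_groups(groups, min_count),
    }


# Hypothetical batch: groups "a" and "b" are well represented, "c" is not.
groups = ["a"] * 40 + ["b"] * 40 + ["c"] * 5
decisions = (
    [1] * 20 + [0] * 20    # group a: selection rate 0.50
    + [1] * 18 + [0] * 22  # group b: selection rate 0.45
    + [1] * 1 + [0] * 4    # group c: selection rate 0.20
)

report = audit_batch(groups, decisions)
print(report)
```

Here the alert is driven largely by a group with only five examples, which is exactly the kind of situation where a human needs to look before anyone draws a conclusion.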
Throughout, it is important to balance prescriptive and descriptive approaches to understanding, measuring, and implementing fairness in machine-learning-based systems. Prescriptive approaches necessarily assume some consensus around what “should” occur under a (societally decided-upon) definition of fairness, whereas descriptive approaches focus more on uncovering what that consensus might be in the first place.
Researchers and practitioners interested in fairness in machine learning (myself included!) have focused too much on the former, largely due to its amenability to mathematical characterization. Yet the latter is nowhere close to understood, and is an absolutely necessary complement, if not precursor, to more formal prescriptive approaches. That will, of course, require in-depth discussions with stakeholders of all walks: laypeople, policymakers, politicians, ethicists, lawyers, and domain experts. These discussions will in turn need to be complemented with accurate and scalable techniques that measure and communicate real-world systems’ adherence to various definitions of bias and fairness in machine learning, so that stakeholders can provide feedback to further improve automated decision systems’ performance in practice.
Thanks to Michelle Mazurek, Liz O’Sullivan, and Monica Watson for comments on earlier versions of this piece.
¹ I’d urge you to check out the (free, ever-updating) Fairness and Machine Learning book by field experts Barocas, Hardt, & Narayanan, a formally published overview by to-be-ArthurAI researcher Verma & Rubin, or the proceedings of area-specific conferences such as FAccT and AIES.
² I’d typically also recommend Wikipedia, but at the time of writing, the Wikipedia page for Fairness (machine learning) is a bit of a mess, with fourteen different binary-classification-centric fairness criteria roughly defined amongst a mess of mathematics and essentially nothing else. This one sentence in the third paragraph of the introduction really sums it up, though: “The algorithms used for assuring fairness are still being improved.” Still a ways to go!
³ This is a pretty strong assumption! Indeed, it’s almost always impossible to create a system that ensures three “reasonable” definitions of fairness, even in binary classification: calibration, a form of proportional treatment based on relative group size; balance for the negative class, which roughly states that people in the (true) “zero” class should be scored the same; and, balance for the positive class, which is its complement for the (true) “one” class.
⁴ Our study is ongoing. The working paper is available as “Measuring Non-Expert Comprehension of Machine Learning Fairness Metrics,” with authors Debjani Saha, Candice Schumann, Duncan McElfresh, John Dickerson, Michelle Mazurek, and Michael Tschantz. An initial report on our study, titled “Human Comprehension of Fairness in Machine Learning,” appeared at the 2020 ACM/AAAI Conference on AI, Ethics, & Society (AIES-20).
⁵ Here, the practitioners used a “failsoft” solution to mitigate this input bias and combined the “nurse” and “doctor” labels in their input dataset. Yet, without explicit monitoring, this systematic labeling error likely would’ve gone undiscovered.