An Overview of Counterfactual Explainability

Within the field of explainable AI (XAI), the technique of counterfactual explainability has progressed rapidly, with many exciting developments in just the past couple of years. To help crystallize and understand the major development areas, we’ll be presenting a new paper at the NeurIPS Workshop on ML Retrospectives, Surveys, and Meta-Analyses. This post provides a high-level summary, but if you’re interested in getting started in this area, we encourage you to check out the full paper.

What are counterfactual explanations?

Counterfactual explanations (CFEs) are an emerging technique for local, example-based, post-hoc explanation. Given a datapoint A and its prediction P from a model, a counterfactual is a datapoint close to A for which the model predicts a different class Q (P ≠ Q). "Close" here means that the counterfactual represents a minimal change to A that is enough to get a different prediction. This can help in scenarios like the rejection of a loan or credit card request, where the applicant wants to know the smallest change to their features that would lead to acceptance of the request.
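One common way to write this definition down is as a constrained minimization. The formulation below is a sketch rather than the definition used by any particular paper: f denotes the model (so f(A) = P), and the distance function d is an assumption that each method chooses for itself.

```latex
x^{\ast} \in \arg\min_{x'} \; d(A, x')
\quad \text{subject to} \quad f(x') = Q, \qquad Q \neq P = f(A)
```

Different choices of d, and different extra constraints on x', lead to different counterfactuals for the same datapoint.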

Explicitly, consider a person who applies for a loan and is rejected. In our simple example, say that the person is represented by a length-3 feature vector x: liquid assets of $10K; outstanding debt of $50K; and annual income of $45K. So, x = ($10K, $50K, $45K). Our fictitious loan agency uses a pre-trained binary “loan acceptance” classifier, f, that takes as input length-3 feature representations of applications and returns one of two labels: y = 0 (reject), or y = 1 (accept). Here, then, the applicant is rejected because f(x) = 0 (reject). Roughly speaking, a counterfactual explanation for this decision would describe a changed version of the applicant’s initial feature vector, call it x’, such that f(x’) = 1 (accept). There may be many possible counterfactual explanations: for example, lowering outstanding debt from $50K to $25K to form x’ = ($10K, $25K, $45K); or increasing liquid assets from $10K to $20K to form x’’ = ($20K, $50K, $45K). Some may be easier or harder for the applicant to attain, and some may be completely impossible to achieve. This motivates research into the creation of “the best” counterfactual explanations for a particular use case, which we discuss in greater depth below.
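To make the loan example concrete, here is a minimal Python sketch. The scoring rule inside f is entirely made up for illustration (the post does not specify the classifier), and is_counterfactual is a hypothetical helper name; only the three feature vectors come from the example above.

```python
from typing import Tuple

# (liquid assets, outstanding debt, annual income), all in $K
Applicant = Tuple[float, float, float]


def f(x: Applicant) -> int:
    """Hypothetical stand-in for the loan classifier: 1 = accept, 0 = reject."""
    assets, debt, income = x
    # Made-up linear scoring rule, chosen only so the example behaves as described.
    score = 2 * assets + income - debt
    return 1 if score >= 25 else 0


def is_counterfactual(x: Applicant, x_new: Applicant) -> bool:
    """x_new is a counterfactual for x if the classifier's label flips."""
    return f(x) != f(x_new)


x = (10.0, 50.0, 45.0)               # original application
x_prime = (10.0, 25.0, 45.0)         # lower debt from $50K to $25K
x_double_prime = (20.0, 50.0, 45.0)  # raise liquid assets from $10K to $20K

for name, candidate in [("x'", x_prime), ("x''", x_double_prime)]:
    print(name, "flips the decision:", is_counterfactual(x, candidate))
```

Under this toy rule both candidates flip the decision from reject to accept, which is exactly the situation where one needs further criteria to choose between them.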

Fig. 1: Illustrative diagram of counterfactual explanations. The datapoint labeled X (blue) is classified in the negative class. CF1 (red) and CF2 (green) are two counterfactuals for X, which the model classifies in the positive class. Several counterfactuals can be generated for a datapoint; they differ in closeness to the original datapoint and in other desirable properties.

Themes of research in CFEs

Much of the literature on counterfactual explanations has proposed algorithms that address additional aspects of the problem. We categorize recent research into the following major themes:

Actionability: A CFE is only useful if it prescribes changes to features that can actually change. It would be unhelpful if I were told to change my birthplace in order to receive a loan.

Sparsity: A useful CFE should modify only a few features in order to be simple and easy to use.

Proximity: A useful CFE should be the smallest possible change that achieves the desired outcome.

Causality: A useful CFE must be able to adhere to any causal constraints that a domain expert specifies. For example, I should not have to decrease my age in order to get a loan.

Data Manifold: A useful CFE should result in a datapoint that is similar to other datapoints seen in the training data. It would be less trustworthy if the resulting datapoint is utterly unlike anything the classifier has ever seen.

Speed: CFEs should be generated quickly for new, incoming datapoints.

Model Access: Some CFE approaches require detailed knowledge of model internals and gradients. Others can work in a black-box fashion and are model-agnostic.
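Several of these themes can be turned into simple checks on a candidate counterfactual. The sketch below is one possible reading, not a method from the survey: sparsity is taken as the number of changed features, proximity as the L1 distance, actionability as "no immutable feature changed", and the causal constraint as "some features may not decrease". All function names, the fourth feature (age), and the index sets are assumptions introduced for illustration.

```python
from typing import List, Sequence, Set


def changed_features(x: Sequence[float], x_cf: Sequence[float], tol: float = 1e-9) -> List[int]:
    """Indices of the features that differ between x and the candidate x_cf."""
    return [i for i, (a, b) in enumerate(zip(x, x_cf)) if abs(a - b) > tol]


def sparsity(x: Sequence[float], x_cf: Sequence[float]) -> int:
    """Number of modified features (fewer is simpler)."""
    return len(changed_features(x, x_cf))


def proximity(x: Sequence[float], x_cf: Sequence[float]) -> float:
    """L1 distance between x and x_cf (smaller is closer); one choice among many."""
    return sum(abs(a - b) for a, b in zip(x, x_cf))


def is_actionable(x: Sequence[float], x_cf: Sequence[float], immutable: Set[int]) -> bool:
    """True if no feature marked immutable was changed."""
    return not any(i in immutable for i in changed_features(x, x_cf))


def respects_non_decreasing(x: Sequence[float], x_cf: Sequence[float], non_decreasing: Set[int]) -> bool:
    """True if every feature that may only grow (e.g. age) did not decrease."""
    return all(x_cf[i] >= x[i] for i in non_decreasing)


# Hypothetical features: (liquid assets $K, outstanding debt $K, annual income $K, age)
x = (10.0, 50.0, 45.0, 30.0)
cf_debt = (10.0, 25.0, 45.0, 30.0)    # pay down debt
cf_assets = (20.0, 50.0, 45.0, 30.0)  # increase liquid assets
cf_age = (10.0, 50.0, 45.0, 25.0)     # "become younger": violates the causal constraint

for name, cf in [("cf_debt", cf_debt), ("cf_assets", cf_assets), ("cf_age", cf_age)]:
    print(name, sparsity(x, cf), proximity(x, cf), respects_non_decreasing(x, cf, {3}))
```

In this toy setup cf_debt and cf_assets each change a single feature, but cf_assets is closer to x under the L1 distance, while cf_age is ruled out because age is not allowed to decrease. Real methods typically need feature scaling before distances across features are comparable.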

In our survey paper, we collect, review, and categorize 39 recent papers that propose algorithms to solve the counterfactual explanation problem. We design a rubric of desirable properties for counterfactual explanation algorithms and evaluate all currently proposed algorithms against that rubric. This makes it easy to compare the advantages and disadvantages of different approaches and serves as an introduction to major research themes in this field. We also identify gaps and discuss promising research directions in the space of counterfactual explainability.


CFEs present a compelling form of XAI, providing users with understandable and actionable feedback. The additional constraints and desiderata explored in recent years seek to ensure that these explanations are always reasonable and useful. Many exciting open questions remain, and we close our paper by proposing research challenges for the community to tackle in the coming years. We firmly believe that CFEs will form a long-lasting part of the ML explainability toolkit.