Our top takeaways from NeurIPS 2020 on Responsible Machine Learning

No longer the niche sub-field it was several years ago, "fair" ML and related topics have become a core part of machine learning. At NeurIPS 2020, the broader theme of responsibility in ML (even beyond fairness) had a much larger footprint than in years past; at a more granular level, fair ML as a field has also developed substantially, asking a broader range of questions critical to operationalizing fairness.

From the start, NeurIPS 2020 emphasized the importance of thinking deeply about the implications of machine learning research: the conference kicked off with a keynote from Georgia Tech's Charles Isbell, titled "You Can't Escape Hyperparameters and Latent Variables: Machine Learning as a Software Engineering Enterprise." Despite its somewhat dry title, the talk is anything but. Isbell makes the case that machine learning researchers should approach their work the way software engineers must: by considering how the usage of software should shape its development. He dispels the myth that bias is "only due to biased datasets" and therefore nothing individual scientists need concern themselves with; this keynote set the tone for how the rest of the conference unfolded. (You can watch the full video of the keynote; it is nothing like any pre-recorded talk I've ever seen, and I cannot recommend it more strongly.)

The growing community around fair/responsible ML is clear not only in the papers accepted to the main conference, but also in the workshops, several of which were dedicated to concerns in developing fair ML (Resistance AI; Dataset Curation and Security; Fair AI in Finance; Algorithmic Fairness through the Lens of Causality and Interpretability; Consequential Decisions in Dynamic Environments) and several more that touched on related themes: ML Retrospectives, Surveys, and Meta-Analyses; ML for Economic Policy; Broader Impacts of AI Research; Human and Machine-in-the-Loop Evaluation and Training Strategies. In the rest of this piece, I will highlight contributions from both the workshops and the main conference.

At Arthur, we’re really excited about the development of these research directions:

Fairness for a broader range of algorithms and applications.

Most early work in fair ML focused on binary classifiers with a single protected attribute, and the vast majority of open-source implementations available today still support only that setting. Of course, there are many other problem settings where fairness may be a concern, and this year's NeurIPS saw the introduction of new fair ML algorithms for many more scenarios: clustering, streaming, online learning, regression models, overlapping group membership, multiple classes, and settings without access to demographic information.
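To make the "standard setting" concrete, here is a minimal, hypothetical sketch of what auditing a binary classifier on a single binary protected attribute looks like, using the demographic parity gap (the function name and toy data are my own, not from any paper discussed here):

```python
# Illustrative only: demographic parity difference for a binary classifier
# audited on one binary protected attribute (the classic fair-ML setting).

def demographic_parity_difference(y_pred, group):
    """Absolute gap in positive-prediction rates between the two groups."""
    rates = {}
    for g in (0, 1):
        preds = [p for p, a in zip(y_pred, group) if a == g]
        rates[g] = sum(preds) / len(preds)
    return abs(rates[0] - rates[1])

# Toy predictions for 8 individuals: group 0 receives positives at a 3/4
# rate, group 1 at a 1/4 rate, so the gap is 0.5.
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
print(demographic_parity_difference(y_pred, group))  # 0.5
```

The newer settings listed above (clustering, regression, overlapping groups, and so on) are precisely those where this simple two-group, two-label computation no longer applies directly.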

More nuanced ways of measuring, understanding, and communicating fairness.

While metrics of fairness based on output distributions and group-conditional error rates have been common for some time now, some exciting lines of work have been developing around more nuanced approaches to measurement. Can I Trust My Fairness Metric? emphasizes the high variance of fairness metrics when batch or dataset sizes are small, and introduces a way to generate more accurate and lower-variance estimates of metric values; meanwhile, Measuring Bias with Wasserstein Distance, from the Dataset Curation and Security workshop, proposes an alternate metric for fairness that captures inequity that may be missed by current standard metrics. Opportunities for a More Interdisciplinary Approach to Perceptions of Fairness in ML, from the ML Retrospectives, Surveys, and Analyses workshop (ML-RSA), draws insight from the field of psychology to discuss how common fairness metrics are interpreted by human end-users.

Analyzing the qualities of fair algorithms, such as robustness and downstream implications.

Complementary to the development of fair algorithms themselves, a substantial amount of new work discusses how fair algorithms might perform under a variety of conditions and situates them in the broader context of deployment. For example, How Do Fair Decisions Fare in Long-term Qualification? considers the impact of static fairness constraints on long-term well-being; the Workshop on Consequential Decision Making in Dynamic Environments was dedicated to work in this area. Similarly, Fair Multiple Decision-Making Through Soft Interventions considers settings with multiple, potentially interacting decision-making algorithms. In a parallel thread, Ensuring Fairness Beyond the Training Data proposes the first (to my knowledge) algorithm to train a fair classifier that is provably robust to a set of possible distribution shifts. More broadly, the Workshop on Algorithmic Fairness through the Lens of Causality and Interpretability featured work contextualizing fair algorithms in many ways, such as Fairness and Robustness in Invariant Learning, which connects topics from causal inference and domain generalization to fairness. The Fair AI in Finance workshop also touched on related issues, particularly as they arise in financial applications.

Concerns of fairness and bias complement concerns of model robustness writ large. Indeed, discussion of the “robustness” of a machine-learning-based system necessarily includes discussions of dataset and model security, policy and privacy, dataset and model bias, and data ingest via scraping and labeling, amongst numerous other considerations. Toward that end, joint with colleagues at CMU, IBM, Illinois, Maryland, and TTIC, our Chief Scientist John Dickerson hosted the Workshop on Dataset Curation and Security, which brought together researchers from the adversarial ML and fairness in ML communities, as well as policy wonks from the Brookings Institution and other “tech-adjacent” bodies, for a full day of discussion of what it means to claim, and what it might take to improve, the “robustness” of machine learning models. In short, it is hard to make statements about model behavior with respect to, for example, fairness without also deeply considering other dimensions such as security and the legal landscape. Our VP of Responsible AI, Liz O’Sullivan, also gave an invited talk at this workshop on the dangers of scraping data from the Internet, focusing primarily on how this can take agency from unknowing humans and may result in otherwise unexpected or undefined behavior. Similar concerns and sentiments were raised at other NeurIPS workshops, in papers, and in the aforementioned invited talk by Charles Isbell; we expect this trend to continue in the coming months and years.

Beyond fairness: bigger-picture views of algorithmic (in)justice.

It's been clear for a while now that it is not enough for algorithms simply to be "fair": there are several related topics of technical interest, as well as broader considerations to weigh when analyzing the impact of algorithms in the world. The ML-RSA workshop has several works in this category, such as Arthur's own Counterfactual Explanations for Machine Learning, as well as A Survey of Algorithmic Recourse. ML-RSA also has some more critical work, such as Data and its (dis)contents: a survey of dataset development and use in ML research. Finally, it behooves everyone working in fair ML to take a look at the work from the Resistance AI workshop, which is full of thought-provoking work, both technical and non-technical, that questions the way power is arranged and rearranged by ML systems.