Recommendation Engines Need Fairness Too!
July 16, 2020
As we turn to digital sources for news and entertainment, recommendation engines are increasingly influencing the daily experience of life, especially in a world where folks are encouraged to stay indoors. These systems are not just responsible for suggesting what we read or watch for fun, but also for doling out news and political content, and for surfacing potential connections with other people online. When we talk about bias in AI systems, we often read about unintentional discrimination in ways that apply only to simple binary classifiers (e.g. in the question “Should we let this prisoner out on parole?”, there are only two potential predictions yes, or no). Thinking about mitigating bias in recommendation engines is much more complex. In this post, we’ll briefly describe how these systems work, then surface some examples of how they can go wrong, before offering suggestions on how to detect bias and improve your users’ experience online, in a fair and thoughtful way.
Part 1 The Anatomy of a Recommender System
If you’re coming to this article as someone who regularly builds or works on recommender systems, feel free to skip this part. For those of you needing a refresher or primer on the topic, read on!
Recommender engines help companies predict what they think you’ll like to see. For Netflix, YouTube and other content providers, this might happen in the format of choosing which video cues next in auto-play. For a retailer like Amazon, it could be picking which items to suggest in a promotional email. At their core, recommender systems take as input two “sides” of a problem -- users and items. In the case of Netflix, each user is an account, and each item is a movie. For Amazon, users are shoppers, and items are things you can buy. For YouTube, users are viewers, items are videos, and a third component are the users that create the content. You can imagine analogues with newspapers and other media sources such as the New York Times and the Wall Street Journal, music streaming services such as Spotify and Pandora, as well as social networking services such as Twitter and Facebook.
Users rate some items, but not all of them. For example, even if you binge watch shows on Netflix, it’s unlikely that you have rated even a small fraction of Netflix’s vast content catalogue, much less so when it comes to YouTube’s library, where over 300 hours of content are uploaded every minute. A recommender system’s goal is, given a user, find the items or items that will be of greatest interest to that user, under the assumption that most items have not been rated by most users. How is this done? By learning from other, similar items, similar users, and combinations of the two.
Recommender systems recommend content based on inductive biases. One common inductive bias is that users who seem similar in the past will continue to seem similar in the present and future. In the context of recommender systems, this means that users who have, for example, rated videos similarly on YouTube in the past will probably rate videos similarly moving forward. Recommendations based on this intuition might try to find similar users to a particular user, and similar pieces of content to a particular piece of content, and then combine learnings from those two neighborhoods into an individual score for that particular pairing of a user and item. By doing this for every user-content pair, the recommender system can “fill in all the blanks”, that is, predict a rating for each combination of user and piece of content. After that, it is simply a matter of picking the most highly-rated pieces of content for that customer, and serving those up as you might see in a sidebar on YouTube or a “view next” carousel on Amazon Shopping.
Part 2 What Could Go Wrong?
As we’ve discussed above, recommender engines attempt to “fill in the blanks” for a particular user by guessing at their level of interest in other topics when we only know how they feel about things they’ve already seen or read. Most recommender engines are a blend of “nearest neighbor” calculations and active rating elicitation, using a combination of supervised and unsupervised learning alongside deterministic rules that modify the selection process among the content that you could potentially recommend. To discuss some of the issues that often arise in recommender engine bias, we’ll look at a couple of examples from industry that illustrate the nuance and complexity involved.
One of the more common issues we see in industry can be illustrated by YouTube’s spectacularly named “Gangnam Style Problem”. The problem is this no matter what content you recommend to your user, when one looks at the potential pathways they could take from one recommendation to the next, they all lead back to whatever happens to be the most popular video that day. While this may be good news for PSY and K-pop stans worldwide, gaining traction within a recommender engine can make or break the experience for someone creating content on these platforms, where they need their content to be seen in order to survive.
More so every day, we hear about complaints from within the YouTube creator community, claiming that their channels suffer from this disparity, and that YouTube is biased against emerging artists. Thinking this through from a business perspective, it’s easy to see why this might be the case YouTube wants to keep users on the page, and they’re more likely to do that if they can show you content that they know you’ll actually like. In fact, the less YouTube knows about how users will interact with your particular brand of content, the more risky it becomes to promote it.