There is a wide consensus among the AI community that bias is undesirable and that it may lead to very harmful consequences for users. So much has been written on building fair, transparent, and explainable AI. Then why do we keep hearing news of racial and gender bias in AI over and over again? What are we doing wrong?
In the case of tabular data, bias may be easier to spot and handle by using “blind taste tests” and by excluding variables which may lead to biased outcomes. However, when it comes to unstructured data such as images and video, current supervised learning approaches rely on patterns and annotations where bias is much more difficult to eradicate.
Bias in computer vision is a hard problem to solve also because AI, like its human creators, may be led to judge by appearance. When it comes to protected attributes such as race and gender, visual appearances may end up being tied to harmful assumptions and stereotypes, especially in controversial applications of AI such as surveillance and profiling.
Biased computer vision systems have produced no shortage of blunders. Take for example the case of Google Photos tagging human faces as “gorillas” or Flickr tagging them as “ape”, a result of the lack of ethnic balance in the data behind face detection models.
Another famous case is the saliency detection model used by Twitter, which prioritized white faces over black faces when deciding how photos were cropped. Zoom also faced backlash when its virtual background feature failed to recognize a black professor’s head and automatically erased it.
Models can also learn harmful and stereotypical correlations between humans and their context. Recently, AlgorithmWatch showed that Google Cloud Vision labelled an image of a dark-skinned individual holding a thermometer as “gun” while a similar image with a light-skinned individual was labelled “electronic device”.
Such biases can clearly pose a threat to individuals who are subject to AI systems. A recent paper discusses how pedestrian detection systems display higher error rates for people with darker skin tones. Another case is a governmental platform in New Zealand which rejected the photo of a man of Asian descent, erroneously stating that the “subject’s eyes are closed”.
Gender bias is equally present in AI systems. For example, in one study, researchers fed pictures of congress members to Google’s cloud image recognition service. The top labels applied to men were “official” and “businessperson” while for women they were “smile” and “chin.” On average, photos of women were tagged with 3 times more annotations related to physical appearance than photos of men.
A famous experiment applied emotion detection models to a photo of the 1927 Solvay science conference: all of the men in the photo were detected as “neutral males”, while Marie Curie was classified as an “angry woman”.
In a paper called “Women also snowboard” researchers found that captioning models frequently relied on contextual cues in order to determine whether the subject of the photo was a man or a woman. For example, the presence of a computer or sports equipment would immediately lead the AI to caption the image as “a man using a computer” even if the person was actually a woman.
Projects such as Gender Shades take an intersectional approach, analyzing how race and gender combine in computer vision. In a seminal study, they estimated that gender classification models perform with an accuracy of 99%-100% on white males, while accuracy can drop to 65% on black females.
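The kind of disaggregated evaluation that Gender Shades popularized is easy to sketch in code. The example below computes accuracy separately per subgroup; the subgroup names and records are invented for illustration, not taken from any real benchmark.

```python
# Sketch: disaggregated evaluation of a classifier across demographic subgroups.
# The subgroup names and records below are illustrative, not real benchmark data.
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, pred in records:
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}

records = [
    ("lighter_male", "male", "male"),
    ("lighter_male", "male", "male"),
    ("darker_female", "female", "male"),    # a misclassification
    ("darker_female", "female", "female"),
]
print(accuracy_by_group(records))
```

Reporting only the aggregate accuracy (here 75%) would hide the gap between the two groups, which is exactly how such disparities went unnoticed for years.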
The coded gaze
All of these examples show the “coded gaze” that has been embedded in AI systems. This is a term coined by the Algorithmic Justice League in order to expose the illusion of neutrality in automated systems and to claim that they actually reflect the priorities, preferences, and prejudices of those who create them. In essence, AI models do not see the world with mathematical detachment but rather replicate and amplify historical cultural biases coded into them by their creators and annotators.
Many data scientists know the “garbage in, garbage out” refrain that describes how deep learning models learn: they find patterns in the data they are trained on and then apply those patterns to the new data they encounter. So if your training data is “biased”, your outputs will be “biased” as well.
In fact, a common finding is “bias amplification”, meaning that if the data is biased, the outputs will be even more biased. For example, in an image captioning scenario, if 70% of images with umbrellas include a woman and 30% include a man, at test time the model might amplify this bias to 85% and 15%.
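The umbrella example can be made concrete: bias amplification is simply the gap between a correlation in the training labels and the same correlation in the model’s predictions. A minimal sketch, with the two label lists hard-coded to mirror the hypothetical 70/30 and 85/15 splits above:

```python
# Sketch: quantifying bias amplification as the gap between a correlation in
# the training labels and the same correlation in the model's predictions.
# The 70/30 and 85/15 splits mirror the hypothetical umbrella example above.

def fraction(labels, target):
    """Fraction of labels equal to `target`."""
    return sum(1 for g in labels if g == target) / len(labels)

train_labels = ["woman"] * 70 + ["man"] * 30   # 70% of umbrella images show a woman
predictions  = ["woman"] * 85 + ["man"] * 15   # the model exaggerates the correlation

train_bias = fraction(train_labels, "woman")   # 0.70
pred_bias = fraction(predictions, "woman")     # 0.85
amplification = pred_bias - train_bias         # positive: the model amplified the bias
print(f"bias amplification: {amplification:.2f}")
```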
How bias infiltrates datasets
The deep learning boom in computer vision in recent years has been prompted not only by the increased processing power for training neural networks but also by the availability of large-scale datasets for the first time, such as the iconic ImageNet. However, many of these “canonical” datasets were later found to contain plenty of harmful biases which were passed onto the AI models trained on them.
Many image repositories reflect a history of systemic underrepresentation of women and minority groups in the media and elsewhere. For example, the Labeled Faces in the Wild dataset, which was sourced through images of notable people in Yahoo! News, is estimated to contain 77.5% male and 83.5% white individuals.
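Auditing a dataset’s composition before training can catch this sort of skew early. Below is a minimal sketch that tallies the distribution of an annotation field; the metadata records and the "gender" key are hypothetical examples, not drawn from Labeled Faces in the Wild.

```python
# Sketch: auditing the demographic composition of dataset metadata.
# The records and the annotation key ("gender") are hypothetical examples.
from collections import Counter

def composition(metadata, field):
    """Return the fraction of records per value of the given annotation field."""
    counts = Counter(item[field] for item in metadata)
    n = sum(counts.values())
    return {value: count / n for value, count in counts.items()}

metadata = [
    {"gender": "male"},
    {"gender": "male"},
    {"gender": "female"},
    {"gender": "male"},
]
print(composition(metadata, "gender"))   # {'male': 0.75, 'female': 0.25}
```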
This bias is amplified by the high proportion of stock photography in image repositories and search results which is known for perpetuating stereotypes against minorities and women. This occurs either by representing them in an exaggerated or sexualized way in particular categories or by under-representing them in generic categories (such as occupations, for example).
Some particular classes might be marked by absences as well. For example, this article finds that even though white and black people appeared in “basketball” images with similar frequency in one dataset, models learned to classify images as “basketball” based on the presence of a black person. The reason was that although the data was balanced in regard to the class “basketball”, many other classes predominantly featured white people while black people were absent from them.
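A simple way to surface such absences is to cross-tabulate class labels against a demographic attribute across the whole dataset, not just within one class. A hedged sketch with invented annotations:

```python
# Sketch: cross-tabulating class labels against a demographic attribute to
# reveal absences like the "basketball" case above. Annotations are invented.
from collections import Counter

def crosstab(samples):
    """samples: iterable of (class_label, group). Returns a count per pair."""
    return Counter(samples)

samples = [
    ("basketball", "black"), ("basketball", "white"),  # balanced within the class...
    ("office", "white"), ("office", "white"),
    ("hiking", "white"), ("hiking", "white"),          # ...but absences elsewhere
]
table = crosstab(samples)
# "black" co-occurs only with "basketball", so the model can use a person's
# race as a shortcut for the class label. A Counter returns 0 for missing pairs:
print(table[("hiking", "black")], table[("basketball", "black")])
```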
Bad things can happen to good data
So, if we make a concerted effort to collect images that are representative of the global population, would that mean that our dataset is bias-free? Unfortunately not. Even if our dataset is extremely diverse, we might still end up with a biased model, and that is because of factors such as class definition and manual annotation.
Essentially, for AI models images are just pixels. Using unsupervised machine learning, we might be able to cluster them based on similarity, but we will not have any understanding of what these clusters are and what they mean. In order to give meaning to images and objects on them, we need metadata.
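To see why pixels alone carry no meaning, consider a toy k-means on flattened two-pixel “images” (a from-scratch sketch; real pipelines would use a library such as scikit-learn): the algorithm happily separates the two blobs, but nothing in its output says what either cluster actually is.

```python
# Sketch: a toy k-means on raw pixel vectors. It finds clusters, but has no
# notion of what the clusters *mean* -- that requires human-supplied metadata.
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[i].append(p)
        # Recompute each center as the mean of its group (keep old center if empty).
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups

# Four two-pixel "images" forming two obvious blobs: the 2/2 split is found,
# but nothing tells us whether cluster 0 holds faces, cats, or umbrellas.
points = [[0.1, 0.2], [0.0, 0.1], [0.9, 1.0], [1.0, 0.8]]
centers, groups = kmeans(points, k=2)
print([len(g) for g in groups])
```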
Depending on the end goal of the model, we might structure the class taxonomy in one way or another. This stage is much trickier than people usually think because it serves to define the core notions in the dataset. The whole endeavor of defining classes and labeling images has been referred to as a “form of politics, filled with questions about who gets to decide what images mean”.
Large-scale datasets like ImageNet or the Tiny Images Dataset which had the goal of “mapping the entire world of objects” were built upon the taxonomy of WordNet, a hierarchical dictionary of English nouns. However, subsequent studies found a variety of biased and offensive labels, especially in the “person” category. This was the reason why in 2020 the Tiny Images Dataset was taken down by its creators amidst revelations of the prejudiced and offensive classes it featured, such as nearly 2,000 images labeled with the N-word, harmful slurs, as well as pornographic content.
If class taxonomy is so important, how should we go about it? For example, if we are trying to build a model for race classification, what classes should we choose?
Most common datasets divide people into 4 groups: “Caucasian”/“White”, “Asian”, “African”/“Black” and “Indian” – which completely oversimplifies human diversity and is prone to misplacing people who do not fit neatly into these categories or fall between them.
Furthermore, even if a dataset seems to contain a balanced representation of the “Asian” class, that does not mean that “East Asian”, “South Asian” and “Southeast Asian” would be equally represented, or that within an “East Asian” class there would be a good balance between “Japanese”, “Korean”, and “Chinese”. And as much as we try to break down each class into more representative subclasses, there will always be some groups that are rendered invisible because there is no class assigned to them.
In order to escape such restrictive taxonomies, some researchers have proposed grouping people by skin color as a proxy for race (for example “Light”, “Medium” and “Dark-skinned”, which is arguably simplistic as well) or using the Fitzpatrick skin type classification system. The most advanced approaches discard race altogether and instead use variables such as craniofacial distances, areas, and ratios, as well as facial symmetry and facial contrast. However, these have been criticized for reverting to outdated pseudoscientific methodologies like craniometry.
Choosing a taxonomy for gender classification is equally complicated. Existing datasets have divided people into “Male” and “Female”; “Male” and “non-Male”; “Male”, “Female” and “Neutral” or “Unsure”; or even “Gender 1” and “Gender 2”, which, despite its intention, did not hide the fact that one group was male while the other was female. Perhaps the most appropriate approach so far has been to represent gender as a continuous value between 0 and 1 instead of a binary.
Nonetheless, in many of these cases, the classification is trans-exclusive and the assumption is that the person’s gender expression reflects their gender identity, which may not be true. If a person has a masculine expression, does that mean that they are male? Can gender even be captured in a visual way?
This article presents a novel approach in which a dataset was composed of public Instagram images classified into 7 genders using the hashtags that the authors of the images posted themselves. This is a great example of giving the people appearing in datasets agency for self-determination instead of having someone else label them.
To label or not to label?
All of these examples show that race and gender classification are two inherently problematic applications of computer vision, given that both notions are social constructs. They are not objective visual characteristics, and the emerging consensus in the AI community is that both of these classifications are reductionist and may cause significant distress to the people being classified.
This is the reason why in early 2020 Google switched off its AI vision service’s gender detection, saying that gender cannot be inferred from a person’s appearance; it now applies the label “person” rather than “man” or “woman” to all images.
One of the keys to handling the issue of gender and racial bias in AI is to adopt an intersectional and decolonial stance in order to center vulnerable communities who continue to bear the negative impacts of scientific progress.
If a model’s goal is to detect gender and/or race, it is worth not only considering dropping such classifications and taxonomies altogether, but also questioning and rethinking the entire purpose of the AI system. Who is developing it? Why are they developing it? What are the power imbalances? Does this model align with the interests of the people who will be subject to it? Is this model ultimately going to contribute to the newest form of oppression: the algorithmic one?
These are all hard questions and frequently AI practitioners tend to avoid them by thinking: “I am just delivering what my client is asking for”. But by raising awareness of such issues across the value chain, we are contributing to building more fair, transparent, and trustworthy AI systems for all of us.