You’re on a team of data scientists who have been hired by a major bank to develop artificial intelligence (AI) algorithms that will help the bank identify and recruit talented new employees. You’re tasked with creating a machine-learning model to analyze applicants’ resumes and predict their suitability for the job. The application process is online, and the bank has indicated that gender and race are not factors to be considered in making hiring decisions.
What you may not realize is that the data you’re using to train your machine-learning model may already be biased against women and people of color. It’s not because the data was intentionally collected with the goal of discriminating against these groups. It’s because data often reflects existing biases in society. If you don’t take these biases into account when you’re developing your machine-learning model, your algorithm may end up perpetuating discrimination.
Inclusive design is a way of developing technology that takes into account the needs of people with a wide range of abilities, experiences, and perspectives. When it comes to AI, inclusive design means creating algorithms that work for everyone, regardless of gender, race, or other factors.
Caroline Criado Perez is an award-winning feminist campaigner and writer whose work focuses on the impact of gender data bias on society and the economy. In her book Invisible Women: Data Bias in a World Designed for Men, she documents how gender bias in data has led to a world designed for men, from the fit of motorcycle helmets to the design of public transportation. She also discusses how these biases can be addressed through inclusive design.
I recently interviewed Criado Perez about how data bias can lead to discriminatory AI and what data scientists can do to create more inclusive algorithms. Below is an edited transcript of our conversation.
How can data bias lead to discriminatory AI?
Caroline Criado Perez: All data contains some form of bias. There’s always going to be a bias in data, because data doesn’t exist outside of society. When we collect data, we’re necessarily going to reflect whatever bias is inherent in the society in which we’re collecting that data.
It’s important to be aware of these biases and to try to remove them. But they’re really hard to remove completely. They end up creeping back into our data, even if that wasn’t our intention.
These biases can influence the training data used to develop machine-learning models. For example, if the data used to train a machine-learning model for analyzing resumes is biased against women, the model may learn to discriminate against women. Once the model is deployed, it could prevent women from getting jobs they’re qualified for, or it could disproportionately select men for job interviews.
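To make that concrete, here is one way a team might audit a deployed screening model for exactly that failure, by comparing interview-selection rates across groups. This is an editorial sketch rather than anything Criado Perez prescribes; the column names, figures, and the 0.8 rule-of-thumb threshold are illustrative assumptions.

```python
import pandas as pd

# Hypothetical audit data: one row per applicant, with the model's
# interview recommendation (1 = selected) and self-reported gender.
results = pd.DataFrame({
    "gender":   ["F", "F", "F", "F", "M", "M", "M", "M"],
    "selected": [0,   1,   0,   0,   1,   1,   0,   1],
})

# Selection rate per group: the share of each group the model recommends.
rates = results.groupby("gender")["selected"].mean()
print(rates)  # F: 0.25, M: 0.75

# Disparate-impact ratio: lowest group rate divided by highest.
# Values well below 1.0 indicate the model disproportionately selects
# one group (0.8 is a commonly cited rule-of-thumb floor).
print(f"disparate-impact ratio: {rates.min() / rates.max():.2f}")
```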
This is why it’s so important for data scientists to think about inclusive design when they’re developing machine-learning models. Inclusive design means creating products and services that work for everyone, regardless of gender, race, or other factors. It’s important to consider the needs of all potential users when you’re designing a machine-learning model.
You can’t just create a model and assume that it’s going to work for everyone. You need to think about whether the data you’re using to train the model is representative of everyone who might use the model. If it’s not, the model may not work equally well for everyone.
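One crude but useful starting point is simply to compare group proportions in the training data against the population the model will serve. A minimal sketch, with made-up figures:

```python
import pandas as pd

# Hypothetical training set, heavily skewed towards men.
train = pd.DataFrame({"gender": ["M"] * 700 + ["F"] * 300})

# Assumed reference proportions for the population the model will serve.
reference = {"M": 0.49, "F": 0.51}

observed = train["gender"].value_counts(normalize=True)
for group, expected in reference.items():
    actual = observed.get(group, 0.0)
    print(f"{group}: {actual:.2f} in training data vs {expected:.2f} "
          f"in the population (gap {actual - expected:+.2f})")
```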
How can data bias creep into data sets?
Criado Perez: Often, data bias sneaks into data sets in ways that we’re not aware of. For example, if you’re collecting data about people’s incomes, you may find that the data is biased against women. This is because women are more likely to work part-time or in low-paid jobs. They’re also more likely to take time off to care for children or other family members. As a result, their incomes tend to be lower than men’s.
This data bias can have a big impact on machine-learning models. If you’re using income data to train a machine-learning model that predicts people’s creditworthiness, the model may learn to discriminate against women, because it will associate low incomes with poor creditworthiness and women’s incomes are, on average, lower. That can happen even if gender is never included as a feature.
It’s important to be aware of these types of biases when you’re collecting data. But it’s also important to think about them when you’re using data to train machine-learning models. Data bias can be hard to spot, but it can have a big impact on the accuracy of your models.
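The sketch below illustrates how such a bias can hide in plain sight: even when gender is removed from the model’s inputs, a feature like income still carries it. The data is synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data in which women earn less on average, reflecting the
# part-time work and caring-responsibility gap described above.
gender = rng.choice(["F", "M"], size=n)
income = np.where(gender == "F",
                  rng.normal(32_000, 8_000, n),
                  rng.normal(42_000, 8_000, n))
df = pd.DataFrame({"gender": gender, "income": income})

# Even if the gender column is dropped before training, income still
# encodes it: the average incomes of the two groups differ sharply.
print(df.groupby("gender")["income"].mean())

# A credit model trained to reward high income will therefore score
# women lower on average without ever seeing the gender column.
```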
What are some ways data scientists can create more inclusive machine-learning models?
Criado Perez: One way to create more inclusive machine-learning models is to use data augmentation. Data augmentation is a technique for artificially increasing the size of a data set. It’s often used to improve the performance of machine-learning models.
Data augmentation can also be used to make a data set more representative of the population. For example, if you’re training a machine-learning model to predict people’s creditworthiness, you could use data augmentation to increase the number of women in the data set, so that the model learns from data that better reflects the people it will actually be used on.
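As a simplified illustration of that idea, the sketch below rebalances a skewed data set by oversampling the under-represented group with scikit-learn’s resample utility; a real project might instead generate synthetic records, for example with SMOTE from the imbalanced-learn library. The column names and figures are hypothetical.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical, skewed training data: 800 men, 200 women.
df = pd.DataFrame({
    "gender": ["M"] * 800 + ["F"] * 200,
    "income": [42_000] * 800 + [32_000] * 200,
    "creditworthy": ([1, 0] * 400) + ([1, 0] * 100),
})

men = df[df["gender"] == "M"]
women = df[df["gender"] == "F"]

# Oversample the under-represented group with replacement until the
# two groups are the same size, then shuffle the combined data set.
women_upsampled = resample(women, replace=True,
                           n_samples=len(men), random_state=42)
balanced = pd.concat([men, women_upsampled]).sample(frac=1, random_state=42)

print(balanced["gender"].value_counts())  # M: 800, F: 800
```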
Another way to create more inclusive machine-learning models is to use transfer learning. Transfer learning is a technique in which a model is first trained on a large, related data set and then adapted, or fine-tuned, to the data set it will actually be deployed on.
For example, if you’re developing a machine-learning model to predict people’s creditworthiness, you may not have access to a large data set that is representative of the population. But you may be able to find a bigger data set that is similar to the population you’re interested in. You can pre-train your model on that data set and then fine-tune it on the representative data you do have.
Used this way, transfer learning means the model isn’t learning only from a small or skewed data set: much of what it knows comes from the larger, related data, which helps it work better for the population you actually care about.
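A minimal sketch of that workflow, written in PyTorch with random placeholder tensors standing in for the source and target data sets: the model is pre-trained on the larger related set, then the early layers are frozen and only the final layer is fine-tuned on the smaller, more representative one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small network for (hypothetical) tabular credit features.
model = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1),
)

def train(model, X, y, epochs=50, lr=1e-3):
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(1), y)
        loss.backward()
        opt.step()

# 1. Pre-train on a large, related "source" data set (random
#    placeholders here; a real project would load actual data).
X_source = torch.randn(5_000, 10)
y_source = torch.randint(0, 2, (5_000,)).float()
train(model, X_source, y_source)

# 2. Freeze the early layers and fine-tune only the final layer on the
#    smaller data set that represents the target population.
for layer in list(model.children())[:-1]:
    for p in layer.parameters():
        p.requires_grad = False

X_target = torch.randn(300, 10)
y_target = torch.randint(0, 2, (300,)).float()
train(model, X_target, y_target)
```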
Finally, data scientists can use pre-processing techniques to make training data more representative of the population. Pre-processing means transforming a data set before training, for example by reweighting or rebalancing its records, so that it’s better suited to the machine-learning algorithm.
For example, if you’re training a machine-learning model to predict people’s creditworthiness, you might reweight or resample the data so that women are properly represented before the model ever sees it.
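One simple instance of such a pre-processing step is reweighting: give each record a weight inversely proportional to its group’s share, so that both groups contribute equally during training. Libraries such as IBM’s AIF360 ship more formal versions of this idea, but the sketch below, with hypothetical column names and synthetic labels, shows the core of it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical training data, skewed towards men.
df = pd.DataFrame({
    "gender": ["M"] * 800 + ["F"] * 200,
    "income": np.concatenate([rng.normal(42_000, 8_000, 800),
                              rng.normal(32_000, 8_000, 200)]),
    "creditworthy": rng.integers(0, 2, 1_000),
})

# Weight each row by the inverse of its group's share, so both groups
# carry equal total weight even though their sizes differ.
group_share = df["gender"].map(df["gender"].value_counts(normalize=True))
weights = 1.0 / group_share

clf = LogisticRegression()
clf.fit(df[["income"]], df["creditworthy"], sample_weight=weights)
```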
What tools have you developed to help data scientists create more inclusive machine-learning models?
Criado Perez: I’ve developed two tools to help data scientists create more inclusive machine-learning models. The first tool is called the Gender Bias Testing Toolkit. The toolkit is a collection of resources that data scientists can use to test for gender bias in their machine-learning models.
The second tool is called the Data Augmentation Toolkit. The toolkit is a collection of resources that data scientists can use to artificially increase the number of women in their data sets.
Both of these toolkits are available on my website.