Who is Accountable for Data Bias?

Accountability for the misuse of data is one of the big questions in using data science and machine learning (ML) to advance society. Are the data collectors, model builders, or users ultimately accountable?

The benefits of data sharing are widely recognized by the scientific community, but headlines also appear about models released with known bias or without any impact monitoring and reporting in place. Examples include “Florida scientist says she was fired for not manipulating COVID-19 Data” and “Google Researcher Says She Was Fired Over Paper Highlighting Bias in A.I.,” the latter after a paper by Timnit Gebru highlighting the risks of large language models was accepted.

Organizations such as the World Health Organization (WHO) have pages of policies documenting how data were collected, their limitations, and the restrictions on their use. At the same time, whistleblowers and researchers alike are pushing back, attempting to hold companies and states accountable for their misuse of data.

While there is no clear answer, the question of accountability at multiple levels can be explored, as well as how to begin implementing systems of accountability now instead of waiting for regulations to provide guidance.

What is Accountability?

Merriam-Webster defines accountability as “the quality or state of being accountable, especially an obligation or willingness to accept responsibility or to account for one’s actions.” Accountable is defined as “subject to giving an account: answerable”; “capable of being explained: explainable.”

Systems of accountability can be found in every discipline today, including data science and machine learning. Sometimes these systems exist because of obligations to comply with laws and regulations, but ultimately it is other people who will hold providers of data accountable. In other cases, individuals, companies, and society are choosing to accept responsibility in the absence of clear obligation.

In many large companies, for example, it is now common practice to have an explicit and public code of ethics and accountability—see IBM’s Design for AI Accountability, Microsoft’s Approach to Responsible AI, etc.

The good news is that data are inspectable, and the impact of data science and machine learning algorithms can be measured. The bad news is that a company can claim to be accountable without each person in it knowing what they are accountable for and holding one another accountable, a practice called ethics-washing that is covered in a recent Harvard Business Review article, “The ethical dilemma at the heart of big tech companies.”

Ultimately, accountability is more than measurement and inspection; it is also holding those who are accountable responsible for their actions and impact.

Who is Accountable?

The short answer to the question of who is accountable is “everyone.” The long answer is, it depends on what you are contributing. ML models do not exist in isolation. They exist in the context of the changing world. Researchers and creators of data sets must understand what component they are individually accountable for and how far that accountability extends.

To do this, the continuous cycle of a machine learning model can be broken into four phases: data collection, feature engineering, model creation, and impact. It would be overly simplistic to assume that all of these phases are under an individual’s personal control forever. Instead, look at each phase as its own independent system, each handled by different people, or built by you and handed off for someone else to use and maintain.

There is no beginning or end to the continuous cycle; it is a circle. For each phase in the circular system, both the previous and next phase must be considered. Each individual is accountable for the impact of their phase and will have to hold the previous phase accountable for their representations. If anyone skips a step in reinforcing accountability, they put themselves at great risk when handing off the product and making representations about it.

Data Collection

In most ML systems, the first phase is collecting and cleaning data. To create or maintain a system of accountability requires providing, in a human-readable way, detailed records of the data collection design and decision-making processes that can be reviewed in all later phases. Then, you will have to look at the world and see whether the data you are collecting are still valid and relevant, and whether the way you have been collecting the data is still ethical. (For more about ethical data collection, see the Towards Data Science blog post “The ethics of Data Collection.”) Finally, you must understand the laws, regulations, and guidelines that may restrict the collection, retention, sharing, and licensing of the data.
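
What might such human-readable records look like in practice? Here is a minimal sketch in Python; the schema, field names, and values are illustrative assumptions, not a standard or a prescribed format.

```python
# A minimal, hypothetical "datasheet" for a collected data set. The schema
# and field names are illustrative assumptions, not a standard.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataCollectionRecord:
    name: str                    # name of the data set
    collected_by: str            # team or person accountable for collection
    collection_method: str       # how the data were gathered
    collection_period: str       # when the data were gathered
    consent_and_license: str     # permissions and restrictions on use
    known_limitations: List[str] = field(default_factory=list)
    refresh_by: str = "unknown"  # when the data should be refreshed or retired


record = DataCollectionRecord(
    name="symptom_survey_v1",
    collected_by="Data collection team",
    collection_method="Voluntary online survey, self-reported answers",
    collection_period="2021-01 through 2021-03",
    consent_and_license="Aggregate research use only; no resale",
    known_limitations=[
        "Self-reported answers are subject to reporting bias",
        "Survey wording may bias responses toward certain options",
    ],
    refresh_by="2022-01",
)

print(record)  # hand this record off, in readable form, along with the data
```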

Consider that even the collection of data can be harmful if the information is used beyond the permission for use. A good example of this is ethical health research: Providing personally identifiable data beyond where a patient has given explicit permission risks disclosure to an employer, insurer, or family member. Such a breach can result in stigma, embarrassment, and discrimination. Someone who sells such data, or has the data breached while in their possession, is accountable to the people who are affected.

Since all four phases are in a cycle over time and feed into each other, there is always a previous and next phase, even though this may be the first phase in an ML system. In this case, the previous phase was Impact. Because the world changes over time, the data often will, too: they may have an expiration date or require refreshing.

You are collecting data to engineer features, serve models, and create more impact, but you do not know what your data will be used to create. The only way to restrict the use of data is through another system of accountability, such as licensing or privacy laws.

Moreover, the moment that data are collected, they are imperfect: They are biased in some way. The questions asked in a survey, for example, can create bias toward some answers, and responses are biased when respondents are self-reporting measurements. Even static data that are believed to be facts can change, as shown in concrete examples at Max Boyd’s blog Your customers are changing, your data should too.

As the world changes, data also begin to go stale. When it is time to hand things off, be sure to provide the documentation with the data.

Feature Engineering

As this process continues, begin to imagine a genetic test report such as 23andme’s ancestry lineage graphs. Starting from a single data point, yourself, a researcher can connect more data points, such as biological ancestors, to get even more historical information and become more certain about that person’s ancestry and future health scenarios. Connecting with friends and partners generates even more data to predict traits and potential consequences of combining DNA.

The second phase in a machine learning system is feature engineering. Someone who is not a professional data scientist may wonder what this phase contributes to the ML system. Briefly, feature engineering can be used to make ML models more accurate and precise. It is the process of using domain expertise to turn raw data into information-rich variables.

A person’s gene sequence, for example, is just data. When it is combined with other gene sequences, features such as chromosome pairs can be extracted that may point to various health concerns. Once sequenced, additional features can be derived over time: 10,000 customers may answer a survey about COVID-19 and then take a blood test that allows for more features to be computed.
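
As a toy illustration of this idea (with entirely made-up column names, values, and thresholds), the sketch below turns raw survey and lab measurements into more informative variables:

```python
# A toy example of feature engineering: turning raw survey and lab
# measurements into information-rich variables. All column names,
# values, and thresholds here are made up.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "reported_symptom_days": [2, 0, 5],   # raw survey answer
    "antibody_level": [0.1, 1.4, 0.7],    # raw blood-test measurement
    "birth_year": [1980, 1995, 1960],
})

features = pd.DataFrame({
    "customer_id": raw["customer_id"],
    # Domain knowledge turns raw values into more informative variables.
    "is_symptomatic": raw["reported_symptom_days"] > 0,
    "antibody_positive": raw["antibody_level"] >= 1.0,  # assumed threshold
    "age": 2021 - raw["birth_year"],
})

print(features)
```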

Accountability in this phase is similar: provide detailed, human-readable records of the feature engineering design and decision-making process that can be reviewed in later phases. Remember to hold the previous phase accountable: you are engineering features from data that came with limitations in the handoff.

In the case of 23andme, there can be customer-level permissions for how data are allowed to be used, and users must respect these limitations. It may be permissible to use only some of a customer’s data in aggregate to benefit other users, while all of their data can be used in their own personal predictions. To hold the previous phase accountable, the person-level data that are collected must be labeled in a way that makes it possible to require aggregation on certain features.

Once you begin the feature engineering phase, it is tempting to think that you know where your features will be used. However, this is often not the case. To make it personal, with your DNA and 23andme in mind, as you layer on data-sharing with external researchers, your friends and family, and even the community, what additional meta-data about your features will you have to record and pass along to the feature store and models that are being trained on your features?

This process is common practice in data science through code sharing, feature stores, and other systems that let teams share features outside the context for which they were originally designed.

Features are a combination of the data and the feature formulas, so anywhere the data go, the documentation or meta-data have to be easily available and inspectable in later phases. This means that, at a minimum, data-level restrictions must be documented at the feature level. In addition, because time passes after the data are collected, the relevancy of data and features has a time window. Be sure to document any assumptions about time, as well as context, in the handoff to the next phase.
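
As a rough sketch of what such a handoff record might contain (the field names, values, and enforcement checks are assumptions for illustration, not a standard):

```python
# Hypothetical metadata carried with a shared feature so that data-level
# restrictions and time assumptions survive the handoff to later phases.
import datetime

feature_metadata = {
    "feature_name": "antibody_positive",
    "derived_from": ["blood_test_v2"],                     # upstream data sets
    "data_restrictions": ["aggregate_only", "no_resale"],  # inherited from the data phase
    "valid_from": "2021-01-01",
    "valid_until": "2021-12-31",                           # assumed relevancy window
    "context": "Derived for a COVID-19 survey cohort; threshold assumed at 1.0",
    "owner": "feature-engineering-team@example.com",
}

# A later phase can check the documented window before training on the feature...
today = datetime.date.today().isoformat()
if not (feature_metadata["valid_from"] <= today <= feature_metadata["valid_until"]):
    print("Warning: feature is outside its documented relevancy window")

# ...and refuse to use aggregate-only features at the individual level.
if "aggregate_only" in feature_metadata["data_restrictions"]:
    print("This feature may only be used in sufficiently large aggregates")
```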

Model Training, Validation, and Production

The next phase is model training and serving.

The model phase will need to check the records of all previous phases to be sure that the data and features can and should be used to build a model to make its intended impact. That means the person building the model can do an ethical check on the behavior the model is trying to promote, because the model will directly affect or influence people’s behavior.

One way to do this ethical check is the behavioral science method detailed by Matt Wallaert in Start at the End: If the benefits do not outweigh the costs or the motivations and goals are misaligned, then the use of the model is not ethical.

The model maker’s accountability then adds licensing limitations and transparency about how the model can and should be used, in addition to traditional measurements such as performance and accuracy. A model maker can put one more accountability system in place: continuous measurement of the actual impact on the intended outcome, as well as of unintended impacts, especially on protected classes.
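
One possible shape for that continuous measurement, sketched here with made-up groups and logged events, is to track the intended outcome metric alongside the same metric broken out by protected group:

```python
# A minimal sketch of continuous impact measurement: tracking the intended
# outcome metric (overall accuracy) alongside the same metric broken out by
# protected group. Groups and logged events here are made up.
from collections import defaultdict

# Each logged event: (protected_group, model_prediction, actual_outcome).
production_log = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 0),
    ("group_b", 1, 1), ("group_b", 0, 1), ("group_b", 0, 1),
]

correct = defaultdict(int)
total = defaultdict(int)
for group, predicted, actual in production_log:
    total[group] += 1
    correct[group] += int(predicted == actual)

overall = sum(correct.values()) / sum(total.values())
print(f"Intended outcome (overall accuracy): {overall:.2f}")

for group in total:
    print(f"  accuracy for {group}: {correct[group] / total[group]:.2f}")
# A persistent gap between groups is a signal of unintended impact to investigate.
```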

Impact

The last phase of ML is impact. Often the impact will be behavior change, but not always. This phase is separate from the model-building phase because, more often than not, the consumers of ML are not the builders, and users may not be the people ultimately affected by the model. For example, businesses buy products containing one or more ML models to make hiring decisions, but the people who are weeded out by the hiring software are also affected.

The user of an ML product is accountable for validating all previous phases of accountability and ensuring that the ML product is used in accordance with its limitations. In addition, they are accountable for any biases that the model propagates while they are using it. The hiring example makes this clear: if you are paying for a product that is introducing bias into your hiring process, it is your responsibility to measure that bias and correct for it.
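
A minimal sketch of what that measurement might look like, using made-up applicant counts and the common four-fifths rule of thumb as an alert threshold (not a legal test):

```python
# A minimal sketch of measuring bias introduced by a hiring product:
# comparing selection rates across applicant groups. The counts are made up,
# and the four-fifths threshold is a rule of thumb, not a legal test.
applicants = {"group_a": 200, "group_b": 180}  # hypothetical applicant counts
advanced = {"group_a": 60, "group_b": 27}      # advanced past the ML screen

selection_rates = {g: advanced[g] / applicants[g] for g in applicants}
highest = max(selection_rates.values())

for group, rate in selection_rates.items():
    impact_ratio = rate / highest
    flag = "REVIEW" if impact_ratio < 0.8 else "ok"
    print(f"{group}: selection rate {rate:.2f}, impact ratio {impact_ratio:.2f} [{flag}]")
```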

The cycle then continues. Sometimes these are closed-loop systems, where data are collected on the use of the products and fed back in to learn more. If this is the case, the user is accountable for telling the data collector when they are using the product incorrectly; otherwise, their data will skew the next training set.

Conclusion

Now that you know how to track accountability through each phase of the ML process, it is time to consider what you are accountable for, to whom you are accountable, and how far your accountability extends.

Do you have everything you need to be successful? Remember that you cannot be accountable without inspecting and reviewing the accounting of your actions. Will you start holding previous phases or others accountable before you start building on top of them and making warranties about your work? How do you have to change your processes or systems to get what you need? Enough individuals taking on accountability will create a ripple effect, gradually creating larger systems of accountability, without having to wait for regulations to create obligation-based accountability.

Case Study: Generating Language with ML Models

Generative Pre-trained Transformer 3 (GPT-3) is an autoregressive language model that uses deep learning to create human-like text. OpenAI researchers released it via an Application Programming Interface (API); anyone can request access to integrate the API into products, develop new applications, and explore the strengths and limits of the model.

Recently, it has been demonstrated that using a pre-trained general model, like GPT-3, and fine-tuning it on a specific task results in substantial gains in performance. People are already applying this machine learning model to many applications: semantic search, chat, customer service, generation, productivity tools, content comprehension, and more.

GPT-2, the previous version of this language-generation model, was open-sourced and used by many people in their applications, but GPT-3 is far larger, with roughly 10 times more parameters than any previous non-sparse language model. Since these models are a type of machine learning called deep learning, exactly what the model has learned cannot be directly inspected. The model contains 175 billion parameters, the values learned in the training process to generate language, but it certainly does not learn grammar the way children are taught in school.

These models were trained on an unlabeled text data set, learning to predict each word from the words that precede it, so no human labeling of the text was required. The astonishing part is that once the model is trained, supposedly only about 10 new training examples are needed for it to perform a specific task, such as writing an article that appears to come from a respected reporter, without humans being able to tell the difference. Imagine the harm of deep-fake videos delivered by a chatbot talking like a human to serve up fake content.
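
To make the “few examples” idea concrete, here is a rough sketch of few-shot prompting: instead of retraining the model, a handful of examples is placed directly in the prompt. The task, examples, and formatting below are invented, and the actual API call is omitted because request formats vary.

```python
# A rough illustration of "few-shot" prompting: a handful of labeled examples
# is placed directly in the prompt instead of retraining the model. The task
# and examples are invented; the API call itself is omitted.
examples = [
    ("The product arrived broken and late.", "negative"),
    ("Absolutely love it, works perfectly.", "positive"),
    ("It does the job, nothing special.", "neutral"),
]

new_text = "The battery died after two days."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {new_text}\nSentiment:"

print(prompt)  # this string would be sent to the language model's completion endpoint
```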

Consider accountability from three perspectives: the data collector, the OpenAI researchers, and a data scientist who is using their API but does not work for OpenAI.

The OpenAI researchers use the data from Common Crawl, which provides a corpus for collaborative research, analysis, and education. It is an open data set containing petabytes of data collected since 2008 from raw web page data, extracted metadata, and text extractions. The creators of this data set established a system of accountability by publishing terms of service that restrict the licensing and use of their data.

Using this data set means that the OpenAI researchers had to agree to the well-documented terms of service, which prohibit them from breaking the law or doing anything illegal with the data. Prohibited activities include engaging in abusive, harassing, hateful, or otherwise offensive behavior; invading other people’s privacy; harming minors; violating other people’s rights; disguising your identity; and harvesting personally identifiable information. The researchers also cannot transfer their license to use the data to someone else, which means that Common Crawl has to know who is using its data so it can enforce the terms.

The researchers took a dependency on the data set and completed the next three phases of the process: feature engineering, model creation, and impact measurement. This process was documented in a 72-page manuscript titled “Language Models are Few-Shot Learners” and released to the world with the API. The document discusses broader societal impacts, including a section about fairness, bias, and representation, focusing on biases related to gender, race, and religion. Some sample results relevant to using the API include the following (a rough sketch of how such associations might be probed appears after the list):

  • 83% of 388 occupations tested were more likely to be associated with a male identifier.
  • Women were more associated with appearance-oriented words like “beautiful” and “gorgeous.”
  • “Black” had a consistently low sentiment.
  • Words such as “violent,” “terrorism,” and “terrorists” were associated with Islam at a higher rate than other religions.
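
In the spirit of the analysis reported in the paper, a probe of occupation-gender associations might be built from prompt templates along the lines of the sketch below. The complete() function is a stand-in that returns canned text so the example runs without API access; a real probe would query the model across many occupations and sample many completions.

```python
# A rough sketch of probing occupation-gender associations with prompt
# templates. The complete() function is a stand-in returning canned text so
# the sketch runs without API access; a real probe would query the model
# across many occupations and sample many completions.
def complete(prompt: str) -> str:
    canned = {"The detective was a": " man", "The nurse was a": " woman"}
    return canned.get(prompt, " person")


occupations = ["detective", "nurse"]
male_markers = {"man", "male", "he"}
female_markers = {"woman", "female", "she"}

counts = {"male": 0, "female": 0, "other": 0}
for occupation in occupations:
    completion = complete(f"The {occupation} was a").strip().lower()
    if completion in male_markers:
        counts["male"] += 1
    elif completion in female_markers:
        counts["female"] += 1
    else:
        counts["other"] += 1

print(counts)  # a strong skew toward one identifier across occupations signals bias
```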

OpenAI also has committed to terminating API access for obviously harmful use cases, such as harassment, spam, radicalization, or astroturfing (“the deceptive practice of presenting an orchestrated marketing or public relations campaign in the guise of unsolicited comments from members of the public,” per Oxford Languages via Google). They also recognize that it would be impossible to anticipate all of the possible consequences of the technology being openly available, so it is still in a private beta rather than generally available to the public. They restrict use based on licenses and terms of service, and follow through on these commitments by terminating access. OpenAI’s blog explains how they are researching safety-relevant aspects of language technology.

Part of what makes an API more controllable is that the code itself is not released to users, allowing OpenAI to intervene when terms are violated. A data scientist applying for access to this API has to describe their use of the data and join a waitlist. Assuming the request is approved, they have to sign and abide by the terms, so they are accountable to OpenAI for non-harmful use cases. But what other accountability might they have?

Assume the data scientist wants to build a model that predicts whether someone is likely to have COVID based on the description of their symptoms. This involves data collection and model creation. This work may seem unlikely to be affected by the biases discussed above, but there is no way for the data scientist to know the impact on people of different genders, races, or religions until that impact has been measured and a way of monitoring it is in place.

This is partly because GPT-3 was trained on biased data that reflect the biases of society and the imperfect nature of language, but also because the API uses deep learning methods that are constantly updated.

Depending on the way the data scientist releases this model, they are also subject to the laws and regulations for giving medical advice in the countries where the model is available. Before the data scientist releases an application, they will have to set up a way to prove to OpenAI that their use case is not biased. Otherwise, they face the risk of being shut down by OpenAI or landing in legal trouble and facing penalties for giving medical advice or treatment without having an active professional license.
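
What might that ongoing monitoring look like for the hypothetical symptom model? One sketch, with made-up audit data and demographic groups collected only with consent, compares false-negative rates across groups, since missing a likely case is the costly error here:

```python
# A minimal sketch of impact monitoring for the hypothetical symptom model:
# comparing false-negative rates across self-reported demographic groups
# (collected only with consent). All audit data here are made up.
from collections import defaultdict

# Each record: (group, model_said_likely_covid, confirmed_covid)
audit_log = [
    ("group_a", False, True), ("group_a", True, True), ("group_a", True, True),
    ("group_b", False, True), ("group_b", False, True), ("group_b", True, True),
]

missed = defaultdict(int)
confirmed = defaultdict(int)
for group, predicted, actual in audit_log:
    if actual:
        confirmed[group] += 1
        if not predicted:
            missed[group] += 1

for group in confirmed:
    fnr = missed[group] / confirmed[group]
    print(f"{group}: false-negative rate {fnr:.2f} over {confirmed[group]} confirmed cases")
# A materially higher miss rate for one group is measurable evidence of biased impact.
```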

To succeed, this entire system of accountability relies on individuals to understand and communicate what they are accountable for, to whom they are accountable, and how far their accountability extends.

About the Author

Charna Parkey is a data science lead at Kaskada, where she works on the company’s product team to deliver a commercially available data platform for machine learning. She is passionate about using data science to combat systemic oppression, and has more than 15 years’ experience in enterprise data science and adaptive algorithms in the defense and startup tech sectors. She earned her PhD in electrical engineering at the University of Central Florida and has worked with dozens of Fortune 500 companies.
