FROM BIG DATA TO DATA JUSTICE?: Workshop #3 on BIG DATA

[The Ethics of Data Curation is the first in a two-part series of AY 2021-22 workshops organized through a Rutgers Global and NEH-supported collaboration between Critical AI@Rutgers and the Australian National University. Below is the third in a series of blogs about each workshop meeting. Click here for the workshop video and the discussion that followed.]

By Ryan Heuser (Cambridge University)

“Ethics of AI,” “ethical AI,” “data ethics”: you may have heard these phrases. They refer to a rapidly growing and enormously well-funded field of inquiry centering on questions of bias, fairness, and transparency in data science, machine learning, and (as the latter enterprise is sometimes called) artificial intelligence. As Catherine D’Ignazio and Lauren Klein document in their recent book, Data Feminism, big money is coming into data ethics initiatives from all quarters. The Rockefeller Foundation and Mastercard Impact Fund gifted $20 million to the non-profit organization DataKind, and the Stanford Institute for Human-Centered Artificial Intelligence is raising $1 billion for its research center. In stark contrast to the dearth of academic jobs in the humanities, advertisements for researchers in data ethics proliferate. In the final turn of the screw, even the tech companies whose technological practices provoke ethical questioning in the first place are hiring in “ethical AI.”

D’Ignazio and Klein cast doubt on this enterprise, even as their book to some extent participates in it. Calling it a “band-aid for a much larger problem,” they contrast ethics-centered critiques of artificial intelligence, which “locate the source of the bias in individual people and specific design decisions,” to critiques that “acknowledge structural power differentials” (60). They advocate for a shift in focus from “data ethics” to “data justice”; from “bias” to “oppression”; from “fairness” and “transparency” to “equity” and “co-liberation.” By doing so, critical data scientists can “work toward dismantling … [the] root cause of the problems that occur again and again in data and algorithms.” What is this root cause? The intersectional “matrix of domination”—laws, policies, administration, culture, experiences—within which data practices “work to uphold the undue privilege of dominant groups while unfairly oppressing minoritized groups” (61).

I begin with D’Ignazio and Klein in part because the BIG DATA workshop paired their chapter, “Collect, Analyze, Imagine, Teach” with Emily Denton and colleagues’ “On the Genealogy of Machine Learning Datasets.” Both were core readings for the third meeting of the Rutgers-ANU series on the Ethics of Data Curation. The meeting featured two presentations: Britt Paris focusing on the chapter from Data Feminism and Ella Barclay focusing on the datasets study. These commentaries set us up for a lively discussion on the meaning and implications of this turn from data ethics to data justice and, along with it, from individualist to structural analyses of data science. While nearly all participants agreed with the need for such a shift, a consensus over what that entailed, or how it might look, did not crystallize.

Indeed, the Denton et al. essay on ImageNet provided a convenient case study of how difficult the contemplated turn might be. Written by four Google employees and one researcher at the Center for Applied Data Ethics, the article criticizes that dataset’s failure to reckon with the distinct identities and personal standpoints of its annotators. “Central to ImageNet’s epistemology is the assumed existence of an underlying and universal organization of the visual world into clearly demarcated concepts,” it reads. This line of critique, as Paris pointed out, is familiar to most humanists and many besides: objectivity is an illusion, “raw data is an oxymoron” (Gitelman), all knowledge is constructed from perspectives.

Nevertheless, discussion of the article was fairly critical. Katherine Bode asked whether such “standpoint”-oriented critiques actually manage to bring us to an analysis of the structural injustices of data practices; do they, rather, remain bound to the individualist, ethical framework that D’Ignazio and Klein warn against? In a similar vein, Matthew Stone expressed disappointment with critiques of individual datasets, which miss an opportunity to identify the broader practices of machine learning which depend on producing these very datasets. Baden Pailthorpe, quoting the succinct line from Data Feminism (“Who makes maps and who gets mapped?”), wondered what happens when the people and institutions asking “who makes the maps?” (Google employees) are, in fact, those who make the maps (Google employees). Does—and here I am riffing on his question—that very paradox reveal how questions of “who” and “for whom,” while useful, may nevertheless be an insufficient basis for a structural, self-reflexive framework of data justice?

To put this more broadly, the discussion of these two readings persistently pointed to deep and seemingly unresolved tensions within the critical frameworks surrounding AI. Data ethics or data justice? Bias in datasets, or oppression in data practice? Is “inequality” in data science a quantifiable fact of dirty data riddled with unequal distributions, or a qualitative fact to do with the social relations behind data’s production? Should we recover and emphasize the individual subject positions of data work, or ought we to zoom out onto the larger industry imperatives and systemic forces which position them?

Participants at the 1956 Dartmouth Summer Research Project on Artificial Intelligence (Photo: Margaret Minsky)

If I have deliberately overstated these bifurcations, it is because these questions feel unresolved and sometimes confused in AI debates. Even as adamantly structuralist a study as Data Feminism occasionally leans toward the alternate pole of explanation. The problem is partly the difficulty of analyzing data injustice through a pervasive language of “bias” and “imbalance” that explains these phenomena as the proliferation of certain ‘perspectives’ over others. For example,

The problems of gender and racial bias in our information systems are complex, but some of their key causes are plain as day: the data that shape them, and the models designed to put those data to use, are created by small groups of people and then scaled up to users around the globe. But those small groups are not at all representative of the globe as a whole, nor even of a single city in the United States. When data teams are primarily composed of people from dominant groups, those perspectives come to exert outsized influence on the decisions being made—to the exclusion of other identities and perspectives. This is not usually intentional; it comes from the ignorance of being on top. We describe this deficiency as a privilege hazard (D’Ignazio and Klein 28).

These statements from the first chapter of Data Feminism are true and important. But as a theory of structural inequality, they are probably insufficient to explain the mechanisms reproducing the inequality they describe. Important though it is to acknowledge, document, and criticize such material realities, we should not mistake phenomena such as “privilege hazard” for the mechanisms of their own production.

To be sure, D’Ignazio and Klein are often savvy theorists of structural inequality, as when they explain their use of the term “minoritized”: if “minority” is a quantitative concept of group size, “minoritization” points to qualitative processes of oppression. Nonetheless, when they move to explaining the origin of these oppressive processes, their emphasis often shifts to group identity. “Sexism,” they say, names the oppression of all other genders by men; “racism” of other races by whites; “classism” of other classes by “wealthy and educated people.” Regrettably, such an analysis seems to imply that women are incapable of sexism, or people of color of racism—obscuring their underlying motivational framework of exploitation. The awkward inclusion of “classism” makes this problem still more apparent: the logic of oppression within class struggle is not that wealthy people discriminate against the poor, but rather that an owning class expropriates the surplus value from a working class. It is the social relations of capitalist production, rather than any abstract notion of “classism,” which ensure exploitation and its inequality.

Certainly the sociologist Charles Tilly, in his influential Durable Inequality (1998), would make that case. Explanations, Tilly writes, that “rely in the last instance on shared interests, motivations, or attitudes as the bases of inegalitarian institutions … leave mysterious the cause-effect chains by which these states actually produce the outcomes commonly attributed to them.” Instead, he argues, “durable inequality among categories arises because people who control access to value-producing resources solve pressing organizational problems by means of categorical distinctions” (7-8). For example, within an organization, one group’s exploitation and expropriation of value created by another group—key to the capitalist mode of production—presents a pressing organizational challenge: how can such organizational inequality be safely maintained against worker strikes, walkouts, and protests? Here, the social categories of gender, race, religion, citizenship, and so forth, provide an organizational solution. When an organization divides its resources unequally along the ‘internal’ categories of its own design (e.g. owners/workers, executives/secretaries, tenure-/non-tenure-track employees), that unequal structure becomes more easily reinforced and naturalized by matching its internal categories onto external ones which have independent weight and meaning (e.g. men/women, white/non-white, citizen/foreigner). “Substantial inequality in the absence of rationalizing boundaries… generates rivalry, jealousy, and individual sentiments of injustice”; but “matching interior with exterior categorical boundaries (reinforced inequality) produces a low-cost, stable solution” to these organizational challenges (78).

As a more specific example, consider secretarial labor at its height in the mid-twentieth century. We already know that firms’ division of executives/secretaries broke down strongly along gender lines; what we still want to know is what organizational solution this gender division provided. For Tilly, “employers assign female secretaries to male executives” because it “import[s] a powerful distinction and relation directly into the firm in a way that reduces the likelihood of a subordinate becoming the boss’s rival.” The close match between a firm’s internal hierarchy of executives/secretaries and the external category of men/women facilitated and naturalized the greater exploitation of secretarial labor, and thus helped to solve the larger organizational problem for the firm’s owners, namely, how to extract as much surplus value as possible from their workers without driving them to quit or strike.

Perhaps, then, we might ask a different, more Tilly-esque question of the structural inequalities within information systems. Instead of asking “who makes the maps and who gets mapped”—a question of inequality whose answer we know all too well—we might ask what organizational solution does such inequality within digital systems provide? To answer this question properly, we need to do more than point out the power imbalances, privilege hazards, and unequal distributions that the discourse of bias tends to emphasize. We need not only a rigorous account of bias, and not only a thicker and more nuanced description of data science’s pervasive exclusions, but a robust theory of how its industry and practices actively exploit certain social groups in order to expropriate and hoard the resources they create—ensuring the production and reproduction of inequality in doing so.

Tech firms’ specifically digital forms of exploitation are useful to dwell on here. In The Age of Surveillance Capitalism, Shoshana Zuboff goes some way toward outlining these forms’ current corporate landscape and financial logics. For Zuboff, companies like Facebook, Apple, Microsoft, Google, and Amazon—what in the industry is known as FAMGA, which collectively makes up 40% of total value on the NASDAQ—have discovered and perfected an entirely new method of expropriating surplus value, harvesting vast amounts of behavioral data from users in order to squeeze out profit from their ability to predict users’ advertisement clicks. As Zuboff explains, companies like Google had initially used user input data largely to improve the functionality of their services. But following the discovery—frequently credited to Amit Patel, a Stanford graduate student and later Google employee—that the company’s user data could be used as a “broad sensor of human behavior” (in Patel’s excited terms) against which predictive algorithms could be trained, such companies began to extract “more behavioral data than [they] needed to serve [their] users” (Zuboff 82). The resulting “behavioral surplus” was “the gamechanging, zero-cost asset that was diverted from service improvement toward a genuine and highly lucrative market exchange” (81). Here’s where the big in Big Data begins to take the form of a new kind of capitalism. By 2010, Google’s chief economist, Hal Varian, was writing that “Data extraction and analysis is what everyone is talking about when they talk about big data” (65). If, as Zuboff explains, data is “the raw material necessary for … [the] novel manufacturing processes” of machine learning analysis, then as a whole, data extraction “describes the social relations and material infrastructure with which the firm asserts authority over those raw materials to achieve economies of scale in its raw-material supply operations” (65). Big data means big surplus value, which is then extracted at scale through the predictive capacities of machine learning.

Returning, then, to the question provoked by Tilly (what organizational solution does inequality within digital systems provide?), we might answer as follows. Here we can pair Zuboff’s explanation of the “zero-cost asset” of behavioral surplus with Tilly’s of the “low-cost solution” of categorical divisions. In redlining, for example, the categorical distinctions of race provided lending institutions an easy, low-cost solution to problems of organizing the terms and spatial distributions of their investments: white neighborhoods received favorable loans with an aim toward secure repayment, while black neighborhoods received predatory loans with an aim toward payment entrapment. As machine learning algorithms now take over these decision-making processes, determining a person’s worthiness for loans or even parole, often unequally across racial lines, we begin to see a parallel situation. Existing categorical distinctions of race, which leave their traces behind in behavioral data surpluses, provide machine learning algorithms an effective, low-cost feature for solving their predictive tasks. Even if features like ‘race’ or ‘income’ are not directly encoded for these algorithms, the scale of the behavioral surplus available to tech giants is more than sufficient to rediscover them within proxies. As Salomé Viljoen explained in her presentation to a subsequent meeting of the Ethics of Data Curation, DATA RELATIONALITIES, companies and their algorithms have at their disposal data as fine-grained as a user’s phone charging habits, not to mention their phone operating system (Android or iOS), which correlates with income.

It is only in the context of an unprecedented, global, data mining operation that we can fully situate the minoritizing process of digital oppression. A recent paper (discussed in another recent workshop) describing the many “dangers” of large language models, for example—their environmental and financial costs, gaps in documentation and training data, opportunity costs and impactful biases—omits this arguably most fundamental danger of Big Data: that it makes possible new stages and scales of capitalist exploitation. By mining the troves of behavioral surplus data we cannot help but leave behind, algorithmic processes solve predictive tasks—deciding among advertisements, loan schemes, and parole possibilities—partly by feeding upon digital proxies for entrenched categorical divisions such as race. Perhaps the biggest danger of Big Data processes, then, is that by advancing new, digital forms of capitalist exploitation, large models enable categorical divisions to provide new kinds of digital solutions to organizational problems—thereby strengthening categorical hierarchies of power in new ways.

In the face of all these structural surpluses and exclusions, we nevertheless still need to ask ourselves: What should we do? Here is where I felt most inspired and intrigued by Data Feminism, as book and project. Examples run throughout the text of fascinating, against-the-grain digital projects which the authors call ‘counterdata’, such as the Local Lotto project that had high school students survey neighborhood residents about their experiences of the lottery and map their results. “Collecting” these counterdata projects—which address social injustices in their content and refute the profit motive behind corporate data collection in their form—provides, for the authors, the first active step toward data justice. Analyzing inequality, imagining co-liberation, and teaching across demographic inequalities form the other three steps.

At the same time—and I am sure the authors would agree here—these steps, however necessary and inspiring, may not be sufficient to address the mechanisms of exploitation that produce and maintain inequalities in data science and AI. As Ella Barclay, herself both an artist and researcher, warned during the workshop discussion, artistic and other forms of creative and counter-practices are not enough to save us from structural problems.

Though ideas have been floated for regulating Google, I believe we also need to work toward communizing Google and other digital infrastructure in order to abolish the private ownership of value extracted from collective effort and labor. If we curtail the capitalist imperative to digitize and monetize every aspect of our lives, so that amplifying social inequities no longer provides an organizational solution to the profit structure of tech companies, the structural, durable inequalities within data science will lose their grip.

