[The Ethics of Data Curation is the first in a two-part series of AY 2021-22 workshops organized through a Rutgers Global and NEH-supported collaboration between Critical AI@Rutgers and the Australian National University. Below is the first in a series of blogs about each workshop meeting. Click here for the workshop video and the discussion that followed.]
by Lauren M.E. Goodlad (English/Comparative Literature, Rutgers)
How large is a “large language model” (LLM) and what have “stochastic parrots” got to do with them? If such models are dangerous (as Emily M. Bender, Timnit Gebru, et al. argue in their much-discussed essay), why do tech companies such as Google want them so badly? At a time when Big Tech has significant power to influence academic as well as industrial research, what can be done to bend the arc toward democratic decision-making and the public interest?
These were the subjects tackled in the first meeting of the Ethics of Data Curation. Our crack facilitating team paired Katherine Bode, a researcher in data-rich literary analysis at ANU, with Matthew Stone, chair of Computer Science and a researcher in Natural Language Processing (NLP) at Rutgers. Their presentations and the discussion that followed make first-rate viewing both for those already engaged by these topics and those curious to know more.
While “On the Dangers of Stochastic Parrots: Can Large Language Models Be Too Big?🦜” should be read for its cogent arguments, the paper became famous because two of the co-authors lost their jobs as the co-leaders of Google’s ethics team in the wake of its appearance (and still other Google employees withdrew their names to avert that fate). That Google chose to suppress the intellectual freedom of its own ethics team is a topic for serious inquiry in itself—but our workshop focused on the paper.
As many reading this blog will know, what is commonly called “AI” today is usually a form of data-centric machine learning. Such technology does not derive its “intelligence” from a human-like capacity to critically reflect on the world but, rather, from the mining of huge troves of data by means of an arsenal of computational power. Through a process called “deep learning,” software architectures composed of billions of randomly generated parameters analyze datasets too large for human intelligibility and locate tentative patterns. Although the randomness of these oceans of data (which in the case of OpenAI’s GPT-3 includes most of the scrapable internet) makes these models “stochastic,” they nevertheless offer useful predictions. As large language models, GPT-3 (licensed by Microsoft) and Google’s BERT are “parrots” because they generate plausible synthetic text without benefit of any human-like understanding. As Bender, Gebru, et al. put it, a “stochastic parrot” is “a system for haphazardly stitching together sequences of linguistic forms” that have been observed in the training data “according to probabilistic information about how they combine, but without any reference to meaning” (emphasis added).
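For readers who like to see an idea in miniature, here is a toy sketch of "stitching together sequences of linguistic forms" according to probabilistic information. It is an illustration only, not how GPT-3 or BERT actually work: it uses a simple bigram table rather than a neural network, and the three-sentence corpus and the function names `build_bigram_table` and `parrot` are my own assumptions.

```python
import random
from collections import defaultdict

# Toy training text (an assumption for illustration; real models
# train on vastly larger collections of text).
CORPUS = (
    "the parrot repeats the phrase . "
    "the model repeats the pattern . "
    "the parrot sees the pattern ."
)

def build_bigram_table(text):
    """Map each word to the list of words observed directly after it."""
    words = text.split()
    table = defaultdict(list)
    for current, following in zip(words, words[1:]):
        table[current].append(following)
    return table

def parrot(table, start, length, seed=0):
    """Generate text by repeatedly sampling a next word that was
    observed after the current word in the training text."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length - 1):
        candidates = table.get(output[-1])
        if not candidates:  # no observed continuation: stop early
            break
        output.append(rng.choice(candidates))
    return " ".join(output)

table = build_bigram_table(CORPUS)
print(parrot(table, "the", 8))
```

Every pair of adjacent words in the output has been seen in the training text, yet nothing in the program represents what a parrot or a pattern is. Words that appear more often after a given word are more likely to be chosen, which is the only "knowledge" the sketch has.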
In her review of the paper, Bode outlined the risks of the trend toward ever-larger models, which include (i) the formidable environmental and financial costs of training and processing; (ii) the undocumented and unconsented data on which they rely which, among other problems, a) anchors these costly models to a static timeframe and b) over-represents the perspectives of the young, white, male, English speakers who dominate internet sites such as Reddit while under-representing everyone else.
This means that LLMs reproduce the biases of their training sets: e.g. correlating Muslims with violence, and parroting hate speech and toxicity. There is no easy way to fix these problems: for example, when a commercialized use of GPT-3 for Dungeons and Dragons began churning out stories that featured sex with children, the company’s fix was a filter that ended up flagging references to an “8-year-old” laptop. The underlying problem is a model that lacks a built-in understanding of the world that would help to contextualize the difference between an 8-year-old child and an 8-year-old laptop. This creates a dilemma for any programmer keen to monitor conversations about 8-year-old children while leaving old laptop discussants to their own devices. It also demonstrates the indissoluble tie between LLMs and surveillance.
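The laptop false positive is easy to reproduce with a pattern-matching rule. The sketch below is hypothetical: the pattern and the name `naive_flag` are my own assumptions, and nothing here describes the filter the company actually deployed. It only shows why a rule that sees surface forms cannot tell what the phrase is describing.

```python
import re

# Hypothetical rule: flag any "<number>-year-old" phrase.
# The pattern sees only the characters, not the noun that follows.
AGE_PATTERN = re.compile(r"\b\d{1,2}-year-old\b")

def naive_flag(text):
    """Return True if the text contains an age-like phrase."""
    return bool(AGE_PATTERN.search(text))

print(naive_flag("My 8-year-old laptop finally died."))  # True: a false positive
print(naive_flag("We adopted a senior dog."))            # False
```

Adding more rules (for example, a list of exempt nouns such as "laptop") only moves the problem, since the list of things that can be eight years old is open-ended.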
In a final problem Bode described (iii), the “bigger-is-better” approach to language modeling squeezes out research on alternatives that use smaller and carefully curated datasets. These documentable corpuses could be designed to reach a broader sample of the world’s many languages and speakers, a clear example of the ethical stakes of data curation.
All of these risks, Bode clarified, worsen the underlying inequality. For whereas marginalized populations bear a disproportionate share of the environmental and social risks of LLMs, the benefits overwhelmingly accrue to an elite that includes some of the world’s richest people and corporations. In this way, “bigger is better” becomes a self-fulfilling prophecy for companies that already have ready access to troves of data and can spend millions of dollars (and emit tons of carbon) with ease.
Bender, Gebru, et al. conclude with a series of recommendations which, as outlined by Bode, include (i) research budgets that cover the costs of data curation and documentation; (ii) attention to the needs of marginalized communities (especially early in the design process); (iii) improved evaluation (asking how a model achieves its results instead of narrowly focusing on benchmarks and leaderboards); and (iv) researching technologies that attend to the meaning and interpretation of language.
Bode then passed the baton to Stone, whose “whirlwind tutorial” explained how LLMs came to dominate language research in the first place. Computers, Stone emphasized, do not process language as human beings do. Whereas humans have innate and acquired capabilities for common-sense reasoning which help them to disambiguate “throughout” from “threw out,” computational models must rely on the probabilities gleaned from datasets. Thus, “shopkeeper threw out” tends to correlate with spoiled milk while “traffic throughout” tends to co-occur with references to cities such as Chicago or New York. The upshot, as Stone put it, is that AI researchers today are “not doing cognitive modeling” of a human-like form of critical thinking, but, rather, are “doing brute force.”
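Stone's example can be sketched as a tiny scoring exercise. The co-occurrence counts below are invented for illustration, as are the names `score` and `guess`; a real system would estimate such numbers from enormous datasets and would use far richer models than this add-one-smoothed product of probabilities.

```python
import math
from collections import Counter

# Hypothetical counts of how often each context word appeared near
# each candidate phrase in some imagined training data.
CONTEXT_COUNTS = {
    "threw out": Counter({"shopkeeper": 8, "milk": 9, "spoiled": 7, "traffic": 1}),
    "throughout": Counter({"traffic": 9, "chicago": 6, "city": 8, "milk": 1}),
}
VOCAB = set().union(*CONTEXT_COUNTS.values())

def score(candidate, context_words):
    """Multiply add-one-smoothed estimates of P(word | candidate)."""
    counts = CONTEXT_COUNTS[candidate]
    total = sum(counts.values()) + len(VOCAB)
    return math.prod((counts[w] + 1) / total for w in context_words)

def guess(context_words):
    """Pick the candidate whose observed contexts best match."""
    return max(CONTEXT_COUNTS, key=lambda c: score(c, context_words))

print(guess(["shopkeeper", "milk"]))   # threw out
print(guess(["traffic", "chicago"]))   # throughout
```

The program never consults what a shopkeeper does or what traffic is. It picks whichever spelling has kept company with the surrounding words more often, which is the "brute force" Stone describes, scaled down to a handful of numbers.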
Among the many observations Stone highlighted was the idea that “language is big and full of rare events.” This means that any given usage of a word or phrase, even if commonplace (“threw out”), becomes that much more rare as the scope of analysis expands to include new coinages and novel combinations (including, for example, Bender, Gebru, et al.’s coinage of “stochastic parrot”). LLMs such as GPT-3 thus rely on “orders of magnitude more data than any human could experience in their lifetime.” And yet, because the needed patterns for any given usage will remain relatively few, large language models are (as I would put it) paradoxical: the more data on which any given model is trained, the more data (and computing power and carbon emissions) the next iteration will require to achieve significant improvements. As Stone explained, the transformer technology on which GPT-3 is built is one of a series of innovations which mitigated the problem of deep learning’s voracious need for data. Yet, as a recent article avers, the bigger-is-better approach has begun to yield “diminishing returns.” For Stone, the issue points to LLMs’ underlying “logic of surveillance.” In the quest for a perfect set of results “you might find yourself trying to capture all the language of everyone alive.”*
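The claim that language is "full of rare events" can be glimpsed even in a single made-up sentence. The sketch below counts two-word combinations and reports how many occur only once; the sample text and the name `singleton_share` are my own assumptions, and a fifteen-word sample is of course no substitute for the corpus statistics Stone has in mind.

```python
from collections import Counter

# Made-up sample text for illustration only.
TEXT = (
    "the shopkeeper threw out the spoiled milk and "
    "the shopkeeper threw out the stale bread"
)

def singleton_share(text):
    """Return (pairs seen exactly once, distinct pairs) for adjacent words."""
    words = text.split()
    pair_counts = Counter(zip(words, words[1:]))
    singletons = sum(1 for n in pair_counts.values() if n == 1)
    return singletons, len(pair_counts)

once, distinct = singleton_share(TEXT)
print(f"{once} of {distinct} distinct word pairs occur only once")
# prints "6 of 10 distinct word pairs occur only once"
```

Even in a sentence built to repeat itself, most distinct word pairs show up a single time. A model that learns from counts has little to go on for those pairs, which is one way to see why the appetite for more data never quite goes away.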
Nonetheless, if something like half the usages in any given database are truly rare, the rest will be somewhat familiar. Here the literature professor in me thinks of Gustave Flaubert’s famous lament on the limitations of “human speech” which the narrator of Madame Bovary (1857) likens to “a cracked kettle on which we tap crude rhythms for bears to dance to, while we long to make music that will melt the stars.” A consummate artist, Flaubert agonized over the difficulty of creating stunningly original and evocative language (“music that will melt the stars”). He worried about the terrible preponderance of hackneyed speech (“crude rhythms for bears to dance to” like Emma’s clichéd love notes or banal talk of spoiled milk or traffic jams). Of course, from the data-centric modeler’s standpoint, hackneyed speech is a kind of gift in enabling machines the more readily to discern useful patterns. That a computational model does a good job of “parroting” commonplace utterances—“Hope this finds you very well”— may help to explain why the ever sardonic Flaubert is believed to have owned a stuffed parrot that became his muse (the premise of Julian Barnes’s 1984 novel Flaubert’s Parrot).
If my segue to a famous novelist’s meditations on the pitfalls of language at one level throws light on the human condition from an artist’s point of view, at another, in the vast data soup of the scrapable internet, it hardly shows up at all. Wondering what GPT-3 might “think” about parrots, I fed a prompt to one of the few publicly available sources for a version of OpenAI’s technology. “Why,” I asked the model, “would Flaubert’s stuffed parrot ironize the notion of a stochastic parrot?” “That parrot is probably a parrot, and not a parakeet,” the model offered; after which it repeated this observation in slightly different words; asked “What exactly is ironizing here?”; generated a possible email subject line: “Re: Stochastic parrots: Ironizing or not?”; all before meandering into one of those mid-sentence silences that seem to occur when stochastic parrots run out of steam.
Lest anyone think that I am feeling smug here, I am not. I have seen too many 1960s-era narratives in which heroic characters (like Patrick McGoohan’s role in Episode 6 of The Prisoner) defeat troubling AIs by stumping them with a flummoxing query. When McGoohan asks an insidious teaching machine to answer a single question—“Why?”—the big clunker sputters and smokes, its circuits overloaded by a single glimpse of existential uncertainty. Such a man-vs-machine throwdown is neither my goal nor even my favorite episode of The Prisoner (though it does wonderfully anticipate Judea Pearl’s critique of data-centric machine learning in The Book of Why and a long line of haters on educational software).
While Stone showed us why it might be hard in the short run to give up the affordances of supersized datasets, Bode, resuming her presentation, returned to the question of ethical implications. Taking up the perspective of philosophy, she discussed Nick Bostrom’s “Vulnerable World Hypothesis,” comparing its advocacy of caution to the claims of Bender and her colleagues. “Technology policy,” Bostrom writes, “should not unquestioningly assume that all technology progress is beneficial.”
Turning to literary criticism, Bode contrasted the claim that stochastic parrots create an “ethical vacuum” (in the sense that no one is “accountable for their outputs”) to the standpoints of book and media history. According to these critical perspectives, “meaning” does not originate in a single actor (not even Flaubert!) but is, rather, “dispersed across…human and non-human agents.” Thus, it is not only authors but also editors, agents, and technologies of production, circulation and reception that “intervene in and shape the meaning of communication in ways that are not only intentional but infrastructural.” While Bode recognized the need to address the harms of LLMs, she wondered whether the attendant questions of meaning, responsibility, and agency ought to devolve to the humanness or not of the actor.
This is an intriguing question that, taken to its extreme, might suggest that in the effort to center urgent human needs, Bender, Gebru, et al. have abstracted communicative exchange from its infrastructural substrates—up to and including LLMs. To parrot GPT-3, “What exactly is ironizing here?”
*Reviewing this blog for a fact-check, Stone emphasized that the rarity at stake is mainly the combination of words: “If you pick two random low-frequency words that are sensible but not really related to one another, the phrase may or may not have ever been used before. The phrase ‘stochastic parrot’ itself was coined for Bender and colleagues’ paper and even today usages of this phrase still overwhelmingly trace back to their paper. The metaphors of a ‘stochastic mockingbird’ or (for our Australian partners) a ‘stochastic lyrebird’ fit Bender and colleagues’ point just as well. But google those phrases in quotes and (assuming this blog post is indexed) you’ll get the page you’re reading and no other—at least until our image catches on. As best we can tell, those phrases are tokened nowhere else on the internet.”