Augmenting the Data: Vukosi Marivate on African Natural Language Processing (NLP)

[This event is part of our AY 2022-23 series on large language models (LLMs). The event was organized and co-sponsored by Critical AI @ Rutgers, DIMACS, the Rutgers Department of Computer Science, and the Institute for the Study of Global Racial Justice. Below is a blog on the event. Click here for a video of the event.]

Esther Mahlangu, Ndebele Patterns (2018), Artsy

By Eleni Coundouriotis (English, University of Connecticut)

Vukosi Marivate (Chair of Data Science, University of Pretoria) makes a compelling case: we need to double down on the effort to capture “low resource” languages. There is too little existing data that can be used for natural language processing (NLP) of African languages, and there is a danger of growing inequity in representation. Without efforts to catch up, “our debt,” he says, will increase; the “bill” will be higher, and harder to pay down. Focusing on his native South Africa, but addressing language equity across the continent, Marivate spoke of the pull of English. The vast majority of data available through journalistic and social media is in English. Official government business is in English. Afrikaans, a language derived from that of Dutch colonizers, is the second best-represented language. Although it is in a much weaker position than English, it too outcompetes nine widely-spoken indigenous languages in South Africa.

Marivate has pushed creative initiatives to increase the data set and spoke in detail about his methodology. What stands out is his emphasis on discovery: thinking outside the box about where to find existing data (for example, in all the translated versions of the South African constitution). But he also pushed for the idea of intensified translation from English into the nine “low resource” languages (LRLs). He ended his talk on an inspiring note, showing evidence of impressive, collective efforts led by young data scientists throughout the continent. Initiatives such as Masakhane, a grassroots NLP community for Africans by Africans, are leading the charge to create more usable data for LRLs through various methods of digital augmentation.

As a humanist scholar of African literature with no expert knowledge of NLP, but a strong interest in the emerging field of critical AI studies, I was drawn to the historical contextualization of the language problem. The contact zone of colonial encounters involved language to a significant extent, with impacts that continue to this day. Marivate talked about the first translators of African languages into European languages: the Christian missionaries. Translating the Bible so as to facilitate conversion, missionaries created the first written texts in African languages; but since their translations were unsophisticated and utilitarian, they did not capture the languages accurately. Their mistakes have endured because they were codified early. This historical example demonstrates the advantage of today’s translators who have mastery of both the source and the target language. But it also underscores another dimension of the challenge that is perhaps not so pressing but interesting all the same: correcting established mistranslations.

1962 Conference at Makerere University (Photo: Zimbabwe’s National Gallery of Art)

Language disparity is one of the most debated areas in the field of African literary studies. There were many points of convergence between Marivate’s presentation and the ways in which the debate over the language of African literature has played out since the 1962 “Conference of African Writers of English Expression” at Makerere University. In attendance were Chinua Achebe and Ngũgĩ wa Thiong’o, two writers who were to become Africa’s literary giants and who clashed specifically on language use, debating this issue for decades to come.

At the conference, the question “what is African literature?” focused on a discussion of language. Although the conference drew together English speakers and thus represented only the Anglophones among the Europhone writers, its main preoccupation was with literatures in African languages, including literatures in Arabic and Swahili. The conference was forward-looking, with attendees asking this question of contemporary writers and imagining what the future of the emerging literature in modern forms such as the novel would be like. The writers were addressing their own aspirations and wondering what language they should write in.

Achebe became identified with the pragmatic argument that English was a world language that history gave him and hence he would use it and shape it so that it would capture his experience. Ngũgĩ has embraced the resistant argument. Marivate drew from Ngũgĩ’s Decolonising the Mind to establish how language is a form of cultural colonialism and not a neutral tool: “The bullet was the means of the physical subjugation,” Ngũgĩ wrote, “Language was the means of the spiritual subjugation.” Language shapes the mind and disrupts the education of youth (as Ngũgĩ shows movingly in his early novels), drawing them away from their community and family. English use has had a destructive impact. Marivate takes inspiration from Ngũgĩ’s lifelong project to sustain and develop expression in African languages across all spheres of creative and everyday uses of language.

But perhaps it is Achebe’s example that contains some provocative lessons for the types of translation necessary for NLP. Keeping in mind that Ngũgĩ’s commitment to write in Gĩkũyũ has kept him engaged in translation (he has translated his own work from Gĩkũyũ into English, including his masterpiece on the impact of globalization, Wizard of the Crow), we can see how translation is a constant practice, the results of which are never complete or definitive.

Xitsonga traditional drawing by Philemon Hlungwani

Achebe was also an obsessed translator. Developing a practice of inflection and appropriation, he aimed at transforming English so that it would capture Igbo rhythms and thought patterns. Whether he was wrongheaded or not in his approach, it was very influential. At the same time that he tried so hard to convey meaning from one language to another, Achebe retained a sense of the untranslatable. His novels include many untranslated Igbo words and expressions. As a result, he made available to non-Igbo speakers a vocabulary of Igbo words. This practice has become a signature of Europhone African literatures. Though these works from the continent are written in European languages such as English or French, they include a fairly significant footprint of African languages. African writers are unafraid to throw off their global readers by incorporating words, phrases, or even passages in African languages in their Europhone texts, signaling their embeddedness in their nation and, more specifically, their own language group. One wonders if contemporary African literature, much of which has been digitized, could provide a data source for African language NLP.

To be sure, this might not be practical or add up to much. Still, NLP researchers have the opportunity, in classic and accessible works like Achebe’s Things Fall Apart, to demonstrate a different lesson to a wide audience. Translation is not a transparent process. By choosing at times not to translate, Achebe allows Igbo words to gain layers of new meaning in his novels. In a sense, he is demonstrating how concepts might resist translation, but languages interpenetrate each other all the same. They do not exist in separate streams because their users can manipulate their flow.

As Simon Gikandi tried to convince the African Literature Association in a 2017 keynote, Achebe’s real contribution to the language debate was to demonstrate how even an African writer committed to using English might intentionally resist translation because the untranslated word is richer in meaning. By focusing on the term “chi” in Things Fall Apart (loosely translated as a personal deity or protective spirit, but, crucially, left untranslated in the novel), Gikandi showed that Achebe allowed “chi” to carry into English an opacity it already contained in Igbo. The context around the word and its use by the characters in the novel convey its meaning and the conceptual difficulty it carries. Words like “chi” are rich semantic nodes we return to explore over and over. Language use, even when seeming practical or utilitarian, is not transparent.

Marivate, who describes himself as speaking his father’s language, Xitsonga, and is married to a woman who speaks isiNdebele, no doubt recognizes the indeterminacy of all language, which is always contextual and always able to confute the demand for fixity. This is especially the case in a context of multiple languages shaped by a long history of colonization. NLP researchers working on English and other “high-resource” languages increasingly rely on the scale of data (scraped from the internet) and on the scale of computation to create models that autogenerate language without the benefit of human understanding. As he endeavors to adapt alternative NLP methods that do not rely on these data troves, Marivate is working closely with communities and with scholars of local languages and literatures. The datasets he is striving to build, he told us, are as much about creating these interdisciplinary and communal practices as about creating a benchmark dataset. I share his sense that this represents an important frontier not only for African language NLP but for NLP more generally.
