Jan 30, 2026

Language Accessibility in African Open Data: Barriers, Costs, and Interventions

Language Access

English · Research Note · 1 author

A comparative study of metadata accessibility across 12 African languages, quantifying the discoverability gap and identifying the highest-impact interventions.

Authors: Datum Africa Research Unit

This research note presents findings from a comparative study of metadata accessibility across 12 African languages, conducted by the Datum Africa Research Unit in 2025–2026.

Context: Africa has over 2,000 spoken languages, and the continent's 54 countries collectively recognize hundreds of official and national languages. Yet the dominant open data platforms and standards used across the continent operate almost exclusively in English, French, Portuguese, and Arabic. The remaining languages, spoken by hundreds of millions of people, have little or no infrastructure for open data documentation.

Study design: We selected 12 languages representing different language families, geographic regions, and writing systems: Amharic, Hausa, Igbo, Kinyarwanda, Lingala, Malagasy, Shona, Somali, Swahili, Tigrinya, Wolof, and Yoruba. For each language, we assessed four things: the availability of open data platforms with UI support, the availability of data documentation guides, the discoverability of datasets documented in that language, and the size of the community of active contributors working in that language.

Key findings: None of the 12 languages had full UI support on any major open data platform. Two (Swahili and Amharic) had partial UI support on one platform each. None had published data documentation guides. Datasets documented in any of these languages were systematically undiscoverable via search, because the search infrastructure was not built to handle the scripts or vocabulary of these languages.
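To illustrate how script and vocabulary handling can make a dataset invisible to search, the following is a minimal sketch, not a description of any platform examined in the study. It assumes a hypothetical indexer whose tokenizer recognises only ASCII letters and digits, and compares it with a Unicode-aware tokenizer. The function names and sample titles are illustrative assumptions.

```python
import re


def ascii_tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer that keeps only ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unicode_tokenize(text: str) -> list[str]:
    """Tokenizer using Python's Unicode-aware \\w class."""
    return re.findall(r"\w+", text.lower())


# Illustrative sample titles (assumptions, not data from the study).
amharic_title = "አማርኛ"          # the word "Amharic" written in Ethiopic script
swahili_title = "sensa ya watu"  # Swahili, roughly "census of the people"

# Script problem: the ASCII-only tokenizer produces no tokens at all,
# so nothing from this title would ever reach the index.
print(ascii_tokenize(amharic_title))
print(unicode_tokenize(amharic_title))

# Vocabulary problem: the Swahili title tokenizes without trouble, but an
# English query term such as "census" matches none of its tokens.
print("census" in ascii_tokenize(swahili_title))
```

The two cases call for different remedies: the first is addressed by Unicode-aware text processing, while the second needs multilingual vocabularies or cross-language query expansion.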

The discoverability gap was quantified using a controlled experiment: each of 40 datasets was documented twice, once in English and once in one of the study languages, with the data itself left identical. The English-documented versions were discoverable in 91% of search queries; the versions documented in a study language were discoverable in 12%, a gap of 79 percentage points.
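The arithmetic behind these figures can be sketched as follows. This is a minimal illustration under one assumption: that a dataset counts as discoverable for a query if it appears in that query's results. The note does not spell out the exact criterion, and the function names here are hypothetical. Only the 91% and 12% rates come from the study; the short boolean list is a toy input.

```python
def discoverability_rate(found: list[bool]) -> float:
    """Share of queries in which the target dataset appeared in the results."""
    if not found:
        raise ValueError("at least one query outcome is required")
    return sum(found) / len(found)


def gap_in_points(rate_a: float, rate_b: float) -> float:
    """Difference between two rates, expressed in percentage points."""
    return round((rate_a - rate_b) * 100, 1)


# Toy input (not study data): the dataset was found in 3 of 4 queries.
print(discoverability_rate([True, True, False, True]))  # 0.75

# Rates reported in the note.
english_rate = 0.91
study_language_rate = 0.12

print(gap_in_points(english_rate, study_language_rate))  # 79.0
print(round(english_rate / study_language_rate, 1))      # 7.6
```

Read this way, the English-documented versions were found roughly 7.6 times as often as the versions documented in a study language.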

Recommendations: The report outlines a phased approach to reducing the language accessibility gap, focusing first on the languages with the largest contributor communities (Swahili, Hausa, Yoruba, Amharic) and the highest volume of undocumented datasets, before expanding to others.