Session
Organizer 1: Technical Community, African Group
Speaker 1: Brumund Daniel, Government, Western European and Others Group (WEOG)
Speaker 2: Lilian Diana Awuor Wanzare, Government, African Group
Speaker 3: Rutunda Samuel, Private Sector, African Group
Speaker 4: Melissa Omino, Private Sector, African Group
Speaker 5: Ochola Viola, Government, African Group
Speaker 6: Deshni Govender, Intergovernmental Organization, Intergovernmental Organization
Format
Roundtable
Duration (minutes): 90
Format description: A 90-minute roundtable suits this session because it supports the in-depth, participatory dialogue needed to unpack the complex ethical, cultural, and policy dimensions of language data governance. Unlike traditional panels or lectures, the roundtable creates a space where diverse stakeholders (community organizers, policymakers, technologists, and funders) can collectively interrogate tensions between openness and sovereignty, share lived experiences, and co-design solutions. The extended timeframe allows for nuanced exploration of case studies (through aspirations articulated in the national AI strategies of African countries such as Kenya, Rwanda, and Ghana) while maintaining flexibility to pivot toward emerging themes, such as reciprocity models or decolonial licensing frameworks. The session will prioritize conversation and questions over presentation, avoiding a top-down AI discourse. Through this structure, participants can critically reflect on power dynamics, propose actionable alternatives, and forge alliances, laying the groundwork for policies that centre justice and sustainability in Africa’s NLP ecosystems.
Policy Question(s)
1. Licensing and ownership: How can AI training data licenses be adapted to protect cultural sovereignty and ensure equitable benefits for language communities, especially marginalized ones?
2. Equity: Which policy mechanisms, beyond restricting cross-border data flows, can prevent extractive practices in global AI initiatives targeting African languages?
3. Governance: Can the operationalization of national AI strategies (e.g., Kenya, Rwanda, Ghana) support community-led governance models for language datasets? If so, how? If not, why not?
What will participants gain from attending this session? 1. Critical awareness of the power that comes with control of IP: Participants will gain insights from our work on CC0 licensing and hear experts relay feedback from communities, namely that this licensing, while enabling open collaboration, can inadvertently marginalize communities by undermining local control and equitable resource allocation in AI development.
2. Experts will share perspectives from African language communities, policymakers, and practitioners on balancing openness with cultural sovereignty, data ownership, and sustainable investment models.
3. Not another roundtable: we will unpack the problems in a relatable way and also discuss practical alternatives, such as community-led licensing and reciprocity agreements, that prioritize equity and ethics and that could align with national AI strategies in countries like Kenya, Ghana, and Rwanda.
SDGs
Description:
Language is culture and culture is identity. Yet the digital identity of Africa is often skewed, manipulated, misinterpreted or disproportionately commercialised. Language data collection is characterized by a significant disparity between large-scale, publicly accessible resources and numerous smaller, isolated efforts. Mozilla Foundation seeks to positively impact the way in which local language data is viewed, collected, stored and utilised. Mozilla Common Voice is the world’s largest, most diverse crowdsourced multilingual open speech corpus, hosting 30,000+ hours across 180 different languages, and is an example of a successful community initiative. It is a self-service community platform, as well as a lab for linguistic inclusion and for navigating data governance issues in Natural Language Processing (NLP). However, recent years have seen an awakening and a change in sentiment among language communities (the speakers who crowdsource datasets for localized, community-led AI). Stakeholder dialogue between Mozilla, governments, funders and other actors about the consequences of the radically open (CC0) approach to data that Common Voice takes has become more prevalent, notably on how to keep serving the community whilst avoiding exploitative and extractive practices. The issues raised include inequitable investment, locally-sensitive community control and organizational power dynamics. This session intends to highlight the intended, unintended and ripple effects of CC0 (open public licences) on communities working on language data and language AI in Africa. The session will be structured in two components: governance and policy, followed by actionable strategies and a showcase of pilot projects.
Expected Outcomes
1. Interfacing this work with policy: We hope to discuss and obtain policy-level recommendations for integrating localized data ownership and moving towards a data-sovereign Africa as African countries operationalise their national AI strategies (e.g., Kenya, Rwanda, Ghana).
2. Stakeholder alignment (regionally and internationally): A shared understanding among policymakers, funders, and communities of the merits and demerits of the CC0 model and of the need for alternatives that promote more equitable resource allocation in AI development.
3. Initial results from work that empowers community ownership: We hope to share preliminary but practical results on frameworks for participatory data stewardship that enable communities to set terms for data use, access, and commercial benefit-sharing. This also includes definitions of metrics that measure ethical compliance in language dataset creation.
Hybrid Format: 1. Facilitating interaction between onsite and online: The session will be moderated in person. The onsite moderator will engage both onsite and online speakers with questions and will invite onsite and online participants to contribute statements and questions to the discussion. A note-taker will summarize the key points on a virtual whiteboard for online participants.
2. Ensuring the best possible experience for online and onsite participants: The thoughts and questions that the online audience shares in writing, via the chat or the Q&A function of the session's online platform, will be brought into the discussion in real time.
3. Complementary online tools/platforms: An online polling tool such as Menti or Slido will be used to keep participants both in the room and online engaged throughout the session.