IGF 2024 WS #254 The Human Rights Impact of Underrepresented Languages in AI

Session

Subtheme

Advancing human rights and inclusion in the digital age

Organizer 1: Gustavo Fonseca Ribeiro, Internet Society
Organizer 2: Kathleen Scoggin, American University
Organizer 3: Agarwal Bhoomika, Internet Society

Speaker 1: Gustavo Fonseca Ribeiro, Civil Society, Latin American and Caribbean Group (GRULAC)
Speaker 2: Agarwal Bhoomika, Civil Society, Asia-Pacific Group
Speaker 3: Kathleen Scoggin, Civil Society, Western European and Others Group (WEOG)

Moderator

Kathleen Scoggin, Civil Society, Western European and Others Group (WEOG)

Online Moderator

Gustavo Fonseca Ribeiro, Civil Society, Latin American and Caribbean Group (GRULAC)

Rapporteur

Agarwal Bhoomika, Civil Society, Asia-Pacific Group

Format

Roundtable
Duration (minutes): 60
Format description: A hollow-square layout, or a room with large tables and additional seats outside the configuration, is ideal for this 60-minute session. This arrangement allows speakers and attendees to face each other and communicate, creating an engaging and interactive environment. It also supports online participation by giving clear sightlines and ensuring that all participants can hear the speakers. A 60-minute timeframe suits this discussion: it leaves enough time for diverse speaker presentations and for questions from online and onsite audiences without overwhelming attendees, and it encourages concise presentations while still allowing substantive discussion.

Policy Question(s)

1. What impacts does training artificial intelligence (AI) systems on datasets with low-resource languages have on international human rights and AI policy?
2. What policies and lines of action can support ethical data generation, access to training datasets with low-resource languages for AI, and uptake by affected communities?
3. What policies can govern the collection, storage, usage and sharing of language data to ensure the cultural rights and interests of underrepresented communities?

What will participants gain from attending this session?

First, participants will gain knowledge of the intersection between natural language processing (NLP) in AI and cultural rights under international law, such as the rights to culture, to scientific progress, and to intellectual property, including traditional knowledge. They will also understand the effects of language underrepresentation on areas of AI policy, such as bias and safety. Second, they will be able to receive and provide input, in the form of needs, content, collaboration opportunities, and lines of action, on a digital rights agenda for cultural ownership and diversity of language in the global AI ecosystem. Third, participants will engage with the needs of different NLP/AI-related communities in the Global South regarding training datasets in their language of origin, and explore how copyright arrangements can be designed to suit the needs of each. Fourth, they will be equipped to create incentive mechanisms that promote ethical data generation for low-resource languages.

SDGs

2.3
3.7
3.d
4. Quality Education
8. Decent Work and Economic Growth
8.9
9. Industry, Innovation and Infrastructure
10. Reduced Inequalities

Description:

AI continues to shape industries and innovation. Language plays a critical role in expanding the reach and capabilities of generative AI models. However, many languages are still underrepresented in training datasets. These are called "low-resource languages." For example, Common Crawl is a free and open repository of web crawl data, widely used for training large language models. Yet 46.5% of its documents are primarily in English. English is followed by Russian, German, Japanese and Spanish, each comprising around 5% of the dataset. According to UNESCO, there are over 8300 languages worldwide, whereas Common Crawl contains only 160. Training AI systems on a diverse set of languages is a precondition for advancing human rights and inclusion in the digital age.

This session, "The Human Rights Impact of Underrepresented Languages in AI: The Unspoken South," will explore this issue by identifying problems and mapping solutions. First, it will underscore the policy and societal implications of language underrepresentation in AI systems. This will include the impacts on cultural rights under international human rights law, specifically the rights to take part in cultural life; to enjoy the benefits of scientific progress; and to benefit from the protection of scientific, literary or artistic production, including the protection of traditional knowledge. The session will also cover AI-specific policy implications, such as bias, fairness and safety. Second, the session will highlight lines of action to address the challenge. These may include (1) creating incentive systems for people to contribute data ethically; (2) raising awareness to mainstream the topic within the digital rights agenda; (3) advocating to unlock access to language datasets for communities that are culturally associated with the data therein; and (4) co-designing copyright licenses that attend to the needs of low-resource language communities affected by AI.

Expected Outcomes

First, the session will be documented in notes and summaries. These will be made publicly available on the webpage and shared with relevant AI and linguistic communities, so that the work done and the strategies brainstormed during this panel reach a broader audience. Second, a white paper will synthesize a map of the problems and solutions identified. This will include findings and contributions on (1) the impacts of low-resource languages in AI on cultural rights and AI policy, (2) lines of action to address language underrepresentation (e.g., data generation, unlocking access to datasets) and (3) policy arrangements that safeguard the rights and interests of affected communities. Third, the workshop will function as a stepping stone for digital rights advocacy and further dialogue in other fora, such as regional IGFs and RightsCon.

Hybrid Format: To facilitate interaction between onsite and online speakers and attendees, this panel will use large screens to display virtual speakers to the onsite audience and will direct questions to both onsite and online speakers. The session will include dedicated Q&A segments in which a moderator collects questions from online attendees, who can either post in the chat or speak through Zoom, ensuring equal participation. Onsite attendees will have similar opportunities to ask questions via handheld microphones, coordinated by an onsite facilitator. Additionally, to increase participation and interaction, the session will use real-time polling tools such as Slido to gather audience feedback. These polls will encourage active engagement and create a more inclusive atmosphere for both onsite and online participants.

Reference Document