IGF 2020 WS #122&20 Data to Inclusion: Building datasets in African Languages

    Time
    Tuesday, 10th November, 2020 (08:40 UTC) - Tuesday, 10th November, 2020 (10:10 UTC)
    Room
    Room 2
    About this Session
    How can access to data in low resource languages be strengthened? What policy measures can strengthen AI-driven innovation for reinforcing multilingualism online? The session will present projects from Africa and Asia on developing and using datasets in low-resource languages to strengthen access to information. It will be the springboard for the launch of the Open for Good Alliance by UNESCO, GIZ, IDRC, Mozilla Foundation among other founding partners.
    Subtheme

    Organizer 1: Bhanu Neupane, UNESCO
    Organizer 2: Irmgarda Kasinskaite, UNESCO
    Organizer 3: Prateek Sibal, UNESCO
    Organizer 4: Philipp Olbrich, GIZ
    Organizer 5: Naeem Uddin, Torwali Research Forum
    Organizer 6: Jaewon Son, Korea Internet Governance Alliance
    Organizer 7: Elliott Mann, Swinburne Law School

    Speaker 1: Dorothy Gordon, Civil Society, African Group
    Speaker 2: Philipp Olbrich, Government, Western European and Others Group (WEOG)

    Additional Speakers

    Workshop Speakers

    Speaker 1: Dr. Joyce Nabende, Makerere University, Uganda, Civil Society, African Group
    Speaker 2: Kathleen Siminyu, AI4D Network Africa, Kenya, Civil Society, African Group

    Speaker 3: Roy Boney Jr., Cherokee Language Program, Western European and Others Group (WEOG)
    Speaker 4: Subhashish Panigrahi, Civil Society, Asia-Pacific Group

    Introductory remarks by UNESCO

    Speaker list updated after combining the workshop proposal with WS #20 Exploring the future of endangered languages in cyberspace.

    Moderator

    Bhanu Neupane, Intergovernmental Organization, Asia-Pacific Group

    Online Moderator

    Irmgarda Kasinskaite, Intergovernmental Organization, Intergovernmental Organization

    Rapporteur

    Prateek Sibal, Intergovernmental Organization, Intergovernmental Organization

    Format

    Round Table - Circle - 90 Min

    Policy Question(s)

    The workshop seeks to address the following key questions:

    1. Can the Internet be used to revitalize minority, indigenous and endangered languages?
    2. How can Machine Learning and AI improve the availability of datasets in minority, indigenous and endangered languages?
    3. What kind of policy frameworks can enable further action to strengthen minority, indigenous and endangered languages, and multilingualism, in underserved regions?
    4. How can stakeholders best raise awareness of the issue of endangered and data-poor languages?

    This workshop will highlight the following issues:

    1. Digital language endangerment
    2. The acceleration of language extinction caused by the Internet
    3. The digital presence of endangered languages
    4. The low availability of resources for the development of technical solutions
    5. The lack of existing benchmarks and research in the development of digital solutions.

    Additionally, this workshop will explore the following opportunities:

    1. The use of machine learning, natural language processing, and artificial intelligence to combat language endangerment.
    2. The use of open source technology to combat language endangerment.
    SDGs

    GOAL 4: Quality Education
    GOAL 8: Decent Work and Economic Growth
    GOAL 9: Industry, Innovation and Infrastructure
    GOAL 10: Reduced Inequalities
    GOAL 16: Peace, Justice and Strong Institutions
    GOAL 17: Partnerships for the Goals

    Description:

    The ability to deal with human language is an essential attribute of all information and communication technologies. Although there are currently more than 7000 spoken languages, fewer than 100 of these are flourishing in the digital world with advanced language understanding and spoken language communication technologies.

    In the case of low-resource, minority and endangered languages, there is a recognised need to develop solutions which ensure these languages still have a place on the Internet. In particular, gaps remain in access to data for training statistical machine learning systems, which could be leveraged to develop downstream applications. Such applications could provide for the digital inclusion of speakers of low-resource languages and hence their active participation in knowledge societies.

    The UNESCO publication “Steering AI and Advanced ICTs for Knowledge Societies”, launched at IGF 2019, identified “strengthening cooperation between civil society and research institutes for solving problems facing local communities, for novel data collection models based on citizen science that can create data sets for AI that respect international norms for privacy and data protection” as an option for action to address the gaps in the availability of data for development and use of AI in endangered African languages (Hu, et al. 2019).

    This workshop is proposed as a follow-up to the above recommendation and will extend beyond the focus on Africa to encompass a broader discussion on the impact of the Internet and technology on endangered languages.

    The workshop would enable North-South and North-South-South collaboration at the IGF 2020 and would develop networks and an agenda for the workstream on AI, Data and Languages for IGF in Addis Ababa. It will further provide useful inputs for the International Decade of Indigenous Languages (2022-2032).

    Expected Outcomes
    1. A greater understanding of stakeholder- and youth-specific roles in the digital safeguarding of endangered languages.
    2. Outlined strategies for the next phase of dataset development in endangered languages, particularly in Africa.
    3. An agenda for policy advocacy for language technologies and dataset development as part of the International Decade of Indigenous Languages, to be launched in 2022.
    4. A framework for North-South and North-South-South collaboration.

    Discussion Facilitation: 

    Beyond the presentations by the speakers, this workshop will include a large open floor component, where participants can raise questions and comments with the speakers and with other participants.

    The moderator will seek to garner participation from a wide variety of attendees – with a particular focus on those from underrepresented regions and demographics, such as the Global South and youth respectively.

    During the discussion time allocated in the latter half of the session, discussion will be guided by the aforementioned policy questions, and by the earlier presentations by the speakers.

    The organisers anticipate that representatives from the following stakeholders will be in attendance:

    1. AI for Development Network – Africa
    2. Data Science for Social Impact – University of Pretoria Research Group
    3. Data Science Nigeria
    4. Masakhane – Machine Translation for African Languages
    5. Deep Learning Indaba – African Machine Learning Conference
    6. UNESCO Chair in Data Science and Analytics, University of Essex, United Kingdom
    7. UNESCO Chair in Artificial Intelligence, University College London, UK
    8. UNESCO Category 2 Centre – International Research Centre on Artificial Intelligence (IRCAI), Slovenia
    9. African Academy of Languages
    10. GIZ, Germany
    11. IDRC, Canada (TBC)
    12. Universal Labelling Project, USA (TBC)
    13. European Language Resources Association (ELRA) (TBC)
    14. Open for Good Alliance

    Relevance to Internet Governance: Part of the importance of Internet Governance is how it evaluates the consequences of the Internet's rapid rise. Language endangerment should be seen as one such consequence.

    The Los Pinos Declaration on the Decade of Indigenous Languages (2022-2032) called for the design of, and access to, sustainable, accessible, workable and affordable language technologies. Both UNESCO’s 2003 Recommendation concerning the Promotion and Use of Multilingualism and Universal Access to Cyberspace and the 2020 Los Pinos Declaration recognize the potential of digital technologies in supporting the use and preservation of low- or under-resourced languages.

    This workshop will analyse the work needed to right the wrong created by the Internet, by focusing on the technologies and policy settings needed to revitalise endangered languages. For example, UNESCO’s International Conference Language Technologies for All (LT4All): Enabling Linguistic Diversity and Multilingualism Worldwide, organized in December 2019, underlined efforts to develop technologies ranging from spelling and grammar checkers to speech and speaker recognition, machine translation for text and audio, speech synthesis, and spoken dialogue, among others, as important areas for enabling linguistic diversity and multilingualism.

    This workshop will also highlight the work remaining to extend these technologies to under-resourced languages. The current gap puts the users of many languages – including the vast majority of Indigenous languages – at a disadvantage, creating a digital divide and placing their languages in danger of digital extinction, if not complete extinction. This work will require a multistakeholder effort – further linking this workshop to Internet Governance.

    Relevance to Theme: The proposed session is related to the selected thematic track of “Digital Inclusion.” The Internet frequently has little or nothing to offer in marginalized and endangered languages, and indeed oppresses them; as a result, these language groups are underserved and suppressed, and lack a digital presence.

    Particularly in Africa, UNESCO has been vocal about the need for enhancements in language resources to enable technology solutions which can assist people limited by their language to interact in cyberspace. A salient example, in the context of the COVID-19 crisis, is how investment in open solutions for language technologies could build long-term capacity to respond to public health crises: text analysis methods can be used to forewarn health authorities of an outbreak (Tsvetkov 2017). For instance, social media posts in endangered languages could be analysed for signs of a flu outbreak. This capacity simply does not exist at the moment – which is an issue this workshop seeks to address.

    Online Participation

    Usage of IGF Official Tool. Additional Tools proposed: UNESCO Teams to facilitate participation of UNESCO field office networks in Africa.

    Agenda

    1. 8 mins – Introduction
    2. 32 mins – Panel discussion
    3. 37 mins – Open Floor Discussion
    4. 10 mins – Conclusion & Session Summary