Project to create Inuktut language software and perform new text alignment of the Legislative Assembly of Nunavut proceedings

While Inuktut is an official language in the territory of Nunavut, there are far fewer technologies, tools and resources available for Inuktut learners and language professionals than for the territory's other two official languages, English and French.

The NRC is collaborating with the Pirurvik Centre and the Government of Nunavut to develop new technologies for Inuktut language learners and professionals, and to reinforce Inuktut's status as an official language in Nunavut.

Collaborators

Pirurvik Centre

The Pirurvik Centre is a centre of excellence for Inuit language, culture, and well-being. It was founded in the fall of 2003, and is based in Nunavut's capital, Iqaluit. Access to the Pirurvik Centre's expertise on the Inuktut family of languages is a tremendous asset for the NRC team.

Government of Nunavut

The Legislative Assembly of Nunavut kindly provided the NRC with an updated version of the Nunavut Hansard, covering proceedings between 1999 and 2017.

Objectives

  • Develop a new suite of tools for people who work with or are learning Inuktut: an updated version of WeBInuk and a new set of tools called iutools
  • Perform automatic sentence alignment of a new Nunavut Hansard corpus (1999-2017)
  • Create a machine translation system for translating between Inuktut and English, and foster research into machine translation between these two languages

Deliverables

Activities

Software tools for Inuktut as an official language: iutools

In October 2018, the NRC and the Pirurvik Centre began to collaborate on building software tools to assist people who work with Inuktut. Though it is an official language of Nunavut, Inuktut still lacks tools that are taken for granted in English and French. This project helped to fill the gap by implementing and deploying a web search engine, an aid to translators, a spell checker, and other tools for learners of the language, linguists, and people who work with Inuktut on a regular basis, such as employees of the Nunavut government. The project builds on ground-breaking work carried out previously at the NRC on morphological analysis, and on the creation of a tool for translators called WeBInuk. The first version of the new tools, called iutools, was deployed in 2020 and is freely accessible on the Web.

Transcription and editing of 75 hours of Inuktut speech

To support research into automatic speech recognition at the Computer Research Institute of Montreal (CRIM), the NRC funded Pirurvik teams to transcribe 75 hours of recorded Inuktut speech, and edit the transcriptions to a high level of quality.

Automatic sentence alignment

In the past, research by computational linguists on Inuktut benefited greatly from a version of the Legislative Assembly of Nunavut proceedings – the Nunavut Hansard – with Inuktut and English sentences aligned with each other. This parallel corpus (body of text) was created and open-sourced by the NRC in 2005. The NRC project team has now completed automatic sentence alignment of the Nunavut Hansard proceedings between 1999 and 2017. The new parallel corpus is much larger than the version released by the NRC in 2005.

To check the quality of the automatic alignment, experts employed by the Pirurvik Centre manually aligned around 8,500 sentence pairs from the 1999-2017 Nunavut Hansard. This "gold standard" alignment enabled the NRC team to improve its automatic alignment algorithm. We expect the new sentence-aligned Nunavut Hansard corpus, along with the manually aligned "gold standard" subset, to encourage new work on Inuktut by the international research community.
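
As a rough illustration of how a manually aligned gold standard can be used to score an automatic alignment, the sketch below computes precision, recall and F1 over alignment links. It is not the NRC's evaluation code: the link representation, the `score_alignment` function and the toy data are all assumptions made for this example.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: one alignment link pairs a group of
# Inuktut sentence indices with a group of English sentence indices.
Link = Tuple[Tuple[int, ...], Tuple[int, ...]]


def score_alignment(predicted: Set[Link], gold: Set[Link]) -> Dict[str, float]:
    """Compare automatic links with gold-standard links (exact match only)."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data (invented): the gold standard merges Inuktut sentences 1 and 2
# into one English sentence, while the automatic aligner misses the merge.
gold = {((0,), (0,)), ((1, 2), (1,)), ((3,), (2,))}
predicted = {((0,), (0,)), ((1,), (1,)), ((3,), (2,))}

scores = score_alignment(predicted, gold)
print(scores)
```

Exact-match scoring is the simplest choice; a real evaluation might also give partial credit to links that overlap a gold link.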

WMT 2020 Shared Task on Machine Translation between Inuktut and English

A long-standing series of annual workshops called 'WMT' allows competing teams to compare the performance of their machine translation systems for various language pairs. In 2020, for the first time, one of the language pairs was Inuktitut-English. Systems were evaluated on the quality of their outputs in both translation directions: from English to Inuktitut, and from Inuktitut to English. Inuktitut is the first polysynthetic language to be included in the WMT competition.

The NRC sponsored the human evaluation of translations from English to Inuktitut. We funded experienced translators employed by the Pirurvik Centre, all fluent in Inuktitut, to score a large number of outputs from machine translation systems translating into Inuktitut. In addition, the NRC created its own system for translating between Inuktitut and English (in both directions).

The evaluation was a tremendous success: 12 machine translation systems from around the world participated, including the NRC's. System outputs were anonymized to prevent conflicts of interest. The NRC's machine translation system came second in both language directions. The inclusion of Inuktitut in the WMT competition is spurring research into the language by machine translation experts, which will ultimately benefit Inuktitut speakers.

Publications

Project team

Alain Désilets

Natural language processing applications developer. Led the WeBInuk project, which allowed translators to search large amounts of English-Inuktut parallel content. He is now developing an updated version of WeBInuk, and iutools for Inuktut.

Eric Joanis

Computational linguistics; statistical natural language processing; machine translation; software optimization and robustness.

Rebecca Knowles

Machine translation researcher; computer-aided translation; low-resource machine translation.

Gavin Nesbitt

Director, Pirurvik Centre

Contact us

Janet Tamalik McGrath
Inuktut Language Consultant, Pirurvik Centre
Email: info@pirurvik.ca

The Legislative Assembly of Nunavut
Email: leginfo@assembly.nu.ca

Roland Kuhn
Project Leader, Indigenous Languages Technology Project, NRC
Telephone: 613-993-0821
Email: Roland.Kuhn@nrc-cnrc.gc.ca
LinkedIn: Roland Kuhn

Related links