Toward automatic speech recognition in audio recordings of Indigenous languages

While there are tens of thousands of hours of speech recordings in Indigenous languages, most of these recordings are not annotated or indexed, so the archives cannot be searched by keyword. The Computer Research Institute of Montreal and its collaborators developed language labelling and speech segmentation tools for recordings of Indigenous languages, to support more efficient annotation and, ultimately, to enable automatic speech recognition.

Work began with Inuktitut and Cree and was expanded to other languages such as Innu, Dénésuline, Tsuut'ina and Michif. Exploring specific properties of different Indigenous languages helps ensure that our tools are applicable to a broad range of languages.

Collaborators

Computer Research Institute of Montreal

The Computer Research Institute of Montréal (CRIM) has a long and distinguished record of accomplishments in technologies related to speech recognition. Its audio content indexing technology catalogues the spoken content of very large audio databases, making such content accessible through search engines. CRIM has applied this technology to the archives of the National Film Board of Canada and to the collected testimonies of the Bastarache investigative commission. CRIM's speaker recognition technology, which identifies the person who generated a particular segment of speech, has consistently ranked among the top entries in international evaluations of speaker recognition systems, and is now used worldwide.

Canadian Broadcasting Corporation – East James Bay Cree

The Canadian Broadcasting Corporation (CBC) creates programming by and for Indigenous peoples, providing services in 8 Indigenous/Inuit languages, and possesses a wealth of recordings of Indigenous languages being spoken.

Pirurvik Centre – Inuktitut

Pirurvik is a centre of excellence for Inuit language, culture and well-being. It was founded in the fall of 2003, and is based in Nunavut's capital, Iqaluit. Pirurvik is selecting audio recordings of spoken Inuktut that reflect original language use, with a depth of vocabulary and without 'thinking in English', and transcribing them into written form.

  • Prairies to Woodlands Indigenous Language Revitalization Circle (P2WILRC) – Michif
  • CKAU-Kushapetsheken – Innu
  • Missinipi Broadcasting Corporation – Dénésuline
  • Carleton University, School of Linguistics and Language Studies

Objectives

  • Make it easier to access recordings of Indigenous languages being spoken (audio files) and to create reliable transcriptions of those files
  • Perform speech segmentation for easier data annotation
  • Conduct experiments toward automatic speech recognition (ASR) for Indigenous languages: Inuktitut, East Cree, Innu, and Dénésuline

Deliverables

  • Automatic speech recognition (ASR) for Inuktitut, East Cree, Innu and Dénésuline
  • Practical tools in ELAN to make organizing and transcribing speech data much easier

Activities

Source audio material, East James Bay Cree, 2018-2019

The CBC gave CRIM access to East James Bay Cree recordings so that CRIM could develop audio segmentation and speech recognition tools for recordings in Indigenous languages. The CBC shared over 1,343 hours of radio programming originally broadcast by CBC North from January 2015 to December 2016. These 1,312 audio files, which contain studio- and telephone-quality speech as well as music, were critical to the success of the project.

Production and editing of 75 hours of Inuktut speech, 2018-2019

The Pirurvik Centre selected materials in spoken Inuktut that reflect original language use, with a depth of vocabulary and without 'thinking in English', and transcribed those recordings into written form. CRIM used the transcribed Inuktut speech data to develop audio segmentation and automatic speech recognition tools.

Collecting, annotating, and time-aligning Tsuut'ina narratives, 2019-2020

Christopher Cox from Carleton University and Tsuut'ina Elder Bruce Starlight collected 25-30 hours of studio-quality audio recordings of Tsuut'ina textual and lexical material, read by Bruce Starlight. They produced time-aligned, bilingual transcripts of all audio recordings and an illustrated publication of some recorded narratives, to be distributed to Tsuut'ina Nation citizens and archived with the Tsuut'ina Museum for long-term preservation.

Developing Michif lexical resources, 2019-2020

Olivia Sammons from Carleton University and Verna DeMontigny from the Prairies to Woodlands Indigenous Language Revitalization Circle collected audio recordings and an accompanying database of Michif lexical materials representing 350 pages or about 250 hours of audio and their ELAN transcripts. All project material will be deposited with the Prairies to Woodlands Indigenous Language Revitalization Circle and additional Métis community organizations or language educators.

Speech segmentation for easier data annotation, 2018-2020

In collaboration with Indigenous language researchers, CRIM determined which tools to prioritize, and developed simple tools to segment speech recordings:

  • Voice activity detection: separates audio files into speech and non-speech data. CRIM developed and tested a detector based on a deep neural network, trained on large amounts of speech in various languages. See Figure 1.
  • Speaker retrieval: enables retrieval of speech segments by a given speaker, regardless of language, using a short sample of the speaker's voice (query-by-example). CRIM developed a system based on i-vectors and improved it with a deep learning approach.
  • Speaker diarisation: helps distinguish different speakers from one another in a conversation, regardless of language.
  • Multi-track separation: separates voices during interviews or panels where several speakers are each wearing a lapel/personal mic.
  • Language labelling and retrieval: identifies spoken Inuktitut and East Cree from among 32 languages, based on a 5-second sample. The tool also enables retrieval of speech segments within a recording that are spoken in a given language. CRIM recently expanded this tool to Innu and Dénésuline.

Figure 1: Automatic segmentation displayed in ELAN linguistic annotation software
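
To make the idea of voice activity detection concrete, here is a minimal sketch that splits a signal into speech and non-speech regions. It is not CRIM's detector: the project used a deep neural network, whereas this example swaps in a simple frame-energy threshold so that it stays short and self-contained. The function names, the threshold and the synthetic test signal are all assumptions made for illustration.

```python
import math


def frame_energies(samples, frame_len):
    """Return the mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def detect_speech(samples, sample_rate, frame_ms=20, threshold=0.01, min_frames=3):
    """Return (start_s, end_s) segments whose frame energy exceeds the threshold.

    This energy threshold is a stand-in for a trained detector (assumption).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    active = [e > threshold for e in frame_energies(samples, frame_len)]
    segments = []
    run_start = None
    # A trailing False acts as a sentinel that closes a run reaching the end.
    for i, is_active in enumerate(active + [False]):
        if is_active and run_start is None:
            run_start = i
        elif not is_active and run_start is not None:
            if i - run_start >= min_frames:  # drop very short bursts
                segments.append((run_start * frame_len / sample_rate,
                                 i * frame_len / sample_rate))
            run_start = None
    return segments


# Synthetic test signal: 0.5 s of silence, 1.0 s of a 220 Hz tone, 0.5 s of silence.
RATE = 8000
signal = ([0.0] * (RATE // 2)
          + [0.5 * math.sin(2 * math.pi * 220 * n / RATE) for n in range(RATE)]
          + [0.0] * (RATE // 2))
segments = detect_speech(signal, RATE)
print(segments)
```

On this synthetic signal the sketch reports a single segment from 0.5 s to 1.5 s. Real recordings with music, telephone noise and overlapping speakers are much harder, which is why a trained detector is preferable to a fixed threshold.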

Toward automatic speech recognition (ASR) for polysynthetic languages

Most Indigenous languages spoken in Canada, including Inuktut and Cree, are polysynthetic. A typical word is made up of about 7-10 small pieces called morphemes. Because so many different combinations of morphemes are possible, the majority of words in a given text or speech have never occurred before in the history of the language. This poses great difficulties for automatic speech recognition (ASR). ASR systems for languages like English and French rely on words they have 'heard' before in acoustic training data. With polysynthetic languages, the system has never heard most of the words it would encounter in a new recording.
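
The effect described above can be illustrated with a toy simulation. The sketch below builds "words" by chaining 7 invented placeholder morphemes, then measures how many test items were never seen in training, first at the word level and then at the morpheme level. The morpheme inventory is made up and is not taken from any real language, and drawing morphemes uniformly at random is a strong simplifying assumption; real languages constrain which combinations are possible.

```python
import itertools
import random

# Invented placeholder "morphemes" (assumption); not from any real language.
MORPHEMES = ["ka", "ti", "mu", "sa", "ni", "lo", "pe", "ru", "vi", "ta",
             "ko", "mi", "su", "na", "li", "po", "re", "va", "tu", "ki"]


def make_word(rng, length):
    """Build one toy word as a tuple of randomly chosen morphemes."""
    return tuple(rng.choice(MORPHEMES) for _ in range(length))


def oov_rate(train_items, test_items):
    """Return the fraction of test items that never appear in training."""
    seen = set(train_items)
    test_items = list(test_items)
    unseen = sum(1 for item in test_items if item not in seen)
    return unseen / len(test_items)


rng = random.Random(0)  # fixed seed so the run is repeatable
train_words = [make_word(rng, 7) for _ in range(5000)]
test_words = [make_word(rng, 7) for _ in range(1000)]

# Word level: each whole word is one vocabulary item.
word_oov = oov_rate(train_words, test_words)

# Morpheme level: flatten every word into its pieces before comparing.
morph_oov = oov_rate(itertools.chain.from_iterable(train_words),
                     itertools.chain.from_iterable(test_words))

print(f"word-level OOV rate:     {word_oov:.3f}")
print(f"morpheme-level OOV rate: {morph_oov:.3f}")
```

In this toy setup nearly every test word is unseen, while every morpheme has already been seen in training. That contrast is the motivation for modelling units smaller than the word.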

CRIM's experiments on ASR for Inuktut and East Cree have focused on determining the best unit for acoustic modeling: morphemes, syllables, or hybrid units, combined with models of word frequency. The researchers also studied two different types of ASR systems: those trained to recognize speech from a variety of speakers (speaker-independent) and those trained to recognize speech from a particular speaker (speaker-dependent). The CRIM researchers have made major strides and increased the accuracy of ASR for Inuktut and East Cree. Though both are polysynthetic, they are unrelated and very different from each other phonetically, suggesting that conclusions drawn from both sets of experiments may apply to other polysynthetic languages. The experiments are described in detail in the publications below.

Deploying audio segmentation tools to communities, linguists and researchers: VESTA and ELAN

To meet the needs of Indigenous communities, linguists, and researchers, the audio segmentation tools developed in this project will be made accessible via two established platforms:

  • VESTA, a collaborative work platform for research software developed by CRIM and financed in part by CANARIE. VESTA provides access to advanced multimedia processing for content hosted on CANARIE servers.
  • ELAN, open-source software developed by the Max Planck Institute for annotating bodies of oral recordings (corpora). ELAN is an efficient tool for manual speech annotation on a PC that is broadly used in linguistics and language documentation.

CRIM has created an extension for ELAN that can be easily downloaded and provides access to all of the services offered by VESTA. This will enable multiple parties to collaborate on a corpus using VESTA tools, inside the familiar ELAN interface.

The team hopes to add other services to its VESTA-ELAN extension, such as voice-to-text alignment and keyword search, to enable the development of digital speech apps.
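
As a rough illustration of how segmentation output might be handed to an annotation tool, the sketch below writes a list of time-stamped segments as tab-delimited text, a kind of file that ELAN can import. This is not the format used by the VESTA-ELAN extension, which is not described here. The function name, the column names and their order, and the example segments are all assumptions.

```python
import csv
import io


def segments_to_tsv(segments, tier="speech"):
    """Serialise (start_s, end_s, label) segments as tab-delimited text.

    The column layout (tier, begin_ms, end_ms, annotation) is hypothetical.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["tier", "begin_ms", "end_ms", "annotation"])
    for start_s, end_s, label in segments:
        # Convert seconds to whole milliseconds for the time columns.
        writer.writerow([tier, round(start_s * 1000), round(end_s * 1000), label])
    return buffer.getvalue()


# Hypothetical segments, e.g. from a voice activity detector.
example = [(0.5, 1.5, "speech"), (2.25, 4.0, "speech")]
tsv_text = segments_to_tsv(example)
print(tsv_text)
```

Keeping the times in milliseconds and one segment per row makes the file easy to inspect by hand before importing it.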

Publications

Project team

  • Gilles Boulianne, Senior Researcher in Automatic Speech Processing, Computer Research Institute of Montreal
  • Vishwa Gupta, Senior Researcher in Automatic Speech Processing, Computer Research Institute of Montreal
  • Christopher Cox, Assistant Professor, Applied Linguistics and Discourse Studies, Carleton University
  • Olivia Sammons, Adjunct Research Professor, School of Linguistics and Language Studies, Carleton University

Contact us

Antonia Leney-Granger, Communications Agent
Computer Research Institute of Montréal
Telephone: 514-840-1234
Email: medias@crim.ca

Roland Kuhn, Project Leader
Indigenous Languages Technology Project
Email: Roland.Kuhn@nrc-cnrc.gc.ca
LinkedIn: Roland Kuhn