Project to segment and index audio recordings of Indigenous languages

 

While there are tens of thousands of hours of speech recordings in Indigenous languages, they are unfortunately not annotated or indexed, meaning it is not possible to perform a keyword search through archives. The NRC is working with the Computer Research Institute of Montreal and other collaborators on technologies to segment and index recordings of Indigenous languages, to enable keyword search and support more efficient annotation.

So far, our work has been focused on Inuktitut and Cree. We are now targeting other languages such as Tsuut'ina and Michif to explore their specific properties and ensure that our tools are applicable to a broad range of Indigenous languages.

Collaborators

Computer Research Institute of Montreal

The Computer Research Institute of Montréal (CRIM) has a long and distinguished record of accomplishments in technologies related to speech recognition. Its audio content indexing technology catalogues the spoken content of very large audio databases, making such content accessible through search engines. CRIM has applied this technology to the archives of the National Film Board of Canada and to the collected testimonies of the Bastarache investigative commission. CRIM's speaker recognition technology, which identifies the person who generated a particular segment of speech, has consistently ranked among the top entries in international evaluations of speaker recognition systems, and is now used worldwide.

Canadian Broadcasting Corporation

The Canadian Broadcasting Corporation (CBC) creates programming by and for Indigenous peoples, providing services in 8 Indigenous/Inuit languages, and possesses a wealth of recordings of Indigenous languages being spoken.

Pirurvik Centre

Pirurvik is a centre of excellence for Inuit language, culture and well-being. It was founded in the fall of 2003 and is based in Nunavut's capital, Iqaluit. Pirurvik is selecting audio recordings of spoken Inuktut that reflect original language use, with a depth of vocabulary and without 'thinking in English', and transcribing them into written form.

Objectives

  • Access recordings of Indigenous languages being spoken (audio files) and reliable transcriptions of those files; use the data to perform speech segmentation for easier data annotation
  • Create an indexation tool for keyword search in content

Deliverables

  • Audio indexation tools developed in this project will be made available through two established, openly accessible platforms: VESTA and ELAN.

Activities

Source audio material

The CBC is providing CRIM with access to East James Bay Cree recordings, so that CRIM can develop audio segmentation and analysis tools suitable for indexing audio recordings in Indigenous languages. The CBC has shared over 1,343 hours of radio programming originally broadcast by CBC North from January 2015 to December 2016. These 1,312 audio files, which contain studio/telephone quality speech as well as music, will be critical to the success of the project.

The Pirurvik Centre is selecting recordings of spoken Inuktut that reflect original language use, with a depth of vocabulary and without 'thinking in English' while speaking Inuktut, and is transcribing those recordings into written form. The transcribed Inuktut speech data will subsequently be used by the NRC and CRIM to develop speech recognition tools that will make it possible to search other Inuktut speech recordings using text queries. This will make it easier for people who speak Inuktut to access and navigate audiovisual documents in their language.

Speech segmentation for easier data annotation

CRIM is developing simple tools to segment speech recordings:

Figure 1: Automatic segmentation displayed in ELAN linguistic annotation software

  • Voice activity detection separates audio files into speech and non-speech data. CRIM developed and tested a deep neural network based detector, trained on large amounts of speech in various languages. See Figure 1.
  • Speaker retrieval is used to identify when a given speaker is talking, using a short sample of the speaker's voice (query-by-example). CRIM developed a system based on i-vectors and is currently improving it with a deep learning approach.
  • CRIM created a language labelling tool that can identify spoken Inuktitut and East Cree among 32 languages, based on a 5-second sample.

These tools can be used with software that linguists are familiar with and should make annotation of speech currently being collected easier for a variety of languages.
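
As a rough illustration of what voice activity detection produces, the sketch below marks a frame as speech when its energy exceeds a fixed threshold, then merges nearby frames into timed segments. It is a deliberately simple stand-in and not CRIM's deep neural network detector: the function names, the 20 ms frame size, the threshold and the gap length are all assumptions chosen for the example, and an energy threshold alone cannot tell speech from music.

```python
import math


def frame_energies(samples, frame_len):
    """Return the mean-square energy of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def detect_speech(samples, sample_rate, frame_ms=20, threshold=0.01,
                  min_gap_frames=5):
    """Return (start_sec, end_sec) segments whose frame energy exceeds threshold.

    samples: list of floats in [-1.0, 1.0] (hypothetical mono signal).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    active = [e > threshold for e in frame_energies(samples, frame_len)]

    # Collect runs of consecutive active frames as [first_frame, end_frame).
    segments = []
    start = None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            segments.append([start, i])
            start = None
    if start is not None:
        segments.append([start, len(active)])

    # Merge segments separated by a pause shorter than min_gap_frames.
    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] < min_gap_frames:
            merged[-1][1] = seg[1]
        else:
            merged.append(seg)

    frame_sec = frame_len / sample_rate
    return [(s * frame_sec, e * frame_sec) for s, e in merged]
```

The returned start and end times are the kind of information an annotation tool needs in order to display segments on a timeline, as in Figure 1.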

Indexation tool for keyword search in content

CRIM also plans to build systems that will make it possible to search for particular words or phrases in audio recordings in some Indigenous languages. This will not be full speech recognition and we will not be creating systems that are able to produce high-quality transcriptions of everything that was said in a recording. Rather, the systems will enable audio keyword searches, so that users will be able to search quickly through long audio recordings for particular words or topics. To reach that goal, we must adapt the main components of speech recognition which model words, phonemes and speech sounds, and find their limits when applied to Indigenous languages.

  • Word-based representations: Word-based representations do not work for Inuktitut. In English, a vocabulary of 20,000 words is large enough that only 5% of the words in a new text fall outside the vocabulary. In contrast, our Inuktitut document collection contains 1.3 million distinct words, yet in any new Inuktitut text about 60% of the words have never been seen before. This is a consequence of Inuktitut's polysynthetic structure, in which a single word can combine many morphemes. CRIM is developing new approaches that can model the rich vocabulary observed in many Indigenous languages in Canada without relying on a limited set of words. See Figure 2.

    Figure 2: Inuktitut vocabulary size vs text out-of-vocabulary rate. Out-of-vocabulary rate stays high even with very large vocabularies.

  • Phonetic transcriptions of East Cree: CRIM was able to automatically produce phonetic transcriptions of East Cree with less than 10% error, creating a system from scratch with only 4 hours of pre-transcribed material.

  • Exact word positions: CRIM showed that a speech recognizer trained on large amounts of English can find exact word positions in audio recordings, even for Inuktitut and Cree texts, which makes it possible to create audio books with synchronized text to be used as educational material and language learning apps. See Figure 3.

    Figure 3: Inuktitut text aligned with audio recording to enable read-along and other educational apps
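
The out-of-vocabulary rate discussed under word-based representations can be measured with a few lines of code. The sketch below is a minimal illustration, assuming text has already been split into word tokens; the function names are hypothetical and the token lists in the usage example are toy placeholders, not project data.

```python
from collections import Counter


def build_vocabulary(training_tokens, max_size):
    """Keep the max_size most frequent words seen in the training text."""
    counts = Counter(training_tokens)
    return {word for word, _ in counts.most_common(max_size)}


def oov_rate(vocabulary, new_tokens):
    """Fraction of tokens in new text that are missing from the vocabulary."""
    if not new_tokens:
        return 0.0
    unseen = sum(1 for token in new_tokens if token not in vocabulary)
    return unseen / len(new_tokens)


def oov_curve(training_tokens, new_tokens, sizes):
    """Return (vocabulary_size, oov_rate) pairs, one per requested size."""
    return [
        (size, oov_rate(build_vocabulary(training_tokens, size), new_tokens))
        for size in sizes
    ]


# Toy usage with placeholder tokens.
training = "a b a c a b d".split()
held_out = "a b c e".split()
curve = oov_curve(training, held_out, [1, 2, 4])
```

Plotting such pairs for increasing vocabulary sizes gives a curve of the kind shown in Figure 2. For a language where new word forms keep appearing, the rate stays high no matter how large the vocabulary grows.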

Deploying audio indexation tools to communities, linguists and researchers

To meet the needs of Indigenous communities, linguists, and researchers, the audio indexation tools developed in this project will be made accessible via two established platforms:

  • VESTA, a collaborative work platform for research software developed by CRIM and financed in part by CANARIE. VESTA provides access to advanced multimedia processing for content hosted on CANARIE servers.
  • ELAN, open-source software developed by the Max Planck Institute for Psycholinguistics to annotate bodies of oral recordings (corpora). ELAN is an efficient tool for manual speech annotation on a PC that is broadly used in linguistics and language documentation.

CRIM has created an extension for ELAN that can be easily downloaded and provides access to all of the services offered by VESTA. This will enable multiple parties to collaborate on a corpus using VESTA tools, inside the familiar ELAN interface.

In collaboration with Indigenous language researchers, our team determined which tools to prioritize to support their work, and these have been added to VESTA. Available speech segmentation technologies include:

  • Voice activity detection: separates speech from noise or music, using deep neural networks trained on linguistically diverse speech
  • Speaker diarisation: helps distinguish different speakers from one another in a conversation, regardless of language
  • Multi-track separation: separates voices during interviews or panels where several speakers are each wearing a lapel/personal mic
  • Language retrieval: enables retrieval of speech segments within a recording that are spoken in a given language, in 32 languages including East Cree and Inuktitut
  • Speaker retrieval: enables retrieval of speech segments by a given speaker, regardless of language.

As the project advances, the team hopes to add other services to its VESTA-ELAN extension, such as voice-to-text alignment and keyword search, which would enable the development of digital speech apps.
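
To give a sense of how query-by-example speaker retrieval can work, the sketch below ranks speech segments by the cosine similarity between a fixed-length voice vector for the query sample and one for each segment. Cosine scoring is one common way to compare such vectors (i-vectors, for instance); the source does not say which scoring method CRIM uses. The function names, the 0.7 threshold and the assumption that a vector has already been extracted for each segment are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_speaker_segments(query_vector, segments, threshold=0.7):
    """Return (start_sec, end_sec, score) for segments that match the query.

    segments: list of (start_sec, end_sec, vector) tuples, where each vector
    is assumed to come from a separate speaker-embedding extractor.
    Results are sorted from best to worst match.
    """
    hits = []
    for start, end, vector in segments:
        score = cosine_similarity(query_vector, vector)
        if score >= threshold:
            hits.append((start, end, score))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)
```

Because the comparison is made on voice characteristics rather than on words, this kind of matching does not depend on the language being spoken.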

Project team

Gilles Boulianne

Senior Researcher in Automatic Speech Processing
Computer Research Institute of Montreal

Vishwa Gupta

Senior Researcher in Automatic Speech Processing
Computer Research Institute of Montreal

 

Contact us

Antonia Leney-Granger, Communications Agent
Computer Research Institute of Montréal

Telephone: 514-840-1234
Email: medias@crim.ca

Roland Kuhn, Project Leader
Indigenous Languages Technology Project

Telephone: 613-993-0821
Email: Roland.Kuhn@nrc-cnrc.gc.ca
LinkedIn: Roland Kuhn

Related links