The National Research Council of Canada's (NRC) multilingual text processing team carries out research and development in multilingual natural language processing (NLP). This includes machine translation and other language technologies for multilingual contexts.
In particular, we collaborate with government, industry, academia, and other partners on language technologies to support Canada's official languages and the revitalization of Indigenous languages. We also conduct foundational research and excel in international competitions where the calibre of our research and technology is benchmarked against other leaders in the field.
What we offer
Housed within the NRC's Digital Technologies Research Centre, our team's core competencies include:
- computer-assisted translation
- machine learning for natural language applications
- machine translation
- multilingual text mining
- social media analysis and modelling
- translation quality evaluation
We apply our expertise to:
- translation and language service providers, in support of the Government of Canada's Policy on Official Languages:
  - computer-assisted translation with the Translation Bureau, Courts Administration Services, and private sector language service providers
  - machine translation quality evaluation and estimation with the Translation Bureau
  - parallel corpus filtering and cleaning with the Translation Bureau and the Université de Montréal
  - translation routing with the Translation Bureau
  - translation equivalence error detection with the Public Service Commission of Canada
- learning technologies:
  - automatic language proficiency assessment and modelling
  - Indigenous Languages Technology Project: software and tools to support Indigenous language schools, educators, students, communities, and technology developers, with multiple partners
  - Language Comprehension Tool, a second language reading assistant for Canadian government employees, with the Translation Bureau
  - machine translation for second-language writing with Dublin City University and the Université du Québec en Outaouais
- intelligence, monitoring, and security:
  - detection of changes within an unfolding event in real time from news articles or social media
  - machine translation of social media content for business and security intelligence
Software and applications
- Portage statistical and neural network automatic translation software
- YiSi semantic machine translation evaluation metric software
- Multi-source translation for Sockeye neural machine translation system
- Document categorization toolbox
- Bayesian online change point detection – package for R programming language
Why work with us
Our team is a unique mix of world-class researchers with backgrounds in computational linguistics, engineering and machine learning, combined with strong, savvy software developers. Our collaborators appreciate our deep technical knowledge, our ability to deliver software components that are easy to integrate, and the state-of-the-art results and models we can deliver from their data.
We can take translation and other language technologies from research concepts all the way to products suitable for distributors and end users. Past examples of language technologies we have developed and delivered include word alignment for terminology extraction, statistical machine translation for language comprehension, and cross-lingual semantic similarity for detecting translation errors.
International competitions and shared tasks
Our team is a regular participant and top performer in several tasks at the annual Conference on Machine Translation (PDF, 259 KB) (formerly called Workshop on Machine Translation or WMT). We are also a leading participant in the International Workshops on Semantic Evaluation (SemEval), the Discriminating Similar Languages series, and the Native Language Identification evaluations.
Team results: WMT 2019
- Machine translation (low-resource languages) (PDF, 1.1 MB)
- Parallel Corpus Filtering (PDF, 396 KB)
- Metrics (PDF, 1.1 MB)
- Quality Estimation (PDF, 259 KB)
Team results: WMT 2018
Team results: SemEval
- Cross-lingual textual similarity 2016 and 2017, task 1
- Cross-lingual word sense disambiguation 2013, task 10
- Second language writing assistant 2014 task
Team results: Discriminating Similar Languages series
- A Report on the Third VarDial Evaluation Campaign 2019
- Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task 2016
- Overview of the DSL Shared Task 2015
- A Report on the DSL Shared Task 2014
Team results: Native Language Identification evaluations
- A Report on the 2017 Native Language Identification Shared Task (PDF, 317 KB)
- A Report on the First Native Language Identification Shared Task 2013 (PDF, 134 KB)
Team members
- Aidan Pine
- Anna Kazantseva
- Chi-kiu (Jackie) Lo
- Cyril Goutte
- Darlene Stewart
- Eddie Santos
- Éric Joanis
- Gabriel Bernier-Colborne
- Marc Tessier
- Michel Simard
- Patrick Littell
- Rebecca Knowles
- Roland Kuhn
- Samuel Larkin
- Serge Léger
- Sowmya Vajjala
- Yunli Wang
Contact us
Interested in applying our multilingual text processing expertise to your project? Contact our experts today!
Cyril Goutte
Team Leader, Multilingual Text Processing
Email: Cyril.Goutte@nrc-cnrc.gc.ca
Targeted industries
Information and communications technology; Analytics; Learning systems.
Locations
- Moncton
- Montréal Decelles
- Ottawa Montreal Road
- Edmonton
- Victoria
Selected publications
- Accurate semantic textual similarity for cleaning noisy parallel corpora using semantic machine translation evaluation metric: the NRC supervised submissions to the Parallel Corpus Filtering task
- Indigenous language technologies in Canada: assessment, challenges, and successes
- Real-time change point detection using on-line topic models
- Cost weighting for neural machine translation domain adaptation
- A challenge set approach to evaluating machine translation
- Transferring markup tags in statistical machine translation: a two-stream approach
- Feature space selection and combination for native language identification
- The trouble with SMT consistency
- Statistical Phrase-based Post-editing