Machine Translation for Less-Resourced Languages - Data Sparsity and Divergence for English-Maithili

Authors

  • Ritu Nidhi, Tanya Singh, D.K. Lobiyal

Abstract

Machine Translation (MT) is a resource-intensive task that requires large parallel corpora. Maithili,
one of the 22 scheduled Indian languages, is a low-resource language whose technology development is
hampered by data sparsity. In this paper, the authors discuss their efforts to (a) create English-Maithili
parallel corpora by partially bootstrapping from English-Hindi and Hindi-Maithili automatic translations,
(b) train a Statistical Machine Translation (SMT) model on the Moses platform to develop an English-to-Maithili
Machine Translation (EMMT) system, and (c) evaluate errors caused by divergence between the English-Maithili
pair. The paper also presents data, training, and evaluation statistics.

Published

2020-05-30

Issue

Section

Articles