ChavacanoMT: a corpus and evaluation of neural machine translation for Philippine Creole Spanish

Lead Researcher(s): Aileen Joan Vicente, Theresse Faith Amamampang, Dunn Dexter Lahaylahay, and Charibeth Cheng
Status: Published

Abstract/summary: Chavacano, formally known as Philippine Creole Spanish, is the only Creole language spoken in the Philippines. As with many languages, particularly Creoles, computational research on Chavacano remains limited due to the scarcity of available corpora. This paper presents ChavacanoMT, a benchmark corpus developed to support machine translation research on Philippine Creole Spanish. ChavacanoMT comprises 767,053 parallel sentences between Chavacano and its related languages: Spanish, Cebuano, Hiligaynon, Tagalog, and English. The corpus was constructed using data scraped from Bible translations and articles published on the Jehovah’s Witnesses website. This paper also presents the performance of a multilingual neural machine translation model trained on ChavacanoMT. We report an overall BLEU score of 17 for a fine-tuned mT5 model, which outperforms an mT5-based model trained from scratch. Our experiments show that ChavacanoMT can produce models on par with similar systems that translate between English and some Philippine languages, despite using fewer sentence samples in training. We also report improved translation between Chavacano and its related languages, which can serve as benchmark results. In particular, we highlight an improvement of more than 13 BLEU points in translation from Chavacano to English. The study opens avenues for exploring cross-linguistic interactions between Chavacano and its related languages in translation, which may benefit other low-resource languages.

Keywords:

  • Philippine Creole Spanish
  • Chavacano translation corpus
  • Multilingual translation
  • Chavacano