CebQA: Cebuano Question Answering System

Lead Researcher(s): Jhoanna Rica T. Lagumbay and Robert R. Roxas
Status: Published

Abstract/summary: A question answering (QA) system answers queries given a corpus of natural language documents. QA systems have seen significant advancements across various languages, such as Arabic and Amharic, using transformer-based models. There remains, however, a notable gap for Cebuano, a widely spoken language in the Philippines. One major barrier is the absence of a publicly available Cebuano QA dataset. This study addresses the gap with a three-fold contribution: (1) a pseudonymization technique tailored to Cebuano texts to preserve privacy in news-based datasets, (2) the construction of the Cebuano Question Answering Dataset (CebQuAD), the first Cebuano QA dataset, and (3) the development of the Cebuano Question Answering (CebQA) system, an end-to-end QA system. To build CebQuAD, Cebuano news articles were collected and pseudonymized to protect personal identities. Question-answer pairs were generated using GPT-4o mini, validated by Cebuano speakers, and split into training, testing, and validation sets. The CebQA system uses a retriever-reader architecture: Elasticsearch/BM25 and FAISS/DPR handle indexing and retrieval, and a fine-tuned XLM-RoBERTa model handles answer extraction. Results show that BM25 achieved the highest retrieval accuracy, while the best reader attained an F1 score of 79.22. The end-to-end system achieved an F1 score of 49.50 at k = 1, which is consistent with the retriever’s 63% accuracy and supports the viability of the CebQA system as the first functional end-to-end QA system for the Cebuano language.
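
To make the retriever stage concrete, the following is a minimal sketch of BM25 ranking and top-k retrieval accuracy in plain Python. It is not the study's implementation, which relied on Elasticsearch. The whitespace tokenizer, the toy English passages, the k1 and b values, and the names BM25, top_k, and retrieval_accuracy are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, not the study's analyzer.
    return text.lower().split()


class BM25:
    """Illustrative BM25 scorer (Lucene-style IDF); hypothetical helper class."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_len = [len(t) for t in self.doc_tokens]
        self.avg_len = sum(self.doc_len) / len(docs)
        self.tf = [Counter(t) for t in self.doc_tokens]
        # Document frequency: number of passages containing each term.
        df = Counter()
        for tokens in self.doc_tokens:
            df.update(set(tokens))
        n = len(docs)
        self.idf = {
            term: math.log(1 + (n - f + 0.5) / (f + 0.5)) for term, f in df.items()
        }

    def score(self, query, i):
        total = 0.0
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f == 0:
                continue
            # Length normalization penalizes passages longer than average.
            norm = f + self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avg_len)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=1):
        ranked = sorted(
            range(len(self.doc_tokens)),
            key=lambda i: self.score(query, i),
            reverse=True,
        )
        return ranked[:k]


def retrieval_accuracy(bm25, questions, gold_ids, k=1):
    # Fraction of questions whose gold passage appears among the top-k results.
    hits = sum(gold in bm25.top_k(q, k) for q, gold in zip(questions, gold_ids))
    return hits / len(questions)


# Toy placeholder passages and questions; not drawn from CebQuAD.
docs = [
    "the mayor opened a new public market in the city",
    "heavy rain flooded several roads on monday",
    "the university team won the regional basketball final",
]
questions = ["who opened the new market", "what flooded the roads"]
gold_ids = [0, 1]

bm25 = BM25(docs)
accuracy_at_1 = retrieval_accuracy(bm25, questions, gold_ids, k=1)
```

In a retriever-reader pipeline, the reader can only extract an answer from the passages the retriever returns, so retrieval accuracy at a given k effectively caps end-to-end performance at that k.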

Keywords:

Natural Language Processing
NLP
Question Answering
Pseudonymization
GPT-4o mini
XLM-R
BM25
Cebuano DPR
Cebuano Question Answering Dataset
Cebuano Language

Downloadable PDF