Enhancing African low-resource languages: Swahili data for language modelling

Shikali, Casper S.; Mokhosi, Refuoe

Please use this identifier to cite or link to this item: https://repository.seku.ac.ke/handle/123456789/7530

Title:	Enhancing African low-resource languages: Swahili data for language modelling
Authors:	Shikali, Casper S.; Mokhosi, Refuoe
Keywords:	Natural language processing; Deep learning; Language modelling; Unannotated data; Word analogy; Syllables; Neural networks
Issue Date:	Aug-2020
Publisher:	Elsevier
Citation:	Data in Brief, Volume 31, 105951, August 2020
Abstract:	Language modelling using neural networks requires adequate data to guarantee quality word representations, which are important for natural language processing (NLP) tasks. However, African languages, Swahili in particular, have been disadvantaged, and most of them are classified as low-resource languages because of inadequate data for NLP. In this article, we derive and contribute an unannotated Swahili dataset, a Swahili syllabic alphabet and a Swahili word analogy dataset to address the need for language processing resources, especially for low-resource languages. We derive the unannotated Swahili dataset by pre-processing raw Swahili data using a Python script, formulate the syllabic alphabet, and develop the Swahili word analogy dataset based on an existing English dataset. We envisage that the datasets will support not only language models but also other downstream NLP tasks such as part-of-speech tagging, machine translation and sentiment analysis.
Description:	https://doi.org/10.1016/j.dib.2020.105951
URI:	https://www.sciencedirect.com/science/article/pii/S2352340920308453 http://repository.seku.ac.ke/xmlui/handle/123456789/7530
ISSN:	2352-3409
Appears in Collections:	School of Science and Computing (JA)

Files in This Item:

File	Description	Size	Format
Shikali_Enhancing African low-resource languages....pdf	Abstract	3.59 kB	Adobe PDF
