Effects of Dataset Size on the Accuracy of Dialects Classification Models

O. K. Adejumobi1, A. I. O. Yussuff1 and A. A. Adenowo1

Published Paper

Nigeria
Page: 545-561
Published on: 2024 March

This paper determined the effects of dataset size on the accuracy of dialect classification models. An experimental methodology was adopted in which two (2) datasets, A and B, of different sizes were used. Dataset A has a total of 500 samples (100 samples for each class), while Dataset B has a total of 7000 samples (1400 samples for each class). Both datasets were divided into 70% for network training, 20% for validation and 10% for prediction. The datasets contain audio samples of the Egba, Ekiti, Ibadan, Ijebu and Ondo dialects collected from participants via mobile phones, radio and sound recorders. A Convolutional Neural Network (CNN) classifier was developed. The research was carried out in four (4) main stages, namely: speech signal acquisition, data pre-processing, speech data classification, and model training, testing and evaluation. The model was implemented on the MATLAB 2022b platform. With the same classifier, the results showed that the larger dataset B gave the better performance, with an accuracy of 100% for all the classes, while the smaller dataset A gave prediction accuracies of 98.8%, 98.2%, 96.8%, 95.1% and 97.4% for Egba, Ekiti, Ibadan, Ijebu and Ondo respectively. It is recommended, however, that the complexity of the model be considered before increasing the dataset size, to avoid under-fitting of the network.
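The 70/20/10 division can be checked with a little arithmetic. The following is a minimal Python sketch, not the authors' MATLAB implementation. It assumes the split was applied per class (stratified), which the abstract does not state. The names `DIALECTS`, `PER_CLASS`, `SPLIT_PERCENT` and `split_counts` are hypothetical and introduced here only for illustration.

```python
# Dialect labels and per-class sample counts as stated in the abstract.
DIALECTS = ["Egba", "Ekiti", "Ibadan", "Ijebu", "Ondo"]
PER_CLASS = {"A": 100, "B": 1400}

# Split fractions as integer percentages to avoid floating-point rounding.
SPLIT_PERCENT = {"training": 70, "validation": 20, "prediction": 10}


def split_counts(n_per_class):
    """Return per-class sample counts per partition (assumes a stratified split)."""
    counts = {name: n_per_class * pct // 100 for name, pct in SPLIT_PERCENT.items()}
    # Assign any remainder left by integer division to the training partition.
    counts["training"] += n_per_class - sum(counts.values())
    return counts


for name, n in PER_CLASS.items():
    per_class = split_counts(n)
    totals = {part: c * len(DIALECTS) for part, c in per_class.items()}
    print(f"Dataset {name}: per class {per_class}, total {totals}")
```

Under this assumption, Dataset A yields 70, 20 and 10 samples per class for training, validation and prediction, and Dataset B yields 980, 280 and 140, so the prediction accuracies for Dataset A rest on far fewer held-out samples per dialect than those for Dataset B.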
