Comparing and Contrasting Forced Aligned Acoustic Models for Vernacular Speech

Abstract

Research into dialect variation has been streamlined by the introduction of forced alignment software. This study evaluates the accuracy of the Montreal Forced Aligner (MFA) when analysing a non-mainstream dialect. To do so, alignments produced with a pretrained acoustic model were compared to hand-aligned data. The pretrained model was trained on English speakers from LibriSpeech, an open-source corpus of 1000 hours of speech. Four interviews with speakers from the Appalachian region of the United States were run through the MFA with the pretrained model. The speakers were chosen to vary in demographic characteristics such as age and gender. The four aligned files were compared to hand-aligned versions of the same files to assess the model’s accuracy on a non-mainstream dialect. Analysis of the original alignments of the four files revealed no systematic issues. Across speakers, error rates ranged from 0.54% to 1.3%. Errors occurred primarily when a word was out of vocabulary or when the transcript contained errors. While most of the errors were minor, there were instances, specifically at boundaries, where the alignment was off by multiple seconds. However, the aligner performed better than expected on overlapping speech and on productions that differed significantly from the training data. A researcher can therefore be confident that the MFA will perform at a usable level for Appalachian English, given high-quality recordings.
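The comparison of automatic and hand-aligned files described above can be illustrated with a minimal sketch. It assumes each alignment has already been read into a list of (label, start, end) tuples in seconds, and that an interval counts as an error when either boundary is displaced by more than a tolerance. The 20 ms tolerance, the function names, and this error definition are assumptions made for illustration; the abstract does not state how errors were counted in the study.

```python
from typing import List, Tuple

# (label, start_sec, end_sec); hypothetical in-memory form of one aligned tier
Interval = Tuple[str, float, float]


def boundary_errors(
    auto: List[Interval], hand: List[Interval], tolerance: float = 0.02
) -> List[Tuple[str, float]]:
    """Return (label, largest boundary offset) for intervals beyond tolerance.

    Assumes both tiers contain the same labels in the same order.
    """
    if len(auto) != len(hand):
        raise ValueError("tiers must contain the same number of intervals")
    flagged = []
    for (a_label, a_start, a_end), (h_label, h_start, h_end) in zip(auto, hand):
        if a_label != h_label:
            raise ValueError(f"label mismatch: {a_label!r} vs {h_label!r}")
        # Larger of the start-boundary and end-boundary displacements
        offset = max(abs(a_start - h_start), abs(a_end - h_end))
        if offset > tolerance:
            flagged.append((a_label, offset))
    return flagged


def error_rate(
    auto: List[Interval], hand: List[Interval], tolerance: float = 0.02
) -> float:
    """Percentage of intervals whose boundaries exceed the tolerance."""
    if not hand:
        raise ValueError("hand-aligned tier is empty")
    return 100.0 * len(boundary_errors(auto, hand, tolerance)) / len(hand)
```

Calling `error_rate(auto, hand)` on one speaker's pair of tiers gives a per-speaker percentage, and the list returned by `boundary_errors` shows which words to inspect by hand, for example those with offsets of a second or more.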