Improving Speech Recognition for Arabic language Using Low Amounts of Labeled Data
Linköping University, Department of Computer and Information Science.
2021 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

The importance of Automatic Speech Recognition (ASR) systems, whose job is to generate text from audio, is increasing as the number of applications of these systems grows rapidly. Training an ASR system, however, is difficult and tedious, largely because of the lack of training data: ASR systems require huge amounts of annotated training data, consisting of audio files paired with accurately written transcripts. Such annotated (labeled) data is very hard to find for most languages; producing it usually requires manual annotation, which, apart from its monetary cost, is error-prone. A purely supervised training task is therefore impractical in this scenario.

Arabic is one of the languages that lack an abundance of labeled data, which makes the accuracy of its ASR systems very low compared to resource-rich languages such as English, French, or Spanish. In this research, we take advantage of unlabeled voice data by learning general data representations from unlabeled training data (audio files only) in a self-supervised task, or pre-training phase. This phase uses the wav2vec 2.0 framework, which masks the input in the latent space and solves a contrastive task. The model is then fine-tuned on a small amount of labeled data. We also exploit models that have been pre-trained with wav2vec 2.0 on other languages, fine-tuning them on annotated Arabic data.
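To make the fine-tuning stage concrete, here is a minimal sketch assuming the HuggingFace transformers port of wav2vec 2.0 / XLSR; the thesis does not specify this toolchain, and the checkpoint name and vocab.json path are illustrative assumptions, not details from the thesis.

# Sketch of fine-tuning a cross-lingual wav2vec 2.0 (XLSR) checkpoint on
# Arabic, assuming the HuggingFace "transformers" implementation. The
# vocab.json file (character vocabulary built from the pre-processed
# Arabic transcripts) is a hypothetical input.
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2Processor, Wav2Vec2ForCTC)

tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")

# wav2vec 2.0 consumes raw 16 kHz waveforms, so the feature extractor only
# normalizes the signal; no hand-crafted features (e.g. MFCCs) are needed.
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor,
                              tokenizer=tokenizer)

# Load a cross-lingual checkpoint (pre-trained on raw speech in 53 languages)
# and attach a randomly initialized CTC head sized for the Arabic vocabulary.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

# The convolutional feature encoder was trained during the self-supervised
# phase and is typically kept frozen while fine-tuning on labeled data.
model.freeze_feature_extractor()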

We show that using the wav2vec 2.0 framework for pre-training on Arabic is considerably time- and resource-consuming: it took the model 21.5 days (about 3 weeks) to complete 662 epochs and reach a validation accuracy of 58%. Arabic is a right-to-left (RTL) language with many diacritics that indicate how letters should be pronounced; these two features make it difficult to fit Arabic into these models, as the transcript files require heavy pre-processing. We demonstrate that we can fine-tune a cross-lingual model, trained on raw waveforms of speech in multiple languages, on Arabic data and reach a word error rate (WER) as low as 36.53%. We also show that by tuning the model parameters we can increase the accuracy and thus decrease the word error rate from 54.00% to 36.69%.
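The two recurring steps in this paragraph, stripping diacritics from Arabic transcripts and scoring predictions by word error rate, can be sketched as below. This is a minimal illustration under stated assumptions: the diacritic set is taken to be the common harakat in the Unicode range U+064B to U+0652 (the thesis may normalize additional marks), and WER is the standard word-level edit distance divided by the reference length.

import re

# Common Arabic diacritics (tanwin, fatha, damma, kasra, shadda, sukun)
# occupy U+064B through U+0652 in the Arabic Unicode block.
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    """Remove pronunciation marks so transcripts fit a small character vocabulary."""
    return DIACRITICS.sub("", text)

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    # (substitutions, insertions, deletions all cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

A WER of 36.53% then means that roughly one word in three in the model's output differs from the reference transcript by a substitution, insertion, or deletion.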

Place, publisher, year, edition, pages
2021, p. 37
Series
LIU-IDA/STAT-A--21/045--SE
Keywords [en]
Arabic Language, Speech Recognition, ASR, Signal Processing, wav2vec, XLSR
National Category
Signal Processing
Identifiers
URN: urn:nbn:se:liu:diva-176437
OAI: oai:DiVA.org:liu-176437
DiVA id: diva2:1567188
External cooperation
DigitalTolk
Presentation
2021-06-02, 09:20 (English)
Available from: 2021-06-18. Created: 2021-06-16. Last updated: 2021-06-18. Bibliographically approved.

Open Access in DiVA

fulltext (2001 kB)
File name: FULLTEXT01.pdf
File size: 2001 kB
Checksum (SHA-512): 13d450ffa45e3ee98ebd31ae7b2abfc45bcf6f04d7f1f60cc712f220a93e4da37aa00ab0e79baba2c09449c7160d0d8e7852f8741cc2c12a413eb5a152218179
Type: fulltext
Mimetype: application/pdf
