A Visual-Acoustic Modeling Framework for Robust Dysarthric Speech Recognition Using Synthetic Visual Augmentation and Transfer Learning

P. Hemalatha; K. Vinay Kumar; Uppalapati Srilakshmi; Putta Brundavani

doi:10.5750/sijme.v167iA2(S).2498

Published: Jul 30, 2025

Keywords:

Dysarthric Speech Recognition; Visual-Acoustic Modeling; Speech Vision (SV); Data Augmentation; Generative Adversarial Networks (GANs); Transfer Learning; Spectrogram-Based ASR; Phoneme Variability; UA-Speech Dataset; Inclusive Speech Technology

P. Hemalatha

Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai, India

Dr. K. Vinay Kumar

Department of CSE (AI&ML), Kakatiya Institute of Technology and Science, Hanamkonda, Koukonda, Warangal, Telangana - 506015, India.

Dr. Uppalapati Srilakshmi

Department of CSE, Sridevi Womens Engineering College, Vattinagulapally, Gachibowli, Hyderabad, India.

Dr. Putta Brundavani

Department of ECE, RSR Engineering College, Kavali, SPSR Nellore - 524142, Andhra Pradesh, India.

Abstract

Dysarthria is a motor speech disorder that impairs control of the muscles used for speech, which seriously limits a person's ability to communicate and to interact with digital systems. Automatic Speech Recognition (ASR) systems have improved tremendously but remain limited on dysarthric speech, especially in severe cases, where phonemes cannot be characterized consistently. These systems also suffer from insufficient training data and unintuitive phoneme labelling. We present Speech Vision (SV), a visual-acoustic modelling technique for a dysarthria-targeted ASR system. Rather than depending on audio alone, SV transforms speech into visual spectrogram representations and trains deep neural networks to recognize the shape of each phoneme rather than model its variability when spoken. This reduces the reliance on traditional acoustic phoneme modelling when addressing the central challenges of dysarthric speech. Specifically, to address data scarcity, SV uses visual data augmentation, producing synthesized dysarthric spectrograms with Generative Adversarial Networks (GANs) and time-frequency distortions. Moreover, transfer learning is applied to adapt models pre-trained on healthy speech to dysarthric speech, for greater robustness and generalization. We compare SV against existing systems (DeepSpeech, DysarthricGAN-ASR, and Transfer-ASR) on the UA-Speech dataset. For 67% of the speakers, SV increased recognition accuracy by an average of 18.5%, with a significant reduction in average Word Error Rate (WER), particularly for severe dysarthria. By combining visual learning, synthetic augmentation, and transfer learning in a single pipeline, SV offers a new approach to dysarthric ASR and could make ASR more accessible to speech-impaired populations.
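
As a rough illustration of three ideas named in the abstract (a spectrogram representation of speech, a time-frequency distortion used for augmentation, and the Word Error Rate metric), the following Python sketch may help. It is not the authors' implementation: the function names, the frame length, hop size, and mask widths are all assumptions chosen for illustration, the masking is a generic SpecAugment-style distortion swapped in for whatever distortions SV actually uses, and the GAN-based synthesis and transfer-learning stages are not shown.

```python
import numpy as np


def log_spectrogram(signal, frame_len=400, hop=160):
    """Log-magnitude spectrogram via a short-time Fourier transform.

    frame_len and hop are illustrative assumptions (25 ms / 10 ms at 16 kHz).
    Returns an array of shape (frequency_bins, time_frames).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (time, frequency)
    return np.log(magnitude + 1e-8).T  # (frequency, time)


def mask_time_frequency(spec, max_freq_width=8, max_time_width=10, rng=None):
    """Hypothetical distortion: blank one random frequency band and one time span."""
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    fill = spec.min()  # masked cells take the quietest value in the image
    n_freq, n_time = spec.shape

    f_width = int(rng.integers(0, min(max_freq_width, n_freq) + 1))
    f_start = int(rng.integers(0, n_freq - f_width + 1))
    out[f_start : f_start + f_width, :] = fill

    t_width = int(rng.integers(0, min(max_time_width, n_time) + 1))
    t_start = int(rng.integers(0, n_time - t_width + 1))
    out[:, t_start : t_start + t_width] = fill
    return out


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,  # deletion
                    cur[j - 1] + 1,  # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

In this sketch, each call to `mask_time_frequency` on the same spectrogram yields a different distorted copy, which is one simple way a small set of recordings could be expanded before training an image-style classifier.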

Issue

Vol. 167 No. A2(S) (2025): Technologies and Their Effects on Real-Time Social Development

Section

Articles

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

