Improving Aphasic Communication Using Multimodal AI Systems
DOI:
https://doi.org/10.9781/ijimai.2026.2215

Keywords:
Aphasia, ASR, HCI, Image Captioning, Multimodality

Abstract
Aphasia, often resulting from brain injuries, significantly impairs individuals’ language abilities, creating substantial challenges for verbal communication. Existing assistive technologies frequently fall short in addressing these specialised communication needs, underscoring the urgent demand for adaptive, intelligent support systems. This research proposes a dual approach: an Automatic Speech Recognition (ASR) module fine-tuned on aphasic speech, and a multimodal component that integrates visual context to infer the speaker’s intended meaning. The ASR system leverages fine-tuned versions of Whisper and Wav2Vec 2.0 on data from the AphasiaBank corpus. Results show a notable reduction in Word Error Rate (WER) when comparing base pre-trained ASR models with their fine-tuned versions, decreasing from 70.36% to 31.53% in a context-independent setting, and from 61.25% to 35.60% in a speaker-independent evaluation, demonstrating robustness across different scenarios. In contrast to the ASR module, the goal of the multimodal component is not to produce a literal word-by-word transcription, but rather to reconstruct the speaker’s communicative intent using contextual information. To evaluate this capability, we conducted a human study assessing the system’s ability to interpret what the speaker truly meant. The results confirmed that outputs combining visual cues with language model reasoning more reliably captured communicative intent than audio-only transcriptions.
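As a rough illustration of the evaluation the abstract describes (comparing a base Whisper checkpoint against a version fine-tuned on aphasic speech, measured by WER), the minimal sketch below scores a single clip with both models. It is not the authors' pipeline: the fine-tuned checkpoint name, the audio path, and the reference transcript are hypothetical placeholders, and AphasiaBank data is not distributed with this article.

```python
# Minimal sketch (not the authors' code): comparing a pre-trained and a
# fine-tuned Whisper checkpoint on one aphasic speech clip using WER,
# the metric reported in the abstract. The checkpoint name
# "my-org/whisper-small-aphasia", "clip.wav", and the reference transcript
# are hypothetical placeholders.
import torch
import torchaudio
from jiwer import wer
from transformers import WhisperProcessor, WhisperForConditionalGeneration


def transcribe(model_name: str, wav_path: str) -> str:
    """Load a Whisper checkpoint and transcribe a single mono clip."""
    processor = WhisperProcessor.from_pretrained(model_name)
    model = WhisperForConditionalGeneration.from_pretrained(model_name)
    waveform, sr = torchaudio.load(wav_path)
    if sr != 16000:  # Whisper expects 16 kHz input
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000,
                       return_tensors="pt")
    with torch.no_grad():
        ids = model.generate(inputs.input_features,
                             language="en", task="transcribe")
    return processor.batch_decode(ids, skip_special_tokens=True)[0]


reference = "the boy is climbing the ladder"  # gold transcript (example only)
baseline = transcribe("openai/whisper-small", "clip.wav")           # pre-trained
adapted = transcribe("my-org/whisper-small-aphasia", "clip.wav")    # hypothetical fine-tune
print(f"baseline WER:   {wer(reference, baseline):.2%}")
print(f"fine-tuned WER: {wer(reference, adapted):.2%}")
```

In a corpus-level evaluation such as the one summarised above, per-utterance scores of this kind would be aggregated over the appropriate held-out split (unseen speakers for the speaker-independent setting, unseen material for the context-independent setting) to obtain the reported WER figures.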