Product HMMs for audio-visual continuous speech recognition using facial animation parameters. Aleksic, P. S. & Katsaggelos, A. K. In 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698), volume 2, pages II–481, 2003. IEEE.
The use of visual information in addition to acoustic information can improve automatic speech recognition. In this paper we compare different approaches to audio-visual information integration and show how they affect automatic speech recognition performance. We utilize facial animation parameters (FAPs), supported by the MPEG-4 standard for visual representation, as visual features. We use both single-stream and multi-stream hidden Markov models (HMMs) to integrate audio and visual information. We performed both state-synchronous and phone-synchronous multi-stream integration. A product HMM topology is used to model the phone-synchronous integration. ASR experiments were performed under noisy audio conditions using a relatively large-vocabulary (approximately 1000 words) audio-visual database. The proposed phone-synchronous system, which performed best, reduces the word error rate (WER) by approximately 20% relative to audio-only ASR (A-ASR) WERs at various SNRs with additive white Gaussian noise.
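As background on the integration schemes mentioned in the abstract, a state-synchronous multi-stream HMM is commonly written with an emission likelihood that factors over the audio (A) and visual (V) streams. This is an illustrative sketch in standard multi-stream notation (e.g. the HTK convention), not necessarily the paper's exact parameterization:

b_j(\mathbf{o}_t) = \prod_{s \in \{A,V\}} \left[ \sum_{m=1}^{M_s} c_{jsm}\, \mathcal{N}\!\left(\mathbf{o}_{st};\, \boldsymbol{\mu}_{jsm}, \boldsymbol{\Sigma}_{jsm}\right) \right]^{\gamma_s}, \qquad \gamma_A + \gamma_V = 1,

where the stream exponents \gamma_A, \gamma_V weight each modality's reliability (commonly constrained to sum to one and tuned to the SNR), and c_{jsm}, \boldsymbol{\mu}_{jsm}, \boldsymbol{\Sigma}_{jsm} are the mixture weights, means, and covariances of state j in stream s. In the phone-synchronous product HMM, by contrast, each composite state corresponds to a pair of audio and visual states, so the two streams may desynchronize within a phone but must realign at phone boundaries.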
@inproceedings{aleksic2003product,
abstract = {The use of visual information in addition to acoustic information can improve automatic speech recognition. In this paper we compare different approaches to audio-visual information integration and show how they affect automatic speech recognition performance. We utilize facial animation parameters (FAPs), supported by the MPEG-4 standard for visual representation, as visual features. We use both single-stream and multi-stream hidden Markov models (HMMs) to integrate audio and visual information. We performed both state-synchronous and phone-synchronous multi-stream integration. A product HMM topology is used to model the phone-synchronous integration. ASR experiments were performed under noisy audio conditions using a relatively large-vocabulary (approximately 1000 words) audio-visual database. The proposed phone-synchronous system, which performed best, reduces the word error rate (WER) by approximately 20% relative to audio-only ASR (A-ASR) WERs at various SNRs with additive white Gaussian noise.},
author = {Aleksic, P.S. and Katsaggelos, A.K.},
booktitle = {2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698)},
doi = {10.1109/ICME.2003.1221658},
isbn = {0-7803-7965-9},
issn = {1945-788X},
pages = {II--481},
publisher = {IEEE},
title = {{Product HMMs for audio-visual continuous speech recognition using facial animation parameters}},
url = {http://ieeexplore.ieee.org/document/1221658/},
volume = {2},
year = {2003}
}
