Product HMMs for audio-visual continuous speech recognition using facial animation parameters. Aleksic, P. S. & Katsaggelos, A. K. In 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698), volume 2, pages II–481, 2003. IEEE.
The use of visual information in addition to acoustic information can improve automatic speech recognition. In this paper we compare different approaches to audio-visual information integration and show how they affect automatic speech recognition performance. We utilize facial animation parameters (FAPs), supported by the MPEG-4 standard for visual representation, as visual features. We use both single-stream and multi-stream hidden Markov models (HMMs) to integrate audio and visual information. We performed both state-synchronous and phone-synchronous multi-stream integration. A product HMM topology is used to model the phone-synchronous integration. ASR experiments were performed under noisy audio conditions using a relatively large-vocabulary (approximately 1000 words) audio-visual database. The proposed phone-synchronous system, which performed best, reduces the word error rate (WER) by approximately 20% relative to audio-only ASR (A-ASR) WERs at various SNRs with additive white Gaussian noise.
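As background on the integration schemes mentioned in the abstract, a state-synchronous multi-stream HMM is commonly written with an emission likelihood that factors over the audio (A) and visual (V) streams. This is an illustrative sketch in standard multi-stream notation (e.g. the HTK convention), not necessarily the paper's exact parameterization:

b_j(\mathbf{o}_t) = \prod_{s \in \{A,V\}} \left[ \sum_{m=1}^{M_s} c_{jsm}\, \mathcal{N}\!\left(\mathbf{o}_{st};\, \boldsymbol{\mu}_{jsm}, \boldsymbol{\Sigma}_{jsm}\right) \right]^{\gamma_s}, \qquad \gamma_A + \gamma_V = 1,

where the stream exponents \gamma_A, \gamma_V weight each modality's reliability (commonly constrained to sum to one and tuned to the SNR), and c_{jsm}, \boldsymbol{\mu}_{jsm}, \boldsymbol{\Sigma}_{jsm} are the mixture weights, means, and covariances of state j in stream s. In the phone-synchronous product HMM, by contrast, each composite state corresponds to a pair of audio and visual states, so the two streams may desynchronize within a phone but must realign at phone boundaries.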
@inproceedings{aleksic2003product,
abstract = {The use of visual information in addition to acoustic information can improve automatic speech recognition. In this paper we compare different approaches to audio-visual information integration and show how they affect automatic speech recognition performance. We utilize facial animation parameters (FAPs), supported by the MPEG-4 standard for visual representation, as visual features. We use both single-stream and multi-stream hidden Markov models (HMMs) to integrate audio and visual information. We performed both state-synchronous and phone-synchronous multi-stream integration. A product HMM topology is used to model the phone-synchronous integration. ASR experiments were performed under noisy audio conditions using a relatively large-vocabulary (approximately 1000 words) audio-visual database. The proposed phone-synchronous system, which performed best, reduces the word error rate (WER) by approximately 20% relative to audio-only ASR (A-ASR) WERs at various SNRs with additive white Gaussian noise.},
author = {Aleksic, P.S. and Katsaggelos, A.K.},
booktitle = {2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698)},
doi = {10.1109/ICME.2003.1221658},
isbn = {0-7803-7965-9},
issn = {1945-788X},
pages = {II--481},
publisher = {IEEE},
title = {{Product HMMs for audio-visual continuous speech recognition using facial animation parameters}},
url = {http://ieeexplore.ieee.org/document/1221658/},
volume = {2},
year = {2003}
}
