Multimodal Speech Processing Using Asynchronous Hidden Markov Models. Bengio, S. *Information Fusion*, 5(2):81–89, 2004.

This paper advocates that for some multimodal tasks involving more than one stream of data representing the same sequence of events, it might sometimes be a good idea to be able to *desynchronize* the streams in order to maximize their joint likelihood. We thus present a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same sequence of events. An Expectation-Maximization algorithm to train the model is presented, as well as a Viterbi decoding algorithm, which can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model was tested on two audio-visual speech processing tasks, namely speech recognition and text-dependent speaker verification, both using the M2VTS database. Robust performances under various noise conditions were obtained in both cases.

@article{bengio:2004:if,
  author = {S. Bengio},
  title = {Multimodal Speech Processing Using Asynchronous Hidden Markov Models},
  journal = {Information Fusion},
  volume = 5,
  number = 2,
  pages = {81--89},
  year = 2004,
  url = {publications/ps/bengio_2004_if.ps.gz},
  pdf = {publications/pdf/bengio_2004_if.pdf},
  djvu = {publications/djvu/bengio_2004_if.djvu},
  original = {2004/ahmm_if/IF02B04-RSP},
  web = {http://dx.doi.org/10.1016/j.inffus.2003.04.001},
  topics = {speech,multimodal,biometric_authentication},
  abstract = {This paper advocates that for some multimodal tasks involving more than one stream of data representing the same sequence of events, it might sometimes be a good idea to be able to {\em desynchronize} the streams in order to maximize their joint likelihood. We thus present a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same sequence of events. An Expectation-Maximization algorithm to train the model is presented, as well as a Viterbi decoding algorithm, which can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model was tested on two audio-visual speech processing tasks, namely speech recognition and text-dependent speaker verification, both using the M2VTS database. Robust performances under various noise conditions were obtained in both cases.},
  categorie = {A}
}

