Multimodal Speech Processing Using Asynchronous Hidden Markov Models. Bengio, S. Information Fusion, 5(2):81–89, 2004.
This paper advocates that for some multimodal tasks involving more than one stream of data representing the same sequence of events, it might sometimes be a good idea to be able to desynchronize the streams in order to maximize their joint likelihood. We thus present a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same sequence of events. An Expectation-Maximization algorithm to train the model is presented, as well as a Viterbi decoding algorithm, which can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model was tested on two audio-visual speech processing tasks, namely speech recognition and text-dependent speaker verification, both using the M2VTS database. Robust performance under various noise conditions was obtained in both cases.
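
The asynchronous HMM summarized in the abstract above can be pictured with a small dynamic-programming sketch. The Python snippet below is a minimal illustration, not the paper's exact formulation: it runs a Viterbi-style search over a lattice of HMM states and positions in the two streams, where at each audio frame the model either emits the audio observation alone or emits audio and video jointly and advances the video index. The function name async_viterbi, the emission-score arrays, the epsilon (video-emission) probability, and the uniform initial-state assumption are all illustrative placeholders.

# A minimal sketch (assumptions, not the paper's exact algorithm) of Viterbi
# decoding for an asynchronous HMM over two streams: x (e.g. audio, length T)
# and y (e.g. video, length S <= T). At each audio frame the model either emits
# the audio observation alone, or emits audio and video jointly and advances
# the video index. A uniform initial-state distribution is assumed for brevity.
import numpy as np

def async_viterbi(log_trans, log_emit_x, log_emit_xy, log_eps):
    """Joint Viterbi over HMM states and the audio/video alignment.

    log_trans[i, j]      : log P(q_t = j | q_{t-1} = i)
    log_emit_x[t, j]     : log p(x_t | q_t = j)          (audio-only emission)
    log_emit_xy[t, s, j] : log p(x_t, y_s | q_t = j)     (joint emission)
    log_eps[j]           : log P(emit a video frame | q_t = j)
    Returns the best joint log-likelihood, the state path, and a 0/1 alignment
    marking which audio frames were paired with a video frame.
    """
    T, S, Q = log_emit_xy.shape
    log_no = np.log1p(-np.exp(log_eps))           # log P(no video emission)
    delta = np.full((T, S + 1, Q), -np.inf)       # delta[t, s, j]: best score so far
    back = np.zeros((T, S + 1, Q, 3), dtype=int)  # (prev_s, prev_state, emitted_video)

    # t = 0: emit audio only (s stays 0) or audio + video (s becomes 1).
    delta[0, 0] = log_no + log_emit_x[0]
    if S >= 1:
        delta[0, 1] = log_eps + log_emit_xy[0, 0]

    for t in range(1, T):
        for s in range(S + 1):
            for j in range(Q):
                # Case 1: keep the video index, emit the audio frame alone.
                cand1 = delta[t - 1, s] + log_trans[:, j]
                best1 = cand1.max() + log_no[j] + log_emit_x[t, j]
                # Case 2: advance the video index, emit audio and video jointly.
                best2, cand2 = -np.inf, None
                if s >= 1:
                    cand2 = delta[t - 1, s - 1] + log_trans[:, j]
                    best2 = cand2.max() + log_eps[j] + log_emit_xy[t, s - 1, j]
                if best1 >= best2:
                    delta[t, s, j] = best1
                    back[t, s, j] = (s, int(cand1.argmax()), 0)
                else:
                    delta[t, s, j] = best2
                    back[t, s, j] = (s - 1, int(cand2.argmax()), 1)

    # Require every video frame to have been consumed by the last audio frame.
    j = int(delta[T - 1, S].argmax())
    best, s = delta[T - 1, S, j], S
    path, align = [], []
    for t in range(T - 1, 0, -1):
        path.append(j)
        prev_s, prev_j, emitted = back[t, s, j]
        align.append(int(emitted))
        s, j = prev_s, prev_j
    path.append(j)
    align.append(1 if s == 1 else 0)              # was a video frame emitted at t = 0?
    return best, path[::-1], align[::-1]

# Toy usage with random scores: 6 audio frames, 4 video frames, 3 states.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, S, Q = 6, 4, 3
    log_trans = np.log(rng.dirichlet(np.ones(Q), size=Q))
    log_emit_x = np.log(rng.uniform(0.1, 1.0, (T, Q)))
    log_emit_xy = np.log(rng.uniform(0.1, 1.0, (T, S, Q)))
    log_eps = np.log(np.full(Q, 0.5))
    print(async_viterbi(log_trans, log_emit_x, log_emit_xy, log_eps))

The returned alignment marks which audio frames were paired with a video frame, which corresponds to the stream desynchronization the abstract refers to; the paper additionally covers EM training of the model parameters, which this sketch does not attempt.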
@article{bengio:2004:if,
  author = {S. Bengio},
  title = {Multimodal Speech Processing Using Asynchronous Hidden Markov Models},
  journal = {Information Fusion},
  volume = 5,
  number = 2,
  pages = {81--89},
  year = 2004,
  url = {publications/ps/bengio_2004_if.ps.gz},
  pdf = {publications/pdf/bengio_2004_if.pdf},
  djvu = {publications/djvu/bengio_2004_if.djvu},
  original = {2004/ahmm_if/IF02B04-RSP},
  web = {http://dx.doi.org/10.1016/j.inffus.2003.04.001},
  topics = {speech,multimodal,biometric_authentication},
  abstract = {This paper advocates that for some multimodal tasks involving more than one stream of data representing the same sequence of events, it might sometimes be a good idea to be able to {\em desynchronize} the streams in order to maximize their joint likelihood.  We thus present a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same sequence of events.  An Expectation-Maximization algorithm to train the model is presented, as well as a Viterbi decoding algorithm, which can be used to obtain the optimal state sequence as well as the alignment between the two sequences.  The model was tested on two audio-visual speech processing tasks, namely speech recognition and text-dependent speaker verification, both using the M2VTS database. Robust performances under various noise conditions were obtained in both cases.},
  categorie = {A}
}
