Pitch prediction from Mel-generalized cepstrum — a computationally efficient pitch modeling approach for speech synthesis. Rao, M. V. A. & Ghosh, P. K. In 2017 25th European Signal Processing Conference (EUSIPCO), pages 1629-1633, Aug, 2017.
Text-to-speech (TTS) systems are often used as part of the user interface in wearable devices. Due to limited memory and computational/battery power in wearable devices, it could be useful to have a TTS system which requires less memory and is less computationally intensive. Conventional speech synthesis systems have separate modeling for pitch (F0-model) and spectral representation, namely Mel generalized coefficients (MGC) (MGC-model). In this paper we estimate pitch from the MGC estimated using the MGC-model instead of having a separate F0-model. Pitch is obtained from the estimated MGC using a statistical mapping through a Gaussian mixture model (GMM). Experiments using the CMU-ARCTIC database demonstrate that the proposed GMM-based F0-model, even with a single mixture, results in no significant loss in the naturalness of the synthesized speech, while the proposed F0-model, in addition to reducing computational complexity, results in a ~93% reduction in the number of parameters compared to that of the F0-model.
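The statistical mapping described in the abstract can be illustrated with a minimal sketch. It assumes the single-mixture case, where a GMM over joint [MGC, F0] vectors reduces to one Gaussian and the predicted pitch is the conditional mean of F0 given the MGC vector. The function names, the 4-dimensional features, and the synthetic data are all hypothetical stand-ins; the paper's exact formulation, feature dimension, and training setup are not given here, and NumPy is assumed to be available.

```python
import numpy as np

def fit_joint_gaussian(mgc, f0):
    """Fit one Gaussian to joint [MGC, F0] vectors (a GMM with a single mixture)."""
    joint = np.column_stack([mgc, f0])      # shape (frames, D + 1)
    mean = joint.mean(axis=0)
    cov = np.cov(joint, rowvar=False)       # full covariance, shape (D + 1, D + 1)
    return mean, cov

def predict_f0(mgc, mean, cov):
    """Return the conditional mean E[F0 | MGC] under the joint Gaussian."""
    d = mgc.shape[1]
    mu_x, mu_y = mean[:d], mean[d]
    cov_xx, cov_xy = cov[:d, :d], cov[:d, d]
    # Solve cov_xx @ w = cov_xy instead of forming an explicit inverse.
    weights = np.linalg.solve(cov_xx, cov_xy)
    return mu_y + (mgc - mu_x) @ weights

# Synthetic stand-in data (hypothetical): F0 is a noisy linear function of MGC.
rng = np.random.default_rng(0)
mgc_train = rng.normal(size=(500, 4))
true_w = np.array([0.5, -0.2, 0.1, 0.3])
f0_train = 5.0 + mgc_train @ true_w + 0.01 * rng.normal(size=500)

mean, cov = fit_joint_gaussian(mgc_train, f0_train)
f0_pred = predict_f0(mgc_train, mean, cov)
```

With more than one mixture, the prediction would instead be a posterior-weighted sum of such per-component conditional means; that extension is omitted from this sketch.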
@InProceedings{8081485,
  author = {M. V. A. Rao and P. K. Ghosh},
  booktitle = {2017 25th European Signal Processing Conference (EUSIPCO)},
  title = {Pitch prediction from Mel-generalized cepstrum --- a computationally efficient pitch modeling approach for speech synthesis},
  year = {2017},
  pages = {1629--1633},
  abstract = {Text-to-speech (TTS) systems are often used as part of the user interface in wearable devices. Due to limited memory and computational/battery power in wearable devices, it could be useful to have a TTS system which requires less memory and is less computationally intensive. Conventional speech synthesis systems have separate modeling for pitch (F0-model) and spectral representation, namely Mel generalized coefficients (MGC) (MGC-model). In this paper we estimate pitch from the MGC estimated using the MGC-model instead of having a separate F0-model. Pitch is obtained from the estimated MGC using a statistical mapping through a Gaussian mixture model (GMM). Experiments using the CMU-ARCTIC database demonstrate that the proposed GMM-based F0-model, even with a single mixture, results in no significant loss in the naturalness of the synthesized speech, while the proposed F0-model, in addition to reducing computational complexity, results in a ~93% reduction in the number of parameters compared to that of the F0-model.},
  keywords = {cepstral analysis;computational complexity;Gaussian processes;mixture models;signal representation;speech processing;speech synthesis;speech-based user interfaces;pitch prediction;Mel-generalized cepstrum;computationally efficient pitch modeling approach;user interface;wearable devices;TTS system;spectral representation;Mel generalized coefficients;MGC-model;estimated MGC;Gaussian mixture model;GMM based F0-model;F0-model;text-to-speech synthesis systems;statistical mapping;CMU-ARCTIC database;pitch estimation;Hidden Markov models;High-temperature superconductors;Speech;Computational modeling;Speech synthesis;Training;Covariance matrices},
  doi = {10.23919/EUSIPCO.2017.8081485},
  issn = {2076-1465},
  month = {Aug},
}
