An evaluation of cost functions sensitively capturing local degradation of naturalness for segment selection in concatenative speech synthesis. Toda, T.; Kawai, H.; Tsuzaki, M.; and Shikano, K. Speech Communication, 48(1):45-56, January 2006.
In this paper, we evaluate various cost functions for selecting a segment sequence in terms of the correspondence between the cost and perceptual scores to the naturalness of synthetic speech. The results demonstrate that the conventional average cost, which shows the degradation of naturalness over the entire synthetic utterance, has better correspondence to the perceptual scores than the maximum cost, which shows the worst local degradation of naturalness. Furthermore, it is shown that root mean square (RMS) cost, which takes into account both the average cost and the maximum cost, has the best correspondence. We also show that the naturalness of synthetic speech can be improved by using the RMS cost for segment selection. Then, we investigate the effects of applying the RMS cost to segment selection in comparison to those of applying the average cost. Experimental results show that in segment selection based on the RMS cost, a larger number of concatenations causing slight local degradation are performed so that concatenations causing greater local degradation are avoided.
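The three utterance-level costs compared in the abstract can be illustrated with a minimal sketch. It assumes each candidate segment sequence is already summarized as a list of non-negative local costs; the function names and the two example sequences are hypothetical, and the paper's own definition of the local cost is not reproduced here.

```python
import math


def average_cost(local_costs):
    """Mean of the local costs: degradation over the whole utterance."""
    return sum(local_costs) / len(local_costs)


def maximum_cost(local_costs):
    """Largest local cost: the worst single local degradation."""
    return max(local_costs)


def rms_cost(local_costs):
    """Root mean square of the local costs.

    Squaring weights large local costs more heavily than the plain mean,
    so the result lies between the average and the maximum.
    """
    return math.sqrt(sum(c * c for c in local_costs) / len(local_costs))


# Hypothetical local costs for two candidate sequences with the same mean.
even = [0.4, 0.4, 0.4, 0.4]    # several slight degradations
spiky = [0.1, 0.1, 0.1, 1.3]   # one large degradation

for name, costs in (("even", even), ("spiky", spiky)):
    print(name, average_cost(costs), maximum_cost(costs), rms_cost(costs))
```

Under these assumptions the average cost cannot separate the two sequences, while the RMS cost is lower for the sequence made of several slight degradations. This matches the behaviour the abstract reports for RMS-based selection.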
@article{toda_evaluation_2006,
	Author = {Toda, Tomoki and Kawai, Hisashi and Tsuzaki, Minoru and Shikano, Kiyohiro},
	Year = {2006},
	Date-Modified = {2017-04-19 08:04:09 +0000},
	Doi = {10.1016/j.specom.2005.05.011},
	Issn = {0167-6393},
	Journal = {Speech Communication},
	Keywords = {speech synthesis, speech technology, text-to-speech, unit selection},
	Month = jan,
	Number = {1},
	Pages = {45--56},
	Title = {An evaluation of cost functions sensitively capturing local degradation of naturalness for segment selection in concatenative speech synthesis},
	Url = {http://linkinghub.elsevier.com/retrieve/pii/S0167639305001366},
	Volume = {48},
	Abstract = {In this paper, we evaluate various cost functions for selecting a segment sequence in terms of the correspondence between the cost and perceptual scores to the naturalness of synthetic speech. The results demonstrate that the conventional average cost, which shows the degradation of naturalness over the entire synthetic utterance, has better correspondence to the perceptual scores than the maximum cost, which shows the worst local degradation of naturalness. Furthermore, it is shown that root mean square (RMS) cost, which takes into account both the average cost and the maximum cost, has the best correspondence. We also show that the naturalness of synthetic speech can be improved by using the RMS cost for segment selection. Then, we investigate the effects of applying the RMS cost to segment selection in comparison to those of applying the average cost. Experimental results show that in segment selection based on the RMS cost, a larger number of concatenations causing slight local degradation are performed so that concatenations causing greater local degradation are avoided.},
	Bdsk-Url-1 = {http://linkinghub.elsevier.com/retrieve/pii/S0167639305001366},
	Bdsk-Url-2 = {http://dx.doi.org/10.1016/j.specom.2005.05.011}}