Conditional variational autoencoder to improve neural audio synthesis for polyphonic music sound. Lee, S., Kim, M., Shin, S., Lee, D., Jang, I., & Lim, W. 2022.
An enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer is proposed that significantly improves performance and stability over the conventional RAVE model. Deep generative models for audio synthesis have recently been significantly improved. However, the task of modeling raw waveforms remains a difficult problem, especially for audio waveforms and music signals. Recently, the realtime audio variational autoencoder (RAVE) method was developed for high-quality audio waveform synthesis. The RAVE method is based on the variational autoencoder and uses a two-stage training strategy. Unfortunately, the RAVE model is limited in reproducing wide-pitch polyphonic music sound. Therefore, to enhance the reconstruction performance, we adopt pitch activation data as auxiliary information for the RAVE model. To handle the auxiliary information, we propose an enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer. To evaluate the proposed structure, we conducted a listening experiment based on the multiple stimuli with hidden reference and anchor (MUSHRA) test using the MAESTRO dataset. The obtained results indicate that the proposed model significantly improves performance and stability over the conventional RAVE model.
@misc{lee_conditional_2022,
	title = {Conditional variational autoencoder to improve neural audio synthesis for polyphonic music sound},
	url = {https://www.semanticscholar.org/paper/Conditional-variational-autoencoder-to-improve-for-Lee-Kim/4a91af2a5a4759594a92f2ed82763ba31bc945ea},
	abstract = {An enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer is proposed that significantly improves performance and stability over the conventional RAVE model. Deep generative models for audio synthesis have recently been significantly improved. However, the task of modeling raw waveforms remains a difficult problem, especially for audio waveforms and music signals. Recently, the realtime audio variational autoencoder (RAVE) method was developed for high-quality audio waveform synthesis. The RAVE method is based on the variational autoencoder and uses a two-stage training strategy. Unfortunately, the RAVE model is limited in reproducing wide-pitch polyphonic music sound. Therefore, to enhance the reconstruction performance, we adopt pitch activation data as auxiliary information for the RAVE model. To handle the auxiliary information, we propose an enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer. To evaluate the proposed structure, we conducted a listening experiment based on the multiple stimuli with hidden reference and anchor (MUSHRA) test using the MAESTRO dataset. The obtained results indicate that the proposed model significantly improves performance and stability over the conventional RAVE model.},
	language = {en},
	urldate = {2022-11-18},
	author = {Lee, Seokjin and Kim, Minhan and Shin, S. and Lee, Daeho and Jang, I. and Lim, Wootaek},
	year = {2022},
	keywords = {ReadList},
}
