Conditional variational autoencoder to improve neural audio synthesis for polyphonic music sound

Conditional variational autoencoder to improve neural audio synthesis for polyphonic music sound. Lee, S., Kim, M., Shin, S., Lee, D., Jang, I., & Lim, W. 2022.


An enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer is proposed that exhibits a more significant performance and stability improvement than the conventional RAVE model. Deep generative models for audio synthesis have recently been significantly improved. However, the task of modeling raw waveforms remains a difficult problem, especially for audio waveforms and music signals. Recently, the realtime audio variational autoencoder (RAVE) method was developed for high-quality audio waveform synthesis. The RAVE method is based on the variational autoencoder and utilizes a two-stage training strategy. Unfortunately, the RAVE model is limited in reproducing wide-pitch polyphonic music sound. Therefore, to enhance the reconstruction performance, we adopt pitch activation data as auxiliary information for the RAVE model. To handle the auxiliary information, we propose an enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer. To evaluate the proposed structure, we conducted a listening experiment based on multiple stimulus tests with hidden references and an anchor (MUSHRA) with the MAESTRO dataset. The obtained results indicate that the proposed model exhibits a more significant performance and stability improvement than the conventional RAVE model.
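The conditioning idea in the abstract can be illustrated with a minimal sketch. The abstract does not say where the pitch activation enters the network or what sizes are used, so everything below is an assumption made for illustration: the condition is concatenated with the latent vector and mixed by a single fully-connected layer, the pitch activation is a binary 88-key vector, and all names (`reparameterize`, `fully_connected`, `condition_latent`) and dimensions are hypothetical. It is not the authors' implementation.

```python
import math
import random


def reparameterize(mean, log_var, rng):
    """Standard VAE sampling step: z = mean + sigma * eps, eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def fully_connected(x, weights, bias):
    """Plain dense layer y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def condition_latent(z, pitch_activation, weights, bias):
    """Concatenate latent and condition, then mix them with one dense layer.

    ASSUMPTION: concatenation at the latent is one common way to build a
    conditional VAE; the paper may inject the condition elsewhere.
    """
    return fully_connected(z + pitch_activation, weights, bias)


rng = random.Random(0)
LATENT_DIM, NUM_PITCHES = 4, 88  # hypothetical sizes

# Encoder outputs are faked here; a real encoder would produce them from audio.
mean = [0.0] * LATENT_DIM
log_var = [0.0] * LATENT_DIM
z = reparameterize(mean, log_var, rng)

# Binary pitch activation for a C major triad (MIDI 60, 64, 67),
# assuming an 88-key range that starts at MIDI note 21.
pitch_activation = [0.0] * NUM_PITCHES
for midi_note in (60, 64, 67):
    pitch_activation[midi_note - 21] = 1.0

# Randomly initialised layer parameters stand in for trained weights.
in_dim = LATENT_DIM + NUM_PITCHES
weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
           for _ in range(LATENT_DIM)]
bias = [0.0] * LATENT_DIM

z_cond = condition_latent(z, pitch_activation, weights, bias)
```

In this sketch `z_cond` has the same size as `z`, so a decoder written for the unconditioned latent could consume it unchanged; that is a design choice of the example, not a claim about the proposed model.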

@misc{lee_conditional_2022,
	title = {Conditional variational autoencoder to improve neural audio synthesis for polyphonic music sound},
	url = {https://www.semanticscholar.org/paper/Conditional-variational-autoencoder-to-improve-for-Lee-Kim/4a91af2a5a4759594a92f2ed82763ba31bc945ea},
	abstract = {An enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer is proposed that exhibits a more significant performance and stability improvement than the conventional RAVE model. Deep generative models for audio synthesis have recently been significantly improved. However, the task of modeling raw waveforms remains a difficult problem, especially for audio waveforms and music signals. Recently, the realtime audio variational autoencoder (RAVE) method was developed for high-quality audio waveform synthesis. The RAVE method is based on the variational autoencoder and utilizes a two-stage training strategy. Unfortunately, the RAVE model is limited in reproducing wide-pitch polyphonic music sound. Therefore, to enhance the reconstruction performance, we adopt pitch activation data as auxiliary information for the RAVE model. To handle the auxiliary information, we propose an enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer. To evaluate the proposed structure, we conducted a listening experiment based on multiple stimulus tests with hidden references and an anchor (MUSHRA) with the MAESTRO dataset. The obtained results indicate that the proposed model exhibits a more significant performance and stability improvement than the conventional RAVE model.},
	language = {en},
	urldate = {2022-11-18},
	author = {Lee, Seokjin and Kim, Minhan and Shin, S. and Lee, Daeho and Jang, I. and Lim, Wootaek},
	year = {2022},
	keywords = {ReadList},
}

