Conditional variational autoencoder to improve neural audio synthesis for polyphonic music sound. Lee, S., Kim, M., Shin, S., Lee, D., Jang, I., & Lim, W. 2022.
An enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer is proposed that achieves better performance and stability than the conventional RAVE model. Deep generative models for audio synthesis have recently been significantly improved. However, modeling raw waveforms remains a difficult problem, especially for audio waveforms and music signals. Recently, the realtime audio variational autoencoder (RAVE) method was developed for high-quality audio waveform synthesis. The RAVE method is based on the variational autoencoder and uses a two-stage training strategy. Unfortunately, the RAVE model is limited in reproducing polyphonic music with a wide pitch range. Therefore, to enhance the reconstruction performance, we supply pitch activation data to the RAVE model as auxiliary information. To handle the auxiliary information, we propose an enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer. To evaluate the proposed structure, we conducted a listening experiment based on the multiple stimuli with hidden reference and anchor (MUSHRA) test, using the MAESTRO dataset. The results indicate that the proposed model achieves better performance and stability than the conventional RAVE model.
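The conditioning idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say where the fully-connected layer sits or how the condition is injected, so the code below assumes a generic conditional VAE in which a fully-connected layer embeds a pitch activation vector and the embedding is concatenated with both the encoder input and the latent sample. All names (`cvae_latent`, `init_linear`, `params`) and all sizes (88 pitch bins, 16 features, 8 condition dimensions, 4 latent dimensions) are hypothetical, and the weights are random rather than trained.

```python
import math
import random

def linear(x, weights, bias):
    """Fully-connected layer: y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_linear(n_in, n_out, rng):
    """Small random weights and zero bias for an n_in -> n_out layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def cvae_latent(features, pitch_activation, params, rng):
    """Return (decoder_input, mu, logvar) for one frame of a generic conditional VAE."""
    # Assumed fully-connected layer that embeds the pitch activation vector.
    cond = linear(pitch_activation, *params["cond"])
    # The encoder sees the audio features together with the condition embedding.
    enc_in = features + cond
    mu = linear(enc_in, *params["mu"])
    logvar = linear(enc_in, *params["logvar"])
    z = reparameterize(mu, logvar, rng)
    # The decoder input is the latent sample concatenated with the same embedding.
    return z + cond, mu, logvar

rng = random.Random(0)
N_FEAT, N_PITCH, N_COND, N_LATENT = 16, 88, 8, 4  # illustrative sizes only
params = {
    "cond": init_linear(N_PITCH, N_COND, rng),
    "mu": init_linear(N_FEAT + N_COND, N_LATENT, rng),
    "logvar": init_linear(N_FEAT + N_COND, N_LATENT, rng),
}
features = [rng.gauss(0.0, 1.0) for _ in range(N_FEAT)]  # stand-in for encoder features
pitch_activation = [0.0] * N_PITCH
for key in (39, 43, 46):  # three simultaneously active pitch bins
    pitch_activation[key] = 1.0

decoder_input, mu, logvar = cvae_latent(features, pitch_activation, params, rng)
print(len(decoder_input))  # N_LATENT + N_COND
```

In this sketch the last `N_COND` entries of `decoder_input` depend only on the pitch activation, so the decoder receives the pitch information directly instead of having to recover it from the latent sample alone.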
@misc{lee_conditional_2022,
title = {Conditional variational autoencoder to improve neural audio synthesis for polyphonic music sound},
url = {https://www.semanticscholar.org/paper/Conditional-variational-autoencoder-to-improve-for-Lee-Kim/4a91af2a5a4759594a92f2ed82763ba31bc945ea},
abstract = {An enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer is proposed that achieves better performance and stability than the conventional RAVE model. Deep generative models for audio synthesis have recently been significantly improved. However, modeling raw waveforms remains a difficult problem, especially for audio waveforms and music signals. Recently, the realtime audio variational autoencoder (RAVE) method was developed for high-quality audio waveform synthesis. The RAVE method is based on the variational autoencoder and uses a two-stage training strategy. Unfortunately, the RAVE model is limited in reproducing polyphonic music with a wide pitch range. Therefore, to enhance the reconstruction performance, we supply pitch activation data to the RAVE model as auxiliary information. To handle the auxiliary information, we propose an enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer. To evaluate the proposed structure, we conducted a listening experiment based on the multiple stimuli with hidden reference and anchor (MUSHRA) test, using the MAESTRO dataset. The results indicate that the proposed model achieves better performance and stability than the conventional RAVE model.},
language = {en},
urldate = {2022-11-18},
author = {Lee, Seokjin and Kim, Minhan and Shin, S. and Lee, Daeho and Jang, I. and Lim, Wootaek},
year = {2022},
keywords = {ReadList},
}