Revisiting SincNet: An Evaluation of Feature and Network Hyperparameters for Speaker Recognition. Oneață, D., Georgescu, L., Cucu, H., Burileanu, D., & Burileanu, C. In 2020 28th European Signal Processing Conference (EUSIPCO), pages 1-5, Aug, 2020.
@InProceedings{9287794,
  author = {D. Oneață and L. Georgescu and H. Cucu and D. Burileanu and C. Burileanu},
  booktitle = {2020 28th European Signal Processing Conference (EUSIPCO)},
  title = {Revisiting SincNet: An Evaluation of Feature and Network Hyperparameters for Speaker Recognition},
  year = {2020},
  pages = {1-5},
  abstract = {The SincNet architecture [1] was recently introduced as an approach to the speaker recognition task. Its main innovation was the sinc layer—an elegant and lightweight way of extracting features from speech. Despite good performance on multiple datasets, little information was provided on the architectural choices. In this work, we aim to shed some light on the importance of the network topology and various hyperparameters. We replace the original network trunk with a lightweight trunk inspired by residual networks (ResNets) and optimize its hyperparameters. Furthermore, we carry out an extensive study of the sinc layer’s hyperparameters. Our main finding is that the stride and window size of the feature extractor play a crucial role in obtaining good performance. Further experiments on conventional features, such as MFCCs and FBANKs, yield similar conclusions; in fact, by using optimal values for these two hyperparameters, traditional features are able to match the performance of sinc features. Surprisingly, the best results obtained go against conventional wisdom: an analysis window of only a couple of milliseconds and a stride of only a couple of samples are found to give the best results. Our code is available at https://bitbucket.org/doneata/sincnet.},
  keywords = {Technological innovation;Signal processing;Feature extraction;Speaker recognition;Task analysis;Optimization;Residual neural networks;deep learning;speaker recognition;features;hyperparameter optimization},
  doi = {10.23919/Eusipco47968.2020.9287794},
  issn = {2076-1465},
  month = {Aug},
  url = {https://www.eurasip.org/proceedings/eusipco/eusipco2020/pdfs/0000361.pdf},
}