Multi-armed bandits for resource efficient, online optimization of language model pre-training: the use case of dynamic masking. Urteaga, I., Draïdia, M., Lancewicki, T., & Khadivi, S. In Findings of the Association for Computational Linguistics: ACL 2023, pages 10609–10627, Toronto, Canada, July, 2023. Association for Computational Linguistics.
We design and evaluate a Bayesian optimization framework for resource efficient pre-training of Transformer-based language models (TLMs). TLM pre-training requires high computational resources and introduces many unresolved design choices, such as selecting its pre-training hyperparameters. We propose a multi-armed bandit framework for the sequential selection of pre-training hyperparameters, aimed at optimizing language model performance, in a resource efficient manner. We design a Thompson sampling algorithm, with a surrogate Gaussian process reward model of the Masked Language Model (MLM) pre-training objective, for its sequential minimization. Instead of MLM pre-training with fixed masking probabilities, the proposed Gaussian process-based Thompson sampling (GP-TS) accelerates pre-training by sequentially selecting masking hyperparameters that improve performance. We empirically demonstrate how GP-TS pre-trains language models efficiently, i.e., it achieves lower MLM loss in fewer epochs, across a variety of settings. In addition, GP-TS pre-trained TLMs attain competitive downstream performance, while avoiding expensive hyperparameter grid search. GP-TS provides an interactive framework for efficient and optimized TLM pre-training that, by circumventing costly hyperparameter selection, enables substantial computational savings.
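The GP-TS loop described in the abstract can be sketched in a few lines: at each interaction, fit a Gaussian process to the masking probabilities tried so far and their observed MLM losses, draw one function from the GP posterior, and greedily pick the arm that minimizes that draw. The sketch below is a minimal, numpy-only illustration of this idea; the candidate grid, kernel, and the synthetic `mlm_loss` stand-in are assumptions for demonstration, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical grid of candidate masking probabilities (the bandit arms);
# the paper's exact search range is not specified in the abstract.
candidates = np.linspace(0.05, 0.40, 36)

def rbf(a, b, length_scale=0.05):
    """Squared-exponential kernel between two 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)

def mlm_loss(p):
    """Stand-in reward signal: a real run would pre-train for one interaction
    with masking probability p and return the observed MLM loss."""
    return (p - 0.15) ** 2 + 0.005 * rng.standard_normal()

X, y = [], []          # arms pulled and losses observed so far
noise = 1e-4           # assumed observation-noise variance
for t in range(25):
    if X:
        Xa, ya = np.array(X), np.array(y)
        # GP posterior over the loss surface, conditioned on observations
        K = rbf(Xa, Xa) + noise * np.eye(len(Xa))
        Ks = rbf(Xa, candidates)
        mean = Ks.T @ np.linalg.solve(K, ya)
        cov = rbf(candidates, candidates) - Ks.T @ np.linalg.solve(K, Ks)
        # Thompson sampling: draw one posterior function and act greedily on
        # it -- pick the masking probability that minimizes the drawn loss.
        draw = rng.multivariate_normal(mean, cov + 1e-8 * np.eye(len(candidates)))
        p = candidates[np.argmin(draw)]
    else:
        p = rng.choice(candidates)  # first round: no data yet, explore uniformly
    X.append(p)
    y.append(mlm_loss(p))

best_p = X[int(np.argmin(y))]  # best masking probability found so far
```

Because the draw is a random sample rather than the posterior mean, the loop naturally balances exploring uncertain masking probabilities against exploiting ones already observed to give low loss.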
@InProceedings{Urteaga2023,
  author    = {I{\~n}igo Urteaga and Moulay-Za\"idane Dra\"idia and Tomer Lancewicki and Shahram Khadivi},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2023},
  title     = {{Multi-armed bandits for resource efficient, online optimization of language model pre-training: the use case of dynamic masking}},
  year      = {2023},
  address   = {Toronto, Canada},
  month     = jul,
  pages     = {10609--10627},
  publisher = {Association for Computational Linguistics},
  abstract  = {We design and evaluate a Bayesian optimization framework for resource efficient pre-training of Transformer-based language models (TLMs). TLM pre-training requires high computational resources and introduces many unresolved design choices, such as selecting its pre-training hyperparameters. We propose a multi-armed bandit framework for the sequential selection of pre-training hyperparameters, aimed at optimizing language model performance, in a resource efficient manner. We design a Thompson sampling algorithm, with a surrogate Gaussian process reward model of the Masked Language Model (MLM) pre-training objective, for its sequential minimization. Instead of MLM pre-training with fixed masking probabilities, the proposed Gaussian process-based Thompson sampling (GP-TS) accelerates pre-training by sequentially selecting masking hyperparameters that improve performance. We empirically demonstrate how GP-TS pre-trains language models efficiently, i.e., it achieves lower MLM loss in fewer epochs, across a variety of settings. In addition, GP-TS pre-trained TLMs attain competitive downstream performance, while avoiding expensive hyperparameter grid search. GP-TS provides an interactive framework for efficient and optimized TLM pre-training that, by circumventing costly hyperparameter selection, enables substantial computational savings.},
  doi       = {10.18653/v1/2023.findings-acl.675},
  url       = {https://aclanthology.org/2023.findings-acl.675},
}