Multi-armed bandits for resource efficient, online optimization of language model pre-training: the use case of dynamic masking. Urteaga, I., Draïdia, M., Lancewicki, T., & Khadivi, S. In Findings of the Association for Computational Linguistics: ACL 2023, pages 10609–10627, Toronto, Canada, July 2023. Association for Computational Linguistics. Paper: https://aclanthology.org/2023.findings-acl.675. doi: 10.18653/v1/2023.findings-acl.675

Abstract: We design and evaluate a Bayesian optimization framework for resource efficient pre-training of Transformer-based language models (TLMs). TLM pre-training requires high computational resources and introduces many unresolved design choices, such as selecting its pre-training hyperparameters. We propose a multi-armed bandit framework for the sequential selection of pre-training hyperparameters, aimed at optimizing language model performance, in a resource efficient manner. We design a Thompson sampling algorithm, with a surrogate Gaussian process reward model of the Masked Language Model (MLM) pre-training objective, for its sequential minimization. Instead of MLM pre-training with fixed masking probabilities, the proposed Gaussian process-based Thompson sampling (GP-TS) accelerates pre-training by sequentially selecting masking hyperparameters that improve performance. We empirically demonstrate how GP-TS pre-trains language models efficiently, i.e., it achieves lower MLM loss in fewer epochs, across a variety of settings. In addition, GP-TS pre-trained TLMs attain competitive downstream performance, while avoiding expensive hyperparameter grid search. GP-TS provides an interactive framework for efficient and optimized TLM pre-training that, by circumventing costly hyperparameter selection, enables substantial computational savings.
@InProceedings{Urteaga2023,
author = {I{\~n}igo Urteaga and Moulay-Za\"idane Dra\"idia and Tomer Lancewicki and Shahram Khadivi},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2023},
title = {{Multi-armed bandits for resource efficient, online optimization of language model pre-training: the use case of dynamic masking}},
year = {2023},
address = {Toronto, Canada},
month = jul,
pages = {10609--10627},
publisher = {Association for Computational Linguistics},
  abstract  = {We design and evaluate a Bayesian optimization framework for resource efficient pre-training of Transformer-based language models (TLMs). TLM pre-training requires high computational resources and introduces many unresolved design choices, such as selecting its pre-training hyperparameters. We propose a multi-armed bandit framework for the sequential selection of pre-training hyperparameters, aimed at optimizing language model performance, in a resource efficient manner. We design a Thompson sampling algorithm, with a surrogate Gaussian process reward model of the Masked Language Model (MLM) pre-training objective, for its sequential minimization. Instead of MLM pre-training with fixed masking probabilities, the proposed Gaussian process-based Thompson sampling (GP-TS) accelerates pre-training by sequentially selecting masking hyperparameters that improve performance. We empirically demonstrate how GP-TS pre-trains language models efficiently, i.e., it achieves lower MLM loss in fewer epochs, across a variety of settings. In addition, GP-TS pre-trained TLMs attain competitive downstream performance, while avoiding expensive hyperparameter grid search. GP-TS provides an interactive framework for efficient and optimized TLM pre-training that, by circumventing costly hyperparameter selection, enables substantial computational savings.},
doi = {10.18653/v1/2023.findings-acl.675},
url = {https://aclanthology.org/2023.findings-acl.675},
}
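
As a reading aid, below is a minimal, illustrative sketch of the Gaussian process-based Thompson sampling (GP-TS) loop described in the abstract, applied to the choice of masking probability. It is not the authors' implementation: the arm grid, the Matern kernel, the number of interactions, and the synthetic reward stand-in (a proxy for the observed improvement in validation MLM loss) are assumptions made for this example, with scikit-learn's GaussianProcessRegressor serving as the surrogate reward model.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

def run_mlm_interaction(masking_prob):
    # Hypothetical stand-in for one pre-training interaction: train the TLM with
    # this masking probability and return a reward, e.g. the decrease in
    # validation MLM loss. Here, a noisy synthetic curve for illustration only.
    return -(masking_prob - 0.15) ** 2 + 0.01 * rng.standard_normal()

candidate_probs = np.linspace(0.05, 0.40, 36).reshape(-1, 1)  # bandit arms (assumed grid)
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

history_p, history_r = [], []
for t in range(30):  # number of sequential bandit interactions (assumed)
    if history_p:
        # Fit the GP surrogate to all (masking probability, reward) pairs seen so far.
        gp.fit(np.array(history_p), np.array(history_r))
        # Thompson sampling: draw one function from the GP posterior over rewards
        # and act greedily with respect to that sampled function.
        sampled = gp.sample_y(candidate_probs, n_samples=1, random_state=t).ravel()
    else:
        sampled = rng.random(len(candidate_probs))  # uninformed first pick
    p = float(candidate_probs[np.argmax(sampled), 0])
    history_p.append([p])
    history_r.append(run_mlm_interaction(p))  # observe the (noisy) reward

print("masking probabilities tried:", np.round(np.ravel(history_p), 3))

The exploration in this loop comes entirely from acting greedily on a single posterior sample per interaction, which is what keeps the hyperparameter search interactive and cheap relative to a full grid search over masking probabilities.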
{"_id":"djXjMFZ9Z242QpfmN","bibbaseid":"urteaga-draidia-lancewicki-khadivi-multiarmedbanditsforresourceefficientonlineoptimizationoflanguagemodelpretrainingtheusecaseofdynamicmasking-2023","author_short":["Urteaga, I.","Draïdia, M.","Lancewicki, T.","Khadivi, S."],"bibdata":{"bibtype":"inproceedings","type":"inproceedings","author":[{"firstnames":["Iñigo"],"propositions":[],"lastnames":["Urteaga"],"suffixes":[]},{"firstnames":["Moulay-Zaïdane"],"propositions":[],"lastnames":["Draïdia"],"suffixes":[]},{"firstnames":["Tomer"],"propositions":[],"lastnames":["Lancewicki"],"suffixes":[]},{"firstnames":["Shahram"],"propositions":[],"lastnames":["Khadivi"],"suffixes":[]}],"booktitle":"Findings of the Association for Computational Linguistics: ACL 2023","title":"Multi-armed bandits for resource efficient, online optimization of language model pre-training: the use case of dynamic masking","year":"2023","address":"Toronto, Canada","month":"July","pages":"10609–10627","publisher":"Association for Computational Linguistics","abstract":"We design and evaluate a Bayesian optimization framework for resource efficient pre-training of Transformer-based language models (TLMs). TLM pre-training requires high computational resources and introduces many unresolved design choices, such as selecting its pre-training hyperparameters.We propose a multi-armed bandit framework for the sequential selection of pre-training hyperparameters, aimed at optimizing language model performance, in a resource efficient manner. We design a Thompson sampling algorithm, with a surrogate Gaussian process reward model of the Masked Language Model (MLM) pre-training objective, for its sequential minimization. Instead of MLM pre-training with fixed masking probabilities, the proposed Gaussian process-based Thompson sampling (GP-TS) accelerates pre-training by sequentially selecting masking hyperparameters that improve performance.We empirically demonstrate how GP-TS pre-trains language models efficiently, i.e., it achieves lower MLM loss in fewer epochs, across a variety of settings. In addition, GP-TS pre-trained TLMs attain competitive downstream performance, while avoiding expensive hyperparameter grid search. GP-TS provides an interactive framework for efficient and optimized TLM pre-training that, by circumventing costly hyperparameter selection, enables substantial computational savings.","doi":"10.18653/v1/2023.findings-acl.675","url":"https://aclanthology.org/2023.findings-acl.675","bibtex":"@InProceedings{Urteaga2023,\n author = {I{\\~n}igo Urteaga and Moulay-Za\\\"idane Dra\\\"idia and Tomer Lancewicki and Shahram Khadivi},\n booktitle = {Findings of the Association for Computational Linguistics: ACL 2023},\n title = {{Multi-armed bandits for resource efficient, online optimization of language model pre-training: the use case of dynamic masking}},\n year = {2023},\n address = {Toronto, Canada},\n month = jul,\n pages = {10609--10627},\n publisher = {Association for Computational Linguistics},\n abstract = {We design and evaluate a Bayesian optimization framework for resource efficient pre-training of Transformer-based language models (TLMs). TLM pre-training requires high computational resources and introduces many unresolved design choices, such as selecting its pre-training hyperparameters.We propose a multi-armed bandit framework for the sequential selection of pre-training hyperparameters, aimed at optimizing language model performance, in a resource efficient manner. 
We design a Thompson sampling algorithm, with a surrogate Gaussian process reward model of the Masked Language Model (MLM) pre-training objective, for its sequential minimization. Instead of MLM pre-training with fixed masking probabilities, the proposed Gaussian process-based Thompson sampling (GP-TS) accelerates pre-training by sequentially selecting masking hyperparameters that improve performance.We empirically demonstrate how GP-TS pre-trains language models efficiently, i.e., it achieves lower MLM loss in fewer epochs, across a variety of settings. In addition, GP-TS pre-trained TLMs attain competitive downstream performance, while avoiding expensive hyperparameter grid search. GP-TS provides an interactive framework for efficient and optimized TLM pre-training that, by circumventing costly hyperparameter selection, enables substantial computational savings.},\n doi = {10.18653/v1/2023.findings-acl.675},\n url = {https://aclanthology.org/2023.findings-acl.675},\n}\n\n","author_short":["Urteaga, I.","Draïdia, M.","Lancewicki, T.","Khadivi, S."],"key":"Urteaga2023","id":"Urteaga2023","bibbaseid":"urteaga-draidia-lancewicki-khadivi-multiarmedbanditsforresourceefficientonlineoptimizationoflanguagemodelpretrainingtheusecaseofdynamicmasking-2023","role":"author","urls":{"Paper":"https://aclanthology.org/2023.findings-acl.675"},"metadata":{"authorlinks":{}},"html":""},"bibtype":"inproceedings","biburl":"https://iurteaga.github.io/myConferences.bib","dataSources":["c9XBPv8yTw5NucH3m"],"keywords":[],"search_terms":["multi","armed","bandits","resource","efficient","online","optimization","language","model","pre","training","use","case","dynamic","masking","urteaga","draïdia","lancewicki","khadivi"],"title":"Multi-armed bandits for resource efficient, online optimization of language model pre-training: the use case of dynamic masking","year":2023}