Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. arXiv:1901.02860, 2019.


Transformer networks have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. As a solution, we propose a novel neural architecture, Transformer-XL, that enables Transformer to learn dependency beyond a fixed length without disrupting temporal coherence. Concretely, it consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the problem of context fragmentation. As a result, Transformer-XL learns dependency that is about 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformer during evaluation. Additionally, we improve the state-of-the-art (SoTA) results of bpc/perplexity from 1.06 to 0.99 on enwiki8, from 1.13 to 1.08 on text8, from 20.5 to 18.3 on WikiText-103, from 23.7 to 21.8 on One Billion Word, and from 55.3 to 54.5 on Penn Treebank (without finetuning). Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.
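As a rough illustration of the context fragmentation the abstract mentions, the sketch below counts how many earlier tokens each position can attend to directly within a single layer, with and without a cache of the previous segment. The function name `visible_context` and the parameters `seg_len` and `mem_len` are hypothetical, chosen for this example only. It is a simplified counting model, not the authors' implementation: it ignores the relative positional encoding and the fact that stacking layers extends the reachable context further.

```python
def visible_context(num_tokens, seg_len, mem_len=0):
    """Count the earlier tokens each position can attend to in one layer.

    Assumption (illustrative only): the text is cut into consecutive
    segments of seg_len tokens, attention is causal, and mem_len cached
    hidden states from the preceding tokens are prepended to each segment.
    mem_len=0 models a vanilla Transformer with no cross-segment context.
    """
    counts = []
    for pos in range(num_tokens):
        seg_start = (pos // seg_len) * seg_len   # first position of this segment
        within_segment = pos - seg_start         # earlier tokens in the same segment
        from_memory = min(mem_len, seg_start)    # cached tokens before the segment
        counts.append(within_segment + from_memory)
    return counts


vanilla = visible_context(8, seg_len=4)               # no memory
recurrent = visible_context(8, seg_len=4, mem_len=4)  # cache one previous segment

print(vanilla)    # [0, 1, 2, 3, 0, 1, 2, 3]
print(recurrent)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

In the vanilla case the first token of the second segment sees no history at all, which is the fragmentation problem; with a cached segment the visible context keeps growing across the boundary.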

@article{daiTransformerXLAttentiveLanguage2019,
  archivePrefix = {arXiv},
  eprinttype = {arxiv},
  eprint = {1901.02860},
  primaryClass = {cs, stat},
  title = {Transformer-{{XL}}: {{Attentive Language Models Beyond}} a {{Fixed}}-{{Length Context}}},
  url = {http://arxiv.org/abs/1901.02860},
  shorttitle = {Transformer-{{XL}}},
  abstract = {Transformer networks have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. As a solution, we propose a novel neural architecture, Transformer-XL, that enables Transformer to learn dependency beyond a fixed length without disrupting temporal coherence. Concretely, it consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the problem of context fragmentation. As a result, Transformer-XL learns dependency that is about 80\% longer than RNNs and 450\% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformer during evaluation. Additionally, we improve the state-of-the-art (SoTA) results of bpc/perplexity from 1.06 to 0.99 on enwiki8, from 1.13 to 1.08 on text8, from 20.5 to 18.3 on WikiText-103, from 23.7 to 21.8 on One Billion Word, and from 55.3 to 54.5 on Penn Treebank (without finetuning). Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.},
  urldate = {2019-01-30},
  date = {2019-01-09},
  keywords = {Statistics - Machine Learning,Computer Science - Computation and Language,Computer Science - Machine Learning},
  author = {Dai, Zihang and Yang, Zhilin and Yang, Yiming and Carbonell, Jaime and Le, Quoc V. and Salakhutdinov, Ruslan},
  file = {/home/dimitri/Nextcloud/Zotero/storage/SR686JDP/Dai et al. - 2019 - Transformer-XL Attentive Language Models Beyond a.pdf;/home/dimitri/Nextcloud/Zotero/storage/CHWMM3TL/1901.html}
}

