Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Association for Computational Linguistics (ACL), pages 2978–2988, 2019.
Transformers have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture, Transformer-XL, that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwik8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both TensorFlow and PyTorch.
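
The segment-level recurrence mechanism mentioned in the abstract can be summarized briefly: hidden states computed for the previous segment are cached, detached from the computation graph, and prepended as extra context when the current segment is attended over. The sketch below is a minimal single-head toy example illustrating only that idea, not the authors' released TensorFlow/PyTorch code; it omits causal masking, the relative positional encoding, and multi-layer stacking, and names such as toy_attention_layer and mem_len are illustrative assumptions.

# Minimal sketch of segment-level recurrence (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

d_model, mem_len = 16, 8   # toy sizes; the paper uses much larger models and memories

def toy_attention_layer(x, memory, w_q, w_k, w_v):
    """Single-head attention: queries come from the current segment,
    keys/values from the cached memory concatenated with the current segment."""
    context = torch.cat([memory, x], dim=0)          # (cached_len + seg_len, d_model)
    q, k, v = x @ w_q, context @ w_k, context @ w_v
    attn = F.softmax(q @ k.t() / d_model ** 0.5, dim=-1)
    return attn @ v                                  # (seg_len, d_model)

def update_memory(memory, hidden, mem_len):
    """Cache the newest hidden states; detach() stops gradients from
    flowing back into previous segments."""
    return torch.cat([memory, hidden], dim=0)[-mem_len:].detach()

torch.manual_seed(0)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
memory = torch.zeros(0, d_model)                     # empty memory before the first segment

for segment in torch.randn(3, 4, d_model):           # three consecutive segments of length 4
    hidden = toy_attention_layer(segment, memory, w_q, w_k, w_v)
    memory = update_memory(memory, hidden, mem_len)

print(memory.shape)                                  # torch.Size([8, 16]) once the memory is full

The detach() in update_memory is what keeps back-propagation confined to the current segment while still letting attention reach tokens from earlier segments, which is how the recurrence extends the effective context beyond a fixed length.
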
@article{dai2019transformerxl,
 title = {Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context},
 year = {2019},
 pages = {2978--2988},
 url = {https://arxiv.org/abs/1901.02860v3},
 publisher = {Association for Computational Linguistics (ACL)},
 author = {Dai, Zihang and Yang, Zhilin and Yang, Yiming and Carbonell, Jaime and Le, Quoc V. and Salakhutdinov, Ruslan},
 journal = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019)}
}
