Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Association for Computational Linguistics, 2019.
Transformers have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture, Transformer-XL, that enables learning dependencies beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependencies, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependencies that are 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both TensorFlow and PyTorch.
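To make the segment-level recurrence idea from the abstract concrete, below is a minimal PyTorch-style sketch: hidden states from the previous segment are cached with gradients stopped and reused as extra attention context for the current segment. The names (`SegmentRecurrentAttention`, `mem_len`) are illustrative assumptions, not the paper's released code, and the sketch uses standard multi-head attention rather than the paper's relative positional encoding.

```python
import torch
import torch.nn as nn


class SegmentRecurrentAttention(nn.Module):
    """Sketch of segment-level recurrence: attend over [cached memory; current segment]."""

    def __init__(self, d_model: int, n_heads: int, mem_len: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mem_len = mem_len

    def forward(self, x, memory=None):
        # x: (batch, seg_len, d_model); memory: states cached from the
        # previous segment, (batch, mem_len, d_model) or None.
        if memory is not None:
            # Keys/values cover the cached memory plus the current segment;
            # detach() stops gradients from flowing across segment boundaries.
            context = torch.cat([memory.detach(), x], dim=1)
        else:
            context = x
        out, _ = self.attn(query=x, key=context, value=context)
        # New memory: the most recent mem_len hidden states of this segment.
        new_memory = out[:, -self.mem_len:].detach()
        return out, new_memory


# Usage: process a long sequence segment by segment, carrying the memory
# forward so each segment can attend beyond its own fixed length.
layer = SegmentRecurrentAttention(d_model=64, n_heads=4, mem_len=16)
segments = torch.randn(3, 2, 16, 64)  # 3 segments, batch 2, length 16
mem = None
for seg in segments:
    out, mem = layer(seg, mem)
```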
