TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance. Tao, Y., Jia, Z., Ma, R., & Xu, S. Electronics, 10(22):2780, January 2021. Publisher: Multidisciplinary Digital Publishing Institute
@article{tao_trig_2021,
	title = {{TRIG}: {Transformer}-{Based} {Text} {Recognizer} with {Initial} {Embedding} {Guidance}},
	volume = {10},
	copyright = {http://creativecommons.org/licenses/by/3.0/},
	issn = {2079-9292},
	shorttitle = {{TRIG}},
	url = {https://www.mdpi.com/2079-9292/10/22/2780},
	doi = {10.3390/electronics10222780},
	abstract = {Scene text recognition (STR) is an important bridge between images and text, attracting abundant research attention. While convolutional neural networks (CNNs) have achieved remarkable progress in this task, most existing works need an extra context modeling module to help the CNN capture global dependencies, overcoming its inductive bias and strengthening the relationships between text features. Recently, the transformer has been proposed as a promising network for global context modeling via its self-attention mechanism, but one of its main shortcomings, when applied to recognition, is efficiency. We propose a 1-D split to address the challenge of complexity and replace the CNN with a transformer encoder to reduce the need for a context modeling module. Furthermore, recent methods use a frozen initial embedding to guide the decoder in decoding the features to text, leading to a loss of accuracy. We propose a learnable initial embedding, learned from the transformer encoder, that adapts to different input images. Above all, we introduce a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG), composed of three stages (transformation, feature extraction, and prediction). Extensive experiments show that our approach achieves state-of-the-art performance on text recognition benchmarks.},
	language = {en},
	number = {22},
	urldate = {2023-09-29},
	journal = {Electronics},
	author = {Tao, Yue and Jia, Zhiwei and Ma, Runze and Xu, Shugong},
	month = jan,
	year = {2021},
	note = {Publisher: Multidisciplinary Digital Publishing Institute},
	keywords = {1-D split, initial embedding, scene text recognition, self-attention, transformer},
	pages = {2780},
}
