MaskOCR: text recognition with masked encoder-decoder pretraining

MaskOCR: text recognition with masked encoder-decoder pretraining. Lyu, P., Zhang, C., Liu, S., Qiao, M., Xu, Y., Wu, L., Yao, K., Han, J., Ding, E., & Wang, J. 2022. Publisher: arXiv. Version Number: 1.


In this paper, we present a model pretraining technique, named MaskOCR, for text recognition. Our text recognition architecture is an encoder-decoder transformer: the encoder extracts patch-level representations, and the decoder recognizes the text from those representations. Our approach pretrains the encoder and the decoder sequentially. (i) We pretrain the encoder in a self-supervised manner over a large set of unlabeled real text images. We adopt the masked image modeling approach, which has been shown to be effective for general images, expecting that the learned representations take on semantics. (ii) We pretrain the decoder in a supervised manner over a large set of synthesized text images, and we enhance its language modeling capability by randomly masking some of the character-occupied patches of the text image input to the encoder and, accordingly, the corresponding representations input to the decoder. Experiments show that the proposed MaskOCR approach achieves superior results on benchmark datasets that include Chinese and English text images.
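As a rough illustration of the patch masking that masked image modeling relies on, the sketch below splits a text-line image into vertical patches and hides a random subset of them. It is not the authors' implementation: the function names, the 256-pixel width, the 4-pixel patch width, and the 0.5 mask ratio are all assumptions chosen for the example, and the abstract does not state the values used in the paper.

```python
import random


def split_into_patches(width, patch_width):
    """Return (start, end) column ranges for vertical patches of a text-line image.

    Hypothetical helper: assumes the image is cut into equal-width vertical strips.
    """
    if width % patch_width != 0:
        raise ValueError("width must be a multiple of patch_width")
    return [(x, x + patch_width) for x in range(0, width, patch_width)]


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Choose int(num_patches * mask_ratio) patches to hide.

    Returns two sorted index lists: (visible, masked).
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(rng.sample(range(num_patches), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_patches) if i not in masked_set]
    return visible, masked


# Assumed sizes for illustration only: a 256-pixel-wide line image, 4-pixel patches.
patches = split_into_patches(width=256, patch_width=4)
visible, masked = random_patch_mask(len(patches), mask_ratio=0.5, seed=0)
```

In an encoder pretraining setup of this kind, a reconstruction loss would typically be computed on the `masked` patches, with the model seeing only the `visible` ones. The decoder pretraining stage described in (ii) would additionally restrict masking to patches occupied by characters, which this sketch does not model.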

@misc{lyu_maskocr_2022,
	title = {{MaskOCR}: text recognition with masked encoder-decoder pretraining},
	copyright = {arXiv.org perpetual, non-exclusive license},
	shorttitle = {{MaskOCR}},
	url = {https://arxiv.org/abs/2206.00311},
	doi = {10.48550/ARXIV.2206.00311},
	abstract = {In this paper, we present a model pretraining technique, named MaskOCR, for text recognition. Our text recognition architecture is an encoder-decoder transformer: the encoder extracts the patch-level representations, and the decoder recognizes the text from the representations. Our approach pretrains both the encoder and the decoder in a sequential manner. (i) We pretrain the encoder in a self-supervised manner over a large set of unlabeled real text images. We adopt the masked image modeling approach, which shows the effectiveness for general images, expecting that the representations take on semantics. (ii) We pretrain the decoder over a large set of synthesized text images in a supervised manner and enhance the language modeling capability of the decoder by randomly masking some text image patches occupied by characters input to the encoder and accordingly the representations input to the decoder. Experiments show that the proposed MaskOCR approach achieves superior results on the benchmark datasets, including Chinese and English text images.},
	language = {en},
	urldate = {2023-05-11},
	author = {Lyu, Pengyuan and Zhang, Chengquan and Liu, Shanshan and Qiao, Meina and Xu, Yangliu and Wu, Liang and Yao, Kun and Han, Junyu and Ding, Errui and Wang, Jingdong},
	year = {2022},
	note = {Publisher: arXiv
Version Number: 1},
	keywords = {\#nosource, Computer Science - Computer Vision and Pattern Recognition, Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
}
