TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models. Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z., & Wei, F. September 2022. arXiv:2109.10282 [cs]
Text recognition is a long-standing research problem for document digitalization. Existing approaches are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on the printed, handwritten and scene text recognition tasks. The TrOCR models and code are publicly available at https://aka.ms/trocr.
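
The encoder-decoder design described in the abstract (image Transformer for understanding, wordpiece-level text Transformer for generation) maps directly onto an off-the-shelf inference call. Below is a minimal sketch using the Hugging Face transformers integration of TrOCR, which is one distribution channel for the released checkpoints rather than the paper's own code (that lives at https://aka.ms/trocr); the checkpoint name microsoft/trocr-base-handwritten and the input file text_line.png are illustrative assumptions.

from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# The processor bundles the image feature extractor and the wordpiece tokenizer;
# the model is a single encoder-decoder: image Transformer in, text Transformer out.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# TrOCR operates on cropped text-line images, not full pages.
image = Image.open("text_line.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Autoregressive wordpiece generation, then decode the token ids back to a string.
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)

Swapping in a printed- or scene-text variant of the checkpoint (e.g. microsoft/trocr-base-printed) covers the other recognition tasks the abstract reports results on.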
@misc{li_trocr_2022,
	title = {{TrOCR}: {Transformer}-based {Optical} {Character} {Recognition} with {Pre}-trained {Models}},
	shorttitle = {{TrOCR}},
	url = {http://arxiv.org/abs/2109.10282},
	doi = {10.48550/arXiv.2109.10282},
	abstract = {Text recognition is a long-standing research problem for document digitalization. Existing approaches are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on the printed, handwritten and scene text recognition tasks. The TrOCR models and code are publicly available at {\textbackslash}url\{https://aka.ms/trocr\}.},
	urldate = {2023-05-11},
	publisher = {arXiv},
	author = {Li, Minghao and Lv, Tengchao and Chen, Jingye and Cui, Lei and Lu, Yijuan and Florencio, Dinei and Zhang, Cha and Li, Zhoujun and Wei, Furu},
	month = sep,
	year = {2022},
	note = {arXiv:2109.10282 [cs]},
	keywords = {Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Remember},
}
