Masked-attention Mask Transformer for Universal Image Segmentation. Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., & Girdhar, R. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1280–1289, June 2022. ISSN: 2575-7075
@inproceedings{cheng_masked-attention_2022,
	title = {Masked-attention {Mask} {Transformer} for {Universal} {Image} {Segmentation}},
	doi = {10.1109/CVPR52688.2022.00135},
	abstract = {Image segmentation groups pixels with different semantics, e.g., category or instance membership. Each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).},
	language = {en},
	booktitle = {2022 {IEEE}/{CVF} {Conference} on {Computer} {Vision} and {Pattern} {Recognition} ({CVPR})},
	author = {Cheng, Bowen and Misra, Ishan and Schwing, Alexander G. and Kirillov, Alexander and Girdhar, Rohit},
	month = jun,
	year = {2022},
	note = {ISSN: 2575-7075},
	keywords = {Attention, Segmentation, Transformer, Vision, Computational modeling, Computer architecture, Feature extraction, Image segmentation, Recognition: detection, Semantics, Shape, Transformers, categorization, grouping and shape analysis, retrieval},
	pages = {1280--1289},
}
