MOT-DETR: 3D Single Shot Detection and Tracking with Transformers to build 3D representations for Agro-Food Robots. Rapado-Rincon, D., Nap, H., Smolenova, K., van Henten, E. J., & Kootstra, G. November 27, 2023.
Paper: https://bibbase.org/service/mendeley/bfbbf840-4c42-3914-a463-19024f50b30c/file/444ea196-4a19-5d49-25f0-9f52351e895a/231115674.pdf.pdf
Website: http://arxiv.org/abs/2311.15674

Abstract: In the current demand for automation in the agro-food industry, accurately detecting and localizing relevant objects in 3D is essential for successful robotic operations. However, this is a challenge due to the presence of occlusions. Multi-view perception approaches allow robots to overcome occlusions, but a tracking component is needed to associate the objects detected by the robot over multiple viewpoints. Most multi-object tracking (MOT) algorithms are designed for high frame rate sequences and struggle with the occlusions generated by robots' motions and 3D environments. In this paper, we introduce MOT-DETR, a novel approach to detect and track objects in 3D over time using a combination of convolutional networks and transformers. Our method processes 2D and 3D data, and employs a transformer architecture to perform data fusion. We show that MOT-DETR outperforms state-of-the-art multi-object tracking methods. Furthermore, we show that MOT-DETR can leverage 3D data to deal with long-term occlusions and large frame-to-frame distances better than state-of-the-art methods. Finally, we show how our method is resilient to camera pose noise that can affect the accuracy of point clouds. The implementation of MOT-DETR can be found here: https://github.com/drapado/mot-detr
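The abstract describes the architecture only at a high level: a convolutional backbone for 2D images, a branch for 3D point clouds, and a transformer that fuses both and emits detections with track identities in a single shot. The sketch below is purely illustrative and is not the authors' implementation (see the linked repository for that); the module sizes, the PointNet-style point encoder, the query count, and the head names are all assumptions. It shows the general DETR-style pattern the abstract describes: fused 2D/3D tokens attended to by learned object queries, with each query yielding a class, a 3D position, and a track embedding.

# Illustrative DETR-style 2D/3D fusion sketch (PyTorch). NOT the authors'
# MOT-DETR implementation; all shapes and module choices are assumptions.
import torch
import torch.nn as nn

class FusionDETRSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=20, num_classes=2):
        super().__init__()
        # Small CNN standing in for the paper's convolutional 2D backbone.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # PointNet-like per-point MLP for the 3D (point cloud) branch.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, d_model),
        )
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.queries = nn.Embedding(num_queries, d_model)
        # Per-query heads: class logits (+1 for "no object"), a 3D center,
        # and a track embedding for associating objects across viewpoints.
        self.cls_head = nn.Linear(d_model, num_classes + 1)
        self.box_head = nn.Linear(d_model, 3)
        self.track_head = nn.Linear(d_model, 64)

    def forward(self, image, points):
        # image: (B, 3, H, W); points: (B, N, 3)
        tokens2d = self.cnn(image).flatten(2).transpose(1, 2)  # (B, HW', C)
        tokens3d = self.point_mlp(points)                      # (B, N, C)
        memory = torch.cat([tokens2d, tokens3d], dim=1)        # fused 2D+3D tokens
        q = self.queries.weight.unsqueeze(0).expand(image.size(0), -1, -1)
        hs = self.decoder(q, memory)                           # (B, Q, C)
        return self.cls_head(hs), self.box_head(hs), self.track_head(hs)

model = FusionDETRSketch()
logits, centers, embeds = model(torch.randn(1, 3, 128, 128),
                                torch.randn(1, 1024, 3))
print(logits.shape, centers.shape, embeds.shape)  # (1,20,3) (1,20,3) (1,20,64)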
@article{rapadorincon2023motdetr,
  title    = {MOT-DETR: 3D Single Shot Detection and Tracking with Transformers to build 3D representations for Agro-Food Robots},
  author   = {Rapado-Rincon, David and Nap, Henk and Smolenova, Katarina and van Henten, Eldert J. and Kootstra, Gert},
  journal  = {arXiv preprint arXiv:2311.15674},
  year     = {2023},
  month    = {11},
  day      = {27},
  url      = {http://arxiv.org/abs/2311.15674},
  abstract = {In the current demand for automation in the agro-food industry, accurately detecting and localizing relevant objects in 3D is essential for successful robotic operations. However, this is a challenge due to the presence of occlusions. Multi-view perception approaches allow robots to overcome occlusions, but a tracking component is needed to associate the objects detected by the robot over multiple viewpoints. Most multi-object tracking (MOT) algorithms are designed for high frame rate sequences and struggle with the occlusions generated by robots' motions and 3D environments. In this paper, we introduce MOT-DETR, a novel approach to detect and track objects in 3D over time using a combination of convolutional networks and transformers. Our method processes 2D and 3D data, and employs a transformer architecture to perform data fusion. We show that MOT-DETR outperforms state-of-the-art multi-object tracking methods. Furthermore, we show that MOT-DETR can leverage 3D data to deal with long-term occlusions and large frame-to-frame distances better than state-of-the-art methods. Finally, we show how our method is resilient to camera pose noise that can affect the accuracy of point clouds. The implementation of MOT-DETR can be found here: https://github.com/drapado/mot-detr}
}
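The abstract also stresses that multi-view perception needs a tracking component to associate detections across viewpoints. As a hedged, generic illustration of that association step (a standard MOT recipe, not a claim about MOT-DETR's internal mechanism), detections can be matched to existing tracks by maximizing embedding similarity with the Hungarian algorithm:

# Generic cross-viewpoint association via Hungarian matching on
# per-object embeddings. Illustrative only; not MOT-DETR's mechanism.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_embeds, det_embeds, thresh=0.5):
    """Return (track_idx, det_idx) pairs with cosine similarity >= thresh."""
    t = track_embeds / np.linalg.norm(track_embeds, axis=1, keepdims=True)
    d = det_embeds / np.linalg.norm(det_embeds, axis=1, keepdims=True)
    sim = t @ d.T                             # (num_tracks, num_dets)
    rows, cols = linear_sum_assignment(-sim)  # maximize total similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= thresh]

# Unmatched detections would start new tracks; unmatched tracks persist,
# which is one way long-term occlusions can be bridged.
matches = associate(np.random.randn(5, 64), np.random.randn(6, 64))
print(matches)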
{"_id":"aH5q53Y3CrmmMecmy","bibbaseid":"rapadorincon-nap-smolenova-vanhenten-kootstra-motdetr3dsingleshotdetectionandtrackingwithtransformerstobuild3drepresentationsforagrofoodrobots-2023","author_short":["Rapado-Rincon, D.","Nap, H.","Smolenova, K.","van Henten, E., J.","Kootstra, G."],"bibdata":{"title":"MOT-DETR: 3D Single Shot Detection and Tracking with Transformers to build 3D representations for Agro-Food Robots","type":"article","year":"2023","websites":"http://arxiv.org/abs/2311.15674","month":"11","day":"27","id":"f51b55fa-1ca2-35eb-8255-dee4f4094980","created":"2024-01-12T12:17:14.270Z","file_attached":"true","profile_id":"c3c41a69-4b45-352f-9232-4d3281e18730","group_id":"5ec9cc91-a5d6-3de5-82f3-3ef3d98a89c1","last_modified":"2024-01-12T12:17:14.778Z","read":false,"starred":false,"authored":false,"confirmed":false,"hidden":false,"folder_uuids":"2bfb8d91-9fac-46f0-bd0c-93235d01dbed","private_publication":false,"abstract":"In the current demand for automation in the agro-food industry, accurately detecting and localizing relevant objects in 3D is essential for successful robotic operations. However, this is a challenge due the presence of occlusions. Multi-view perception approaches allow robots to overcome occlusions, but a tracking component is needed to associate the objects detected by the robot over multiple viewpoints. Most multi-object tracking (MOT) algorithms are designed for high frame rate sequences and struggle with the occlusions generated by robots' motions and 3D environments. In this paper, we introduce MOT-DETR, a novel approach to detect and track objects in 3D over time using a combination of convolutional networks and transformers. Our method processes 2D and 3D data, and employs a transformer architecture to perform data fusion. We show that MOT-DETR outperforms state-of-the-art multi-object tracking methods. Furthermore, we prove that MOT-DETR can leverage 3D data to deal with long-term occlusions and large frame-to-frame distances better than state-of-the-art methods. Finally, we show how our method is resilient to camera pose noise that can affect the accuracy of point clouds. The implementation of MOT-DETR can be found here: https://github.com/drapado/mot-detr","bibtype":"article","author":"Rapado-Rincon, David and Nap, Henk and Smolenova, Katarina and van Henten, Eldert J. and Kootstra, Gert","bibtex":"@article{\n title = {MOT-DETR: 3D Single Shot Detection and Tracking with Transformers to build 3D representations for Agro-Food Robots},\n type = {article},\n year = {2023},\n websites = {http://arxiv.org/abs/2311.15674},\n month = {11},\n day = {27},\n id = {f51b55fa-1ca2-35eb-8255-dee4f4094980},\n created = {2024-01-12T12:17:14.270Z},\n file_attached = {true},\n profile_id = {c3c41a69-4b45-352f-9232-4d3281e18730},\n group_id = {5ec9cc91-a5d6-3de5-82f3-3ef3d98a89c1},\n last_modified = {2024-01-12T12:17:14.778Z},\n read = {false},\n starred = {false},\n authored = {false},\n confirmed = {false},\n hidden = {false},\n folder_uuids = {2bfb8d91-9fac-46f0-bd0c-93235d01dbed},\n private_publication = {false},\n abstract = {In the current demand for automation in the agro-food industry, accurately detecting and localizing relevant objects in 3D is essential for successful robotic operations. However, this is a challenge due the presence of occlusions. Multi-view perception approaches allow robots to overcome occlusions, but a tracking component is needed to associate the objects detected by the robot over multiple viewpoints. 
Most multi-object tracking (MOT) algorithms are designed for high frame rate sequences and struggle with the occlusions generated by robots' motions and 3D environments. In this paper, we introduce MOT-DETR, a novel approach to detect and track objects in 3D over time using a combination of convolutional networks and transformers. Our method processes 2D and 3D data, and employs a transformer architecture to perform data fusion. We show that MOT-DETR outperforms state-of-the-art multi-object tracking methods. Furthermore, we prove that MOT-DETR can leverage 3D data to deal with long-term occlusions and large frame-to-frame distances better than state-of-the-art methods. Finally, we show how our method is resilient to camera pose noise that can affect the accuracy of point clouds. The implementation of MOT-DETR can be found here: https://github.com/drapado/mot-detr},\n bibtype = {article},\n author = {Rapado-Rincon, David and Nap, Henk and Smolenova, Katarina and van Henten, Eldert J. and Kootstra, Gert}\n}","author_short":["Rapado-Rincon, D.","Nap, H.","Smolenova, K.","van Henten, E., J.","Kootstra, G."],"urls":{"Paper":"https://bibbase.org/service/mendeley/bfbbf840-4c42-3914-a463-19024f50b30c/file/444ea196-4a19-5d49-25f0-9f52351e895a/231115674.pdf.pdf","Website":"http://arxiv.org/abs/2311.15674"},"biburl":"https://bibbase.org/service/mendeley/bfbbf840-4c42-3914-a463-19024f50b30c","bibbaseid":"rapadorincon-nap-smolenova-vanhenten-kootstra-motdetr3dsingleshotdetectionandtrackingwithtransformerstobuild3drepresentationsforagrofoodrobots-2023","role":"author","metadata":{"authorlinks":{}},"downloads":0},"bibtype":"article","biburl":"https://bibbase.org/service/mendeley/bfbbf840-4c42-3914-a463-19024f50b30c","dataSources":["2252seNhipfTmjEBQ"],"keywords":[],"search_terms":["mot","detr","single","shot","detection","tracking","transformers","build","representations","agro","food","robots","rapado-rincon","nap","smolenova","van henten","kootstra"],"title":"MOT-DETR: 3D Single Shot Detection and Tracking with Transformers to build 3D representations for Agro-Food Robots","year":2023}