Scaling Open-Vocabulary Object Detection

Minderer, Matthias; Gritsenko, Alexey; Houlsby, Neil

arXiv:2306.09683 (cs)

[Submitted on 16 Jun 2023 (v1), last revised 22 May 2024 (this version, v3)]

Abstract: Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding further large improvements: with an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.
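The self-training step described in the abstract (an existing detector proposes pseudo-boxes on image-text pairs, which are then filtered) can be illustrated with a minimal sketch. This is not the authors' implementation: the `PseudoBox` type, the `caption_ngrams` label space, the `toy_detector` stand-in, and the `min_score=0.3` threshold are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class PseudoBox:
    label: str                              # text query that produced the box
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized
    score: float                            # detector confidence in [0, 1]


def caption_ngrams(caption: str, max_n: int = 2) -> List[str]:
    """Assumed label space: every word n-gram of the caption, up to max_n words."""
    words = caption.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


Detector = Callable[[object, Sequence[str]], List[PseudoBox]]


def pseudo_annotate(image: object, caption: str, detector: Detector,
                    min_score: float = 0.3) -> List[PseudoBox]:
    """Query the detector with caption-derived labels and keep confident boxes.

    min_score is a hypothetical filtering threshold, not a value from the abstract.
    """
    queries = caption_ngrams(caption)
    return [b for b in detector(image, queries) if b.score >= min_score]


def toy_detector(image: object, queries: Sequence[str]) -> List[PseudoBox]:
    """Hypothetical stand-in for a real detector: only confident about 'dog'."""
    return [
        PseudoBox(q, (0.0, 0.0, 1.0, 1.0), 0.9 if q == "dog" else 0.1)
        for q in queries
    ]


kept = pseudo_annotate(image=None, caption="A dog runs", detector=toy_detector)
print([b.label for b in kept])
```

In a real pipeline the surviving boxes would be written out as training targets for the next detector; the threshold trades pseudo-label precision against the amount of data kept, which is the filtering challenge the abstract names.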

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2306.09683 [cs.CV]
	(or arXiv:2306.09683v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.09683

Submission history

From: Alexey Gritsenko
[v1] Fri, 16 Jun 2023 08:27:46 UTC (1,910 KB)
[v2] Thu, 20 Jul 2023 12:23:12 UTC (1,910 KB)
[v3] Wed, 22 May 2024 13:00:02 UTC (1,910 KB)
