DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning

Chen, Zhuo; Huang, Yufeng; Chen, Jiaoyan; Geng, Yuxia; Zhang, Wen; Fang, Yin; Pan, Jeff Z.; Chen, Huajun

arXiv:2207.01328 (cs)

[Submitted on 4 Jul 2022 (v1), last revised 16 Feb 2023 (this version, v4)]

Abstract: Zero-shot learning (ZSL) aims to predict unseen classes whose samples have never appeared during training. Attributes, which are annotations of class-level visual characteristics, are among the most effective and widely used forms of semantic information for zero-shot image classification. However, current methods often fail to discriminate subtle visual distinctions between images because of not only the shortage of fine-grained annotations but also attribute imbalance and co-occurrence. In this paper, we present a transformer-based end-to-end ZSL method named DUET, which integrates latent semantic knowledge from pre-trained language models (PLMs) via a self-supervised multi-modal learning paradigm. Specifically, we (1) develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from images; (2) apply an attribute-level contrastive learning strategy to further enhance the model's discrimination of fine-grained visual characteristics against attribute co-occurrence and imbalance; and (3) propose a multi-task learning policy for considering multi-model objectives. We find that DUET achieves state-of-the-art performance on three standard ZSL benchmarks and a knowledge-graph-equipped ZSL benchmark. Its components are effective and its predictions are interpretable.

Comments:	AAAI 2023 (Oral). Repository: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2207.01328 [cs.CV]
	(or arXiv:2207.01328v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2207.01328

Submission history

From: Zhuo Chen
[v1] Mon, 4 Jul 2022 11:12:12 UTC (7,101 KB)
[v2] Mon, 15 Aug 2022 06:51:56 UTC (13,447 KB)
[v3] Mon, 28 Nov 2022 13:47:58 UTC (13,432 KB)
[v4] Thu, 16 Feb 2023 13:04:43 UTC (13,432 KB)
