The IDEFICS model was proposed in [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://arxiv.org/abs/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh.
The abstract from the paper is the following:
Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks that require reasoning over one or multiple images to generate a text. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train an 80 billion parameter vision and language model on the dataset and obtain competitive performance on various multimodal benchmarks. We release the code to reproduce the dataset along with the dataset itself.
This model was contributed by [HuggingFaceM4](https://huggingface.co/HuggingFaceM4). The original code is not yet publicly available (TODO: add a link once it is).
The IDEFICS modeling code in Transformers is for fine-tuning and running inference with the pretrained IDEFICS models; a sketch of what inference can look like follows below.
To train a new IDEFICS model from scratch, use the m4 codebase (a link will be provided once it's made public).
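As a rough sketch of how inference can look (the checkpoint name, the placeholder image URL, and the chat-style prompt below are illustrative assumptions; adapt them to your setup):

```python
import torch

from transformers import AutoProcessor, IdeficsForVisionText2Text

checkpoint = "HuggingFaceM4/idefics-9b-instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16
).to(device)

# Each prompt is a list that freely interleaves text with images
# (image URLs or PIL images), mirroring the interleaved image-text
# documents the model was trained on.
prompts = [
    [
        "User: What do you see in this image?",
        "https://example.com/image.jpg",  # placeholder: replace with a real image URL
        "<end_of_utterance>",
        "\nAssistant:",
    ],
]

inputs = processor(prompts, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```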
## IdeficsConfig

[[autodoc]] IdeficsConfig

## IdeficsModel

[[autodoc]] IdeficsModel
    - forward

## IdeficsForVisionText2Text

[[autodoc]] IdeficsForVisionText2Text
    - forward

## TFIdeficsModel

[[autodoc]] TFIdeficsModel
    - call

## TFIdeficsForVisionText2Text

[[autodoc]] TFIdeficsForVisionText2Text
    - call

## IdeficsImageProcessor

[[autodoc]] IdeficsImageProcessor
    - preprocess

## IdeficsProcessor

[[autodoc]] IdeficsProcessor
    - __call__